4 An example

It is best to illustrate the utility of the command line with a practical example of the kind that comes up in everyday data science work. Imagine that we have several thousand files containing data about cats, all in .csv format. Imagine also that the same folder contains data about hundreds of other animals, and our task is to take just the cat data, combine all cat-related files into one, and have a look at the first rows. Without the command line, it is hard to imagine how we would even approach this. Download the data from this URL.

Additional information: CSV

This is probably the most widely used data format, and the first you will encounter in your data science career, so it is all the more important to know how to work with it. Pro tip: don't open CSV files with Excel; it can silently corrupt the data (for example, by reformatting dates), and it is harder to use than you might think.
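
Under the hood, a CSV file is just plain text: one record per line, with fields separated by commas. A tiny, made-up example (the column names and values here are purely illustrative):

name,age,weight
Whiskers,3,4.2
Tom,5,5.1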

The first thing that we can use is the * character. In the command line, and in programming in general, this is a “wildcard”: it matches any sequence of characters.

This symbol is used in much the same way in other languages. For example, in SQL statements: "SELECT * FROM table", or in Python: from pandas import *.
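
Before running a command on a wildcard pattern, it can be useful to see what it expands to. A minimal check, assuming we are in the folder that holds the animal files:

# print the file names that the pattern matches
echo *cats-data.csv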

# create the destination folder if it does not exist yet
mkdir -p just_cats
# move all files ending in cats-data.csv to that folder
mv *cats-data.csv just_cats/

In this case, we select file names that end with cats-data.csv, without caring what comes before that in the file name. For example, this will match both big-cats-data.csv and small-cats-data.csv. We then move those files to a new folder called just_cats.

# switch to folder
cd just_cats
# combine all files into one
cat *csv > all_data.csv

This is another step which demonstrates the power of the command line, and it illustrates one of its most fundamental aspects: combining commands. Here we use cat to print the contents of every matching file, and > to redirect that output into a new file, so all_data.csv ends up containing the data from all the others. The same idea extends to pipelines: a sequence of commands, each doing one thing, where the output of each command becomes the input of the next.
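
To make the pipeline idea concrete, here is a minimal sketch that chains two commands with the | operator: cat prints the combined file, and wc -l counts the lines it receives.

# count the number of rows in the combined file
cat all_data.csv | wc -l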

And finally, a common command to inspect the result:

# have a look at the first rows (ten by default)
head all_data.csv
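
If you want a different number of rows, head accepts the -n flag; for example, to see only the first three:

# show only the first three rows
head -n 3 all_data.csv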