Modern data scientists often work in a variety of computing environments, such as virtual machines, which may not include a graphical user interface (GUI) like the ones in Windows or Mac OS. Using command-line tools productively is one of the most useful skills that an aspiring data scientist can master. You'll quickly realise the potential for automation and productivity that such tools can bring to your daily work.

Learning objectives

This module will introduce you to some fundamental command line methods and tools. By the end of this module you will:

  • Understand the advantages of using the command-line compared to GUIs. You’ll also learn about cases where working with the command line is the only productive option.
  • Learn how to navigate your filesystem and conduct basic operations from the command line. An additional topic is how to combine those commands together to form a pipeline, by using shell scripts.
  • Discover and learn how to use some popular command-line tools for data science.

Terminal Interfaces

For this section we’ll assume that you are using a Unix-based command line.

For *nix system users

The *nix name refers to Unix-based operating systems. This includes Linux and Mac OS. Use the Terminal application to access your command line. To make the interface a bit nicer, we recommend you use oh-my-zsh.

For Windows users

If you’re using Windows, please install an emulator. Git Bash comes with an installation of Git, which we’ll get to shortly. Alternatively, you may prefer the Windows Subsystem for Linux (WSL).

You may want to consider replacements for the default terminal on your machine. For Mac OS see iTerm2 and for Windows see Windows Terminal. These tools offer many features, but probably the most important one is the ability to split your screen, instead of opening a new window for everything you want to do. Alternatives with a steeper learning curve, but ones that also work on a remote machine without a GUI, are tmux and screen - the so-called terminal multiplexers.

A common issue when starting out with the command line is losing control over a terminal window. If you start a long-running process, the only way (by default) to continue working while keeping that process alive is to open a new terminal window. A better approach is to use one of the terminal multiplexers mentioned above, or a terminal client that lets you split the screen.

Look at the screenshot below to see what a very customized command-line looks like:

You’ll encounter several terms that are often confused, e.g. shell, terminal and bash. The shell is the program that exposes the inner workings of the operating system to the user. The terminal is the application that hosts the shell interface. Bash, zsh and others are different shells, each with its own scripting language.

Introduction and motivating factors

The command line is often seen in hacker movies, where it looks like magic. After an initial overview of its features, another common reaction is that it is overkill. “Why would I need to understand how to work with this, just to make a folder, when I can do the same with the mouse in the File Explorer or Finder?” This is a legitimate question, but it can be dispelled quickly once you realize the enormous number of tools available on the command line and their automation potential.

One of the most basic things you can do from the command-line is to launch applications. For example, let’s say we want to execute Python commands in interactive mode (aka REPL, read-eval-print loop). For this, all you need to do is execute the following command:

python

You’ll be greeted with the following prompt (notice how it changed to >>>):

Python 3.8.10 (default, Jun  2 2021, 10:49:15)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

This is a good time to explain how to exit a running process. The ESC key normally won’t work. The two common options are CTRL + C (which sends an interrupt) and CTRL + D (which sends end-of-file, closing many REPLs).

Now you can play around with Python! Especially when you’re setting up Python with a virtual environment, you might want to check whether the command we just used runs the correct Python: the newly created one in the virtual environment, and not the system Python. For this we have the which command, which prints the path of the executable that will be used:

which python

If you use a system Python, you’ll get something like /usr/bin/python3 as a result. We just mentioned the concept of the PATH, but what is that? The easiest definition is that it is an environment variable that tells the shell where to look for executable applications. Environment variables are very similar to variables in your programming language of choice, with the idea that they are always available (more similar to “global” variables). The command for printing things on the command line is echo. We can use this to have a look at the PATH variable. Note that when you read a shell variable, you prefix it with the $ symbol:

echo $PATH

This should output a single, long, colon-separated list of directories. If you read through it, you should recognize the locations of some applications that you can run. A common thing that will happen when you install software is that you will need to manually append (add to) the $PATH variable the location of your newly installed software. This is often done through configuration dotfiles, such as .bashrc and .zshrc.
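As a minimal sketch (the directory name here is hypothetical), appending a location to PATH for the current session looks like this; to make it permanent, you would put the same export line into .bashrc or .zshrc:

```shell
# Prepend a (hypothetical) tools directory to PATH for this session only
export PATH="$HOME/my-tools/bin:$PATH"

# Confirm it is now the first entry on PATH
echo "$PATH" | tr ':' '\n' | head -n 1
```

Directories earlier in the list win when two contain an executable with the same name, which is exactly how virtual environments shadow the system Python.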

After this theoretical section, let’s get back to practice. A common use case for launching applications from the command-line is when you are working on a new project, stored in a new folder, and you don’t want to waste time navigating your file explorer to open it with your IDE. Assuming you are using VSCode, you can do the following:

code .

Don’t miss the . symbol, we’ll explain it later!

The command-line is often used to install software. This is especially common on non-Windows operating systems, and for software used for development or research purposes. For example, here are typical commands used to install software on Linux and Mac OS:

# with apt-get
sudo apt-get install python

# With Homebrew
brew install python

Here we see a few key elements that we’ll keep seeing. First, sudo, short for “super user do” (or “substitute user do”), lets you run commands as the system administrator (aka sysadmin, or super user). Second, apt-get accesses a repository of packages, and the install subcommand asks it to install the specific package listed, in this case the Python programming language.

In the second command, we use Homebrew, a popular alternative to apt-get on MacOS and Linux, to accomplish the same task.

Install Homebrew using:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

We can test it by installing one fun package, called cowsay:

sudo apt-get install cowsay   # or: brew install cowsay on Mac OS

This package has just one function. You can pass it a string, and it will return you a fun output:

cowsay hello
 _______
< hello >
 -------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Of course, not very useful, but illustrates the point!

If you are on Windows, Chocolatey is a package manager that works like apt-get. For example:

choco install git

Follow the installation instructions here.

Cmder is a command-line emulator that allows you to use the same Unix commands that are common on Mac and Linux systems.

An example

It is best to illustrate the utility of the command line with a practical example from everyday data science work. Imagine that we have several thousand .csv files containing data about cats, sitting in a folder that also contains data about hundreds of other animals. Our task is to take just the cat data, combine all cat-related files into one, and have a look at the first rows. Doing this by hand, file by file, is hard to even imagine; with the command line it takes a few commands. Download the data from this URL.

Additional information: CSV

This is probably the most widely used data format, and the first you will encounter in your data science career, so it is all the more important to know how to work with it. Pro tip: don’t open CSV files with Excel; it can corrupt the file, and it is harder to use than you’d think.

The first thing we can use is the * character. On the command line, and in programming in general, this is a “wildcard”: it matches any sequence of characters.

This symbol is used in much the same way in other languages. For example, in SQL statements: "SELECT * FROM table", or in Python: "from pandas import *".

# move all files containing a string to a folder
mv *cats-data.csv just_cats/

In this case, we select file names that contain the substring cats-data.csv inside, without caring what comes before that in the file name. For example, this will take both big-cats-data.csv and also small-cats-data.csv. Then we move those to a new folder, called just_cats.

# switch to folder
cd just_cats
# combine all files into one
cat *csv > all_data.csv

This is another step which demonstrates the power of the command line, and also illustrates one of its most fundamental aspects - combining commands together. This allows the user to create a sequence of commands (a pipeline) that completes a task. In this case we use cat to print the full contents of the files, and > to redirect that output into another file, which will then contain all the others combined. You can imagine a pipeline as split into components, each consisting of a single command; the output of each command becomes the input of the next one, and so on.
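If you want to reproduce the whole cats example without downloading anything, here is a self-contained sketch with made-up file names:

```shell
# Create a scratch folder with a few fake animal files
mkdir -p animals && cd animals
echo "name,age" > big-cats-data.csv
echo "name,age" > small-cats-data.csv
echo "name,age" > dogs-data.csv

# Move only the cat files into their own folder and combine them
mkdir just_cats
mv *cats-data.csv just_cats/
cd just_cats
cat *.csv > all_data.csv

wc -l all_data.csv    # 2 lines: one header row from each cat file
```

Note that dogs-data.csv stays behind, because it does not match the *cats-data.csv wildcard.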

And finally a more common command:

# have a look at the first several rows
head all_data.csv
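head prints the first 10 lines by default; you can try it safely on a throwaway file generated with seq:

```shell
seq 1 100 > numbers.txt   # a file with the numbers 1..100, one per line
head numbers.txt          # shows lines 1 through 10
head -n 3 numbers.txt     # or limit the output explicitly
```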

Fundamental principles

Chaining and piping

At this point we should make a distinction between the three operators we can use to combine commands: >, >> and |. Let’s cover them one by one. The first operator, >, completely overwrites a file with the output of a command.

touch test_file
echo "some content" > test_file
cat test_file

This should result in:

some content

But if we do:

cat "new content" > test_file

Can you guess the result? Yes - the file now contains only "new content"; the old contents were overwritten. So how is this different from using >>? We can use the same example to illustrate. Instead of overwriting the contents of test_file, we can try the other operator:

cat "new content" >> test_file

In this case, the content from the previous chunk will be appended to the old one in the file:

cat test_file
some content
new content

So how about the | operator? This is called a “pipe,” and is used to construct pipelines. Let’s go through a specific example. A common command-line procedure for data science work is creating a virtual environment for the Python setup (this is covered in another module). One thing we commonly check is which packages are installed in that specific environment. We can use the pip command to do that:

pip freeze

If you have some packages installed, you’ll get an output like this:

ptyprocess==0.7.0
pyasn1==0.4.2
pyasn1-modules==0.2.1

This shows the package names together with their versions. This list can become very big as a data science project matures, since we normally start to use more and more different packages. How do we go about searching for a specific package in this? We can use the popular grep command line tool, together with the pipe | operator:

pip freeze | grep ptyprocess
ptyprocess==0.7.0

And voila, it shows that the package is installed! We have fed the output of the first command to the input of the second one, creating a mini pipeline with a concrete result. Have a look at the diagram below for a visual overview of this process:
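You can simulate the same pipeline without pip by feeding a hand-written package list through grep and then counting the matches with wc (the package list here is just illustrative):

```shell
# Simulated `pip freeze` output, filtered and counted
printf 'ptyprocess==0.7.0\npyasn1==0.4.2\npyasn1-modules==0.2.1\n' \
  | grep 'pyasn1' \
  | wc -l    # counts the two pyasn1 entries
```

This is a three-stage pipeline: printf produces text, grep keeps matching lines, and wc counts them.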

Can you figure out what would be the command to have a look at the last few rows?

Now that you have seen a more advanced use case, let’s take a step back and familiarize ourselves with the fundamental commands available on the command line.

# change a directory
cd [path_to_directory]

You can use cd not just to move around in an existing project, but to go anywhere in the filesystem. It is often coupled with special path names such as .. and ~. The first one means “one directory above,” while ~ indicates your home directory.
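A quick sandbox (the folder names are made up) showing cd, .. and ~ in action:

```shell
mkdir -p project/src    # a hypothetical project layout
cd project/src
pwd                     # ...ends in project/src
cd ..                   # one level up
pwd                     # ...ends in project
cd ~                    # jump to your home directory
pwd                     # prints your home directory
```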

Can you find out which path name indicates the current directory on the command line? Hint: you have used it in the version control module.

Users, Groups and Permissions

# list contents
ls

You can use this to look at what is inside a directory. Since you are no longer using a graphical interface, where you would automatically see the contents of a folder when you navigate to it, you will be using ls a lot. It has a few useful flags that expand the output with details such as the owner, modification time or, most importantly, file size. For example, ls -l produces output like the following:

-rw-r--r--   1 username  staff    92 27 May 15:17 LICENSE

Here you can see a lot of information on a single file (in this case LICENSE), including the user - username, the group staff and the permissions -rw-r--r--. Let’s go through those elements one by one.

Starting with the user - this identifies who you are on this particular machine. A user can be a part of different groups, with different permissions associated with each. This is very useful if you are configuring a machine collaboratively and want better security. For example, it is common practice to limit the permissions on such systems to only the most needed ones, so that the only users with complete permissions (so-called root access) are the administrators. You can create a new user with the following command:

useradd [username]

And then run a command as that user with:

su [username] -c [command]

There are a few other options available (such as the ones configuring the user groups), and you can learn more about them here.

The second concept we should go through is file permissions, represented by the string of characters at the beginning of the ls -l output. After the leading character (which marks the file type), the first three characters correspond to the owner’s (r)ead, (w)rite and e(x)ecute permissions. The next group of three holds the same permissions for the group, and the last three apply to all other users.

Those are mostly used to protect special files from accidental or malicious edits. Such files are most commonly dotfiles, or configuration files of critical importance. If you have super user privileges, you can change these permissions with the following command (this example grants (w)rite access to (o)ther users):

chmod o+w filename

Often you’ll mistype a command. Re-typing long commands can be tedious, but fortunately there are ways to navigate your command history. The first is to use the arrow keys on your keyboard to step back through previously executed commands. A handier method is the built-in reverse search through the complete shell history: press CTRL + R and start typing. You can type any part of a previous command and you should be able to find it.

An easier way to change those privileges is to use a numbering scheme instead (you can find one here):

chmod 755 filename
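A small sketch you can run safely on a scratch file; the [ -x ... ] test checks whether the execute bit ended up set:

```shell
touch script.sh
chmod 755 script.sh              # rwxr-xr-x: owner can write, everyone can read/execute
[ -x script.sh ] && echo "executable"

chmod 644 script.sh              # rw-r--r--: a typical setting for plain data files
[ -x script.sh ] || echo "not executable"
```

Each octal digit is the sum of read (4), write (2) and execute (1), so 755 decomposes into 7 = rwx for the owner and 5 = r-x for group and others.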

What is the numeric code for the most wide permissions for a file?

A great visual overview of the different file permissions can be seen here:

The command line, especially if you are in sudo mode, can irreversibly destroy a lot of information, and even your system. Here are a few mistakes that can happen.

This command: sudo rm -rf / will attempt to remove everything on your machine (modern systems may refuse unless forced, but don’t rely on that). The base command rm -rf is dangerous enough: if you don’t pay attention you can easily and irreversibly delete files and folders. > filename will empty a file; this usually happens when you are typing commands too quickly.

Create a new user, named deployment-user, add it to a user group called deployment and give all members there all permissions except write.

Most command-line tools have some form of documentation available, where you can discover how to use them. More obscure details, such as a complete list of command-line arguments, might not be immediately obvious, or even easily accessible on the internet, so you can check these help pages right in the terminal. To do this, use man [command name].

Basic operations

Now let’s take a step back and focus on the most common commands:

# print file
cat [filename]

This is a command we have already seen. This is often used when you just want to have a look at the contents of a file without opening it with some program. Careful when using this with large files, since it will start endlessly printing to your screen. If that happens you can always cancel the process by CTRL + C.

# copy file
cp [filename] [new_filename]

Copying on the command line skips the separate “paste” step you are used to from GUIs: the whole operation happens in one command.

# move/rename file
mv [filename_path] [filename_new_path]

This command can seem a bit counterintuitive, since you are using a move command to rename a file, but this is actually how it works.
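A tiny sandbox (the file names are invented) showing both the copy and the rename behaviour:

```shell
touch notes.txt
cp notes.txt notes_backup.txt   # copy: both files exist afterwards
mv notes.txt report.txt         # "move" within the same folder is a rename
ls                              # notes_backup.txt  report.txt
```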

# remove file
rm [filename_path]

One of the more dangerous commands on this list, this is how you delete a file. You should be very careful, since deleting things from the command line will NOT move them to a Trash directory, like you are used to in your GUIs.

# make directory
mkdir [directory_name]
# remove directory
rm -rf [directory_path]

Beginners often try to use the plain rm command on a directory, and find that it fails. You have to add the -r and -f flags, which stand for “recursive” and “force,” to make sure everything is deleted from the directory and any sub-directories.

# create empty file
touch [filename]
# execute as a super user
sudo [action]

We first introduced this topic in the section on users, groups and permissions. The concept of a “super user” is a very important one in *nix-like systems. You will mostly use this to elevate your privileges, especially when installing or uninstalling tools, or attempting something potentially dangerous, so proceed with caution. It will ask you for your password.

# see what processes are going on, press 'q' to quit
top

The top command can be used as an alternative to the Activity Monitor on Mac OS, or the Task Manager on Windows. It shows all the currently running processes on your machine and the resources they consume. This can be very useful if you have a process that is taking too long and you want to kill it. In that case you can grab the process id from the output of top, and use the kill command to cancel it.

# clear screen
clear

This command is used to clean up a terminal window, if you have too much output and want to re-focus.

The command line will also be the standard way for you to run Python scripts, create Python virtual environments, work with and deploy Docker containers, and use cloud providers. Those methods are shown in other modules of the course.

Aliasing

Some of the commands that you want to type can be very long. As any developer can attest - the more keystrokes you spend on the computer, the larger the chance for errors and bugs in your software. In those cases it makes sense to create “aliases,” or in other words - shortcuts for the commands you use.

For example, newer operating systems, such as Ubuntu 20.04 LTS, no longer have Python 2 installed. As you have learned in other parts of this course, the two versions of Python were a consistent thorn in the side of data scientists for years, right up until the deprecation of Python 2 in favor of 3. On those newer systems, if you type:

python

You’ll get an error: zsh: command not found: python. But if you use python3, all should work and it will start. Still, you can save some typing and confusion by creating a shortcut:

alias python=python3

Now every time you type python, it will run python3.
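An alias defined at the prompt disappears when you close the terminal. To keep it across sessions, add it to your shell configuration file; here are a couple of illustrative lines you might put in ~/.zshrc or ~/.bashrc:

```shell
# Persisted aliases - these load in every new shell session
alias python=python3
alias ll='ls -l'
```

You can check what an alias expands to by running `alias ll` (or `type ll`) in the shell.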

Third-party tools

A great playlist covering many command line tools is available below. It has a special focus on tools that are suitable for Data Scientists and Machine Learning Engineers.

The commands that you have so far used are built-in - that is they come preinstalled with your system. They are useful, but they are just scratching the surface of what is possible with the CLI. There is a multitude of tools that you can use there, and here we will go through a few.

wget

This is one of the best ways to download a file from the internet. It is often preferable even to downloading through a normal browser such as Google Chrome, where downloads can hang or be lost if you close the browser window by accident. Example usage:

wget [fileurl]

Use wget to download a file from a web page (e.g. an arXiv paper). Which browser functionality can help you with that?

sed

This tool (the “stream editor”) can edit the contents of a file very quickly. Unfortunately, its syntax is not the easiest. You can use it, for example, to remove all commas from a file:

sed 's/,//g' file  > new_file
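To see it work end to end, here is the same command applied to a small generated file (the file names are invented):

```shell
printf 'id,name\n1,cat\n2,dog\n' > pets.csv
sed 's/,//g' pets.csv > pets_nocommas.csv
cat pets_nocommas.csv    # idname, then 1cat, then 2dog
```

The s/,//g expression means: (s)ubstitute a comma with nothing, (g)lobally on every line.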


Use sed to change the delimiter of a whole file.

grep

This is a tool to search for content within a file. For example if we want to see if we have scikit-learn installed in a virtual environment we can do the following:

pip freeze | grep scikit

How can you use grep to get information on all of the current working processes on your machine related to your browser?

Command-line text editors

When working with remote machines, such as AWS EC2 instances, you will be required to edit files and write code without a GUI. While there have been developments in using tools such as VSCode to edit remote files, sometimes it’s easier to use a command-line text editor. There are many such tools, but the most popular are nano, vim and emacs. The latter two are the tools of choice for many programmers, and are extremely powerful (at the cost of a very steep learning curve). A nice way to learn vim is by checking out the VIM Adventures website - you can learn it by playing a game!

For our purposes we’ll focus on using the basics of vim to edit files. It is more complex than nano but provides some great added functionality, such as syntax highlighting, to make your remote editing work easier.

Vim comes preinstalled on Linux, which is the typical server machine, and you can start it like this:

vim testfile.py

This will create the new file and open it for editing. Vim has several modes of operation; the default is normal mode, which you can use to navigate the code with the arrow keys. To enter insert mode and edit the file, press i. Then you can start typing, as you would normally in a GUI editor. After you’re done, return to normal mode by pressing ESC. From there you can use vim commands to save and exit: :wq writes and quits, while :q! quits without saving.

CLI use cases

Interfaces for cloud computing

Another common reason to be familiar with command-line tools arises when you start to work with cloud technologies. Nowadays it is almost always the case that your data science project will be supported by some cloud infrastructure, be it AWS, Azure or GCP (these are covered in the module on data engineering).

Those services provide graphical interfaces to do work, and you can achieve quite a lot in this “no code” environment, but to be truly productive you should learn how to use their command-line tools. We’ll go through a common example using AWS (the other cloud providers have tools and services that work in quite a similar way).

We’ll be using the AWS Command-line Interface, which you can set up from the official website (note that you will need an AWS account to make use of this; there’s a free tier available). Let’s see several examples of what you can do with such tools.

aws ec2 describe-instances

This command lists the EC2 instances in your account, providing a nice overview. Data on AWS is normally stored in containers called “buckets.” You can see the contents of a bucket with this command:

aws s3 ls s3://mybucket

And finally - we can synchronize the files on your machines to those on the S3 bucket with a command like this:

aws s3 cp myfolder s3://mybucket/myfolder --recursive

Remember what --recursive means in the context of the command line? It applies here too. Now, let’s have a look at the motivating case where we would require a command-line text editor to edit files: connecting to a remote machine.

If you are using AWS, you have access to the creation of remote machines - called Elastic Compute Cloud, abbreviated EC2. You can use the GUI - the AWS Console - to create one (again, for more instructions on this visit the data engineering module). After that you can connect to it using ssh, the Secure Shell protocol (you should have a key file to authenticate):

ssh -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-names

Now your terminal prompt will change and you are working from within the remote machine! Now you can do all you want - you can download data from s3, set up git, write code with vim, automate data pipelines with shell scripting (coming up below!) and many other things, which would be harder to do in a GUI environment! More than this - you have almost unlimited resources on the ec2 - which becomes a factor if you are dealing with large amounts of data, or if you are using a weak local machine.

Here you can also see the use case for a terminal multiplexer, such as tmux. While locally you could use a GUI like iTerm2, on the EC2 instance you won’t have that luxury, yet you will often need to run several commands at the same time - such as starting a Jupyter Notebook while downloading data.

Secure applications of shell

Another very useful application of the command line is the ability to store security credentials safely. It is a common beginner’s mistake to commit cloud API keys and passwords to version control systems such as git. When that code is pushed to a remote repository, the secrets become available to a number of people - possibly the whole internet, if the project is open source. This can easily be avoided by setting environment variables in the shell. The place to do it is the .zshrc or .bashrc file, depending on which shell you use. When you open this file, either with your IDE or a command-line text editor, you can add lines like these:

export AWS_username="My secret username"
export AWS_password="My secret password"

After this you need to save the file and either restart your terminal, or reload the shell configuration with source ~/.zshrc. You can check that the variables are set by printing them with echo $AWS_username. But how do you use them in your scripts? Let’s cover a Python example of a simple program:

import os

# Both forms read an environment variable; they return None if it is unset
USER = os.getenv('AWS_username')
PASSWORD = os.environ.get('AWS_password')

After this you will be able to use those variables in your code - and commit to version control safely.
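You can verify the whole chain from the shell. The values here are placeholders, never real credentials:

```shell
# Export a (placeholder) variable so that child processes inherit it
export AWS_username="demo-user"

# Any program started from this shell can read it - here via python3
python3 -c 'import os; print(os.getenv("AWS_username"))'    # prints demo-user
```

This works because exported shell variables are copied into the environment of every process the shell launches.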

Automation with shell scripting

In order to harness the automation power of those tools you can add them to a shell script. Let’s start with the most basic example. Here are the contents of a sample file, called first_shell_script.sh.

First install wget, which is a program to download from the internet:

brew install wget
mkdir data
cd data
wget http://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/SMNI_CMI_TEST.tar.gz
tar -xzf SMNI_CMI_TEST.tar.gz
mkdir csv_data
mkdir excel_data
mv *csv csv_data/
mv *xlsx excel_data/
echo "Done"

And then you can run the file like this:

bash first_shell_script.sh

Can you describe what this shell script would do?

A great course on learning the command line is available on Codecademy (2 hours).

The definitive solution if you want to master the command line tools for data science is the O’Reilly book on the topic, which you can read online here.

Exercises

Write a small shell script which takes the 10th row of a file and stores it into a new one.

Create a pipeline of four commands.

Set two environment variables in your .zshrc or .bashrc and use them in a Python program. For example set your names there, and print them lowercased through Python.

Automate data download and processing of a Kaggle dataset with a shell script. This script should contain several commands. Kaggle provides a command-line interface tool that you can download and install. The first command should create the folder structure for the project, after this the data should be downloaded and extracted, and finally moved to the right folders.

Many data science projects share the same structure, and you as a data scientist will often need to recreate it every time you start a project. Automate this work by creating a shell script that creates the folder structure and files for a typical data science project. You can look at the Cookiecutter Data Science project for inspiration.

How can we use the > operator to store the dependencies of a Python project, within a virtual environment?

Search in Google for data science command-line tools, that might be useful. Install and try such a tool, and make a short presentation on how it works, and why would we want to use it?

Use a command-line text editor of your choice to modify your .zshrc file, to change the theme. An overview of the themes available is on the official website.

Quizzes

What’s the default text editor in Unix-based systems (Unix, Linux and Mac OS)?

What is “piping” when referring to the command line?

What is the main reason you might want to learn how to use a command-line text editor, such as vim?

What does the term “recursive” mean in the context of the command-line?

What is the use of the $PATH environmental variable?

Provide two situations where using the command-line is a better idea than a GUI?

What is a scenario, where you might actually prefer to use a GUI instead?

Why would a data scientist want to use the command line to see the first rows of a dataset?