8 CLI use cases

8.1 Interfaces for cloud computing

Another common reason to become familiar with command-line tools arises when you start working with cloud technologies. Nowadays it is almost always the case that your data science project will be supported by some cloud infrastructure, be it AWS, Azure or GCP (these are covered in the module on data engineering).

Those services provide graphical interfaces to do your work, and you can achieve quite a lot in this “no code” environment, but to be truly productive you should learn how to use their command-line tools. We’ll go through a common example using AWS (the other cloud providers have tools and services that work in quite a similar way).

We’ll be using the AWS Command Line Interface (CLI), which you can install from the official website (note that you will need an AWS account to make use of it; there’s a free tier available). Before the CLI can talk to your account, you first need to link it to your credentials.
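A typical first-time setup looks like this: run aws configure, and the CLI prompts you for the access keys you generated in the AWS Console (the region and output format below are just example values):

aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: eu-west-1
Default output format [None]: json

With that in place, let’s see several examples of what you can do with such tools.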

aws ec2 describe-instances

This command lists the EC2 instances in your account, providing a nice overview. The raw output is a long JSON document; as a sketch, you can narrow it down with the --query flag, which takes a JMESPath expression (the instance-ID query below is just one illustration):
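aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' --output text

Data on AWS is normally stored in S3, in top-level containers called “buckets.” You can see the contents of a bucket with this command: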

aws s3 ls s3://mybucket

And finally - we can copy the files in a local folder up to the S3 bucket with a command like this:

aws s3 cp myfolder s3://mybucket/myfolder --recursive
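If you run this kind of upload repeatedly, there is also aws s3 sync, which transfers only the files that are new or have changed (same placeholder bucket and folder names as above):

aws s3 sync myfolder s3://mybucket/myfolder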

Remember what --recursive means in the context of the command line? The same idea applies to the cp command above: without it, only a single file would be copied, not the folder and its contents. Now, let’s have a look at the motivating case where we would need a command-line text editor to edit files - connecting to a remote machine.

If you are using AWS you can create remote machines, a service called Elastic Compute Cloud and abbreviated as EC2. You can use the GUI - the AWS Console - to create one (again, for more instructions on this visit the data engineering module). After that you can connect to it using ssh, the Secure Shell protocol (you should have a key file to authenticate):

ssh -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name

Now your terminal prompt will change, and you are working from within the remote machine! Here you can do everything you need - download data from S3, set up git, write code with vim, automate data pipelines with shell scripting (coming up below!) and many other things that would be harder to do in a GUI environment. More than this - an EC2 instance gives you access to as much compute and storage as you are willing to pay for, which becomes a factor if you are dealing with large amounts of data or working from a weak local machine.
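As a sketch, the first steps on a fresh instance might look like this (assuming Amazon Linux, where the AWS CLI comes preinstalled and packages are installed with yum; the bucket and file names are placeholders):

sudo yum install -y git tmux          # install git and tmux
aws s3 cp s3://mybucket/data.csv .    # pull a data file down from S3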

Here you should also see the use case for a terminal multiplexer such as tmux. While locally you could use a GUI such as iTerm2, on the EC2 instance you won’t have that luxury, but you will still often need to run several commands at the same time - such as starting a Jupyter Notebook and downloading data.
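A minimal tmux workflow looks like this:

tmux new -s work       # start a session named "work"
# run your long command, then press Ctrl-b followed by d to detach
tmux ls                # list the sessions that are still running
tmux attach -t work    # re-attach later, with everything still in place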

8.2 Secure applications of the shell

Another very useful application of the command line is the ability to store security credentials safely. It is a common beginner’s mistake to hard-code cloud API keys and passwords in programs and then commit them to a version control system such as git. When this code is pushed to a remote repository, these private data become available to a number of people - possibly the whole internet, if the project is open source. This is easily avoided by setting environment variables in the shell. The place to do it is your .zshrc or .bashrc file, depending on which shell you use. Open this file, either with your IDE or a command-line text editor, and add lines like these:

export AWS_username="My secret username"
export AWS_password="My secret password"

After this you need to save the file and either restart your terminal, or reload the shell configuration with source ~/.zshrc. You can check that the variables are set by printing one with echo $AWS_username. But how do you use them in your scripts? Let’s cover a Python example of a simple program:

import os

# Read the credentials from the environment; both calls return None
# (rather than raising an error) if the variable is not set.
USER = os.getenv('AWS_username')
PASSWORD = os.environ.get('AWS_password')

After this you will be able to use those variables in your code - and commit it to version control safely.
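The same variables are visible to shell scripts as well. As a small sketch, a script can check for them and refuse to run when the credentials are missing (the messages are just illustrative):

if [ -z "$AWS_username" ]; then
    echo "AWS_username is not set - aborting" >&2
    exit 1
fi
echo "Credentials loaded for $AWS_username"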

8.3 Automation with shell scripting

In order to harness the automation power of those tools, you can add them to a shell script. Let’s start with the most basic example: a sample file called first_shell_script.sh.

First install wget, which is a program to download files from the internet (this uses Homebrew on macOS; on Debian or Ubuntu Linux, sudo apt install wget does the same):

brew install wget

Here are the contents of first_shell_script.sh. Can you describe what this shell script would do?

mkdir data                      # create a working directory and move into it
cd data
wget http://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/SMNI_CMI_TEST.tar.gz
tar -xzf SMNI_CMI_TEST.tar.gz   # extract the archive (gunzip alone would only yield a .tar file)
mkdir csv_data
mkdir excel_data
mv *.csv csv_data/              # sort the extracted files by type
mv *.xlsx excel_data/
echo "Done"

And then you can run the file like this:

bash first_shell_script.sh
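Alternatively, you can make the script executable once and then run it directly. For this to work, add #!/bin/bash as the very first line of the file, so the system knows which interpreter to use:

chmod +x first_shell_script.sh
./first_shell_script.sh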