This module will introduce you to:

Fundamental concepts in version control,
The most popular open source software for version control – git, and,
The most popular online tool – GitHub

Learning objectives

By the end of this module you’ll understand:

How version control systems help you organize both your own work and that of your team.
How to use the most fundamental commands of git in combination with GitHub.

What is Version Control?

Version control is the ability to track changes in files (i.e. different versions of a file). Version control is an essential component in developing modern, large-scale software. These projects involve large teams, with individuals working on specific components of the larger project, all at the same time. Although you’re not working on a large team yet, version control is also useful for tracking your own work, and, as we’ll see, presenting it to the world as part of your data science portfolio.

Why do we need version control?

Most projects are not accomplished in isolation, by one person. When many people are involved in the same piece of work, collaboration challenges can arise. Those can already be observed in non-technical work. All of us have experience with trying to collaborate with people by sending different files via e-mail. In this case we tend to use many different file names such as presentation_final_final_v3.pptx. We know how quickly this can get out of hand and become frustrating, and lead us to spend more time on tooling than on actual work.

This problem becomes even more apparent when collaborating on code. Of course, this is not a new problem and has been addressed by a variety of different tools, all of which fall under the umbrella of “version control systems,” or VCS. By far the most popular version control tool is git. Coincidentally, git was created by the same person who created the Linux operating system, Linus Torvalds. Just as Linux is named after its creator, so too is git, but with a lot more tongue-in-cheek humor. git is British English slang for a stupid or worthless person. It’s nice to know that Linus has a sense of humor!

Other popular tools are Mercurial and Apache Subversion - you might encounter those when working with older, legacy code bases.

Installation

At the time of writing, git v2.33.0 is available. You don’t necessarily need the latest version, since the fundamental commands won’t change much. To check if you have git installed, execute the following command in the terminal:

git --version

If git is installed, you’ll see something like:

git version 2.33.0

macOS and most Linux distributions usually come with git already installed. If it is missing on macOS, you can install it with Homebrew:

brew install git

Windows users will likely have to install git themselves. If you set up Chocolatey in the last section on command line tools, you can execute:

choco install git

Otherwise, go to the official website and follow the installation instructions.

Fundamental git commands

You already know that a folder in your OS GUI is the same as a directory in your terminal. For version control, we’ll refer to a folder/directory as a repository, or repo for short. A repository is just a place where you store things, like an archive. Although version control monitors individual files, it operates on all files in a directory. Thus if we have initialized git in a directory, we can refer to it as a git repository. It just means that git is able to monitor changes to all files in the directory.

There are many commands that you can execute with git. To see a list just execute the following in the terminal:

git

This tells the terminal to start the git software, but because we didn’t specify which command to execute, we’ll get a help page on the available commands and how to use them.

The following tables provide an overview of the most common git commands (the so-called “porcelain” commands: high-level ones, intended to be human readable). This is mostly for reference, since you’ll only use a handful of commands on a regular basis, such as:

The following commands are useful for managing your repository:

Command	Description
branch	List, create, or delete branches
checkout	Switch branches or restore working tree files
cherry-pick	Apply the changes introduced by some existing commits
clean	Remove untracked files from the working tree
clone	Clone a repository into a new directory
diff	Show changes between commits, commit and working tree, etc
fetch	Download objects and refs from another repository
merge	Join two or more development histories together
mv	Move or rename a file, a directory, or a symlink
rebase	Reapply commits on top of another base tip
reset	Reset current HEAD to the specified state
restore	Restore working tree files
revert	Revert some existing commits
rm	Remove files from the working tree and from the index
stash	Stash the changes in a dirty working directory away

And the following commands are less common:

Let’s get started with a play example. Inside your documents folder, create a folder for our class work and another for experimenting with git:

mkdir -p Documents/Misk_DSI_2021/new_project
cd Documents/Misk_DSI_2021/new_project

Now initialize git in this directory. You’ll do this every time you start a new project.

git init

The git init command will initialize a new local git repository, stored in a hidden folder called .git. You can see it on the command line with the ls -a command. All code changes are tracked there, so if you remove that folder, you get rid of all the history. Note that if you are using a more advanced command line setup, such as oh-my-zsh, the command prompt might change to indicate that you are working in a version controlled environment.
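To see what git init actually creates, here is a minimal sketch that runs it in a throwaway directory. The use of mktemp and the variable names are our own assumptions for illustration, not part of the course setup.

```shell
# Use a scratch directory so the experiment leaves your real folders untouched
scratch="$(mktemp -d)"
cd "$scratch"

git init -q    # -q keeps the output quiet
ls -a          # the hidden .git folder now shows up in the listing

# Ask git whether we are inside a repository's working tree
inside="$(git rev-parse --is-inside-work-tree)"
echo "inside a work tree: $inside"
```

Deleting the .git folder would turn the directory back into an ordinary folder with no history.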

Additional information: git cheatsheet The main commands are also listed in this cheatsheet that you can print out and keep as a reference.

Now that we have initialized a git repository, let’s create some code changes. We can use the touch command to create an example empty file:

touch test_file.txt
git status

The git status command is another one you will use very often. It shows the current status of the local repository: whether there are new files, and which files have been modified or deleted. We advise you to use this command every time you do something new in git, to make sure you know what the situation is and don’t make irreversible mistakes.
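As a small illustration, the sketch below captures the status report before and after staging a file, so you can compare the two. It runs in a scratch directory created with mktemp (an assumption for this example), and the variable names are hypothetical.

```shell
cd "$(mktemp -d)" && git init -q
touch test_file.txt

# Before staging: the file is reported as untracked
before="$(git status)"
echo "$before"

# After staging: the same file is reported as a change to be committed
git add test_file.txt
after="$(git status)"
echo "$after"
```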

Some of the most common mistakes that newcomers to the git world make result from copying and pasting things from Stack Overflow - the common trial and error method, which in other situations can actually be a productive workflow. The issue with git is that the more advanced commands can be very hard to understand, with all the different arguments attached. Unfortunately, one can easily do irreversible damage. One suggestion from us is to always work in a separate branch, commit your work often, and avoid commands with the --force flag, since those tend to override a lot of the safety mechanisms of standard version control workflows.

Blended learning: The Git Book

For a thorough introduction to git and version control, have a look at the Pro Git book, available online for free here.

Let’s have a look at a few other git commands that can be useful. Check out how you have configured git with the following command:

git config --list

This lists your current git settings. If the list is longer than your terminal window, git shows it in a pager (usually less), which you navigate much like the classic vim editor. There is nothing to change here; to exit the pager, type q for quit. This should bring you back to your terminal window.

Set your user name and email here. Keep it consistent and enter the email that you use for online services. We’ll sign up for GitHub later.

git config user.name "Your name here"
git config user.email "Your e-mail here"

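You normally need to do this step the first time you use git on a new machine (often a remote, reusable machine, such as in cloud computing). Here, you set up your name and e-mail, so git traces changes back to you as the author. As written, the commands apply only to the current repository; add the --global flag right after git config to set them once for every repository on the machine.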
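If you want to confirm what git will record as the author, you can read the values back. The sketch below does this in a scratch repository; the name and e-mail are placeholders we made up for the example.

```shell
cd "$(mktemp -d)" && git init -q

# Placeholder identity, stored in this repository's .git/config
git config user.name "Ada Example"
git config user.email "ada@example.com"

# Read the values back to confirm what git will record on commits
name="$(git config --get user.name)"
email="$(git config --get user.email)"
echo "$name <$email>"
```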

We haven’t done much to our repo yet, but in case you want to copy it to a new machine, use:

git clone [repo url]

This will clone an entire repo (containing not only the code, but all other version control information) to a new machine. An alternative is to download the .zip file from the GitHub UI (more on GitHub below), but the clone option is strongly recommended. You can already see the benefits of .git here - a new person on a team can quickly get access to all the previous history of the project.
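You can try cloning without any server at all, because a path on your own disk can stand in for [repo url]. The following sketch assumes hypothetical folder names (original, copy) and a scratch directory; it builds a tiny repository and clones it.

```shell
work="$(mktemp -d)" && cd "$work"

# Build a tiny source repository with one commit
git init -q original && cd original
git config user.name "Demo User" && git config user.email "demo@example.com"
echo "hello" > notes.txt
git add notes.txt && git commit -q -m "Add notes"
cd "$work"

# A filesystem path stands in for [repo url]
git clone -q original copy

# The copy carries the commit history, not just the files
commits_in_copy="$(git -C copy rev-list --count HEAD)"
echo "commits in the clone: $commits_in_copy"
```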

Blended learning: Interactive git

If you want to try the commands that we learn here interactively, head over to Katacoda’s online portal here.

Now that we know how to setup and check the status of our git repository, how do we add our changes? Not surprisingly, we use the add command:

git add .

Recall that . refers to the current directory in the terminal; here it tells git to stage every new or modified file in the current directory and its subfolders. You can also use git add --all. Alternatively, you can list files specifically:

git add [filename]

Listing file names after git add stages only those files, which gives you precise control over what becomes part of your codebase.

Once you are ready with your work for the time being (e.g. when reaching a milestone in building a feature), you can commit. If you just type git commit you will enter your default command-line editor (normally nano or vim) and you have to type your commit message there. If you want to avoid that, you can use the -m flag followed by your commit message, in quotes, describing what the commit is all about.

git commit -m "Add a test file"

Always include a message when you commit your changes. Typically the messages should be very short and complete the sentence: “This commit will …”
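Putting the add and commit steps together, a complete first commit might look like the sketch below. The scratch directory and the demo identity are assumptions for the example; in your own project you would skip those lines.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"

touch test_file.txt
git add test_file.txt                 # stage the file
git commit -q -m "Add a test file"    # record the staged snapshot

# Show the message of the most recent commit
last_message="$(git log -1 --pretty=%s)"
echo "$last_message"
```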


If you added some files to git by mistake, you can unstage everything you have added since the previous commit and start fresh. The files themselves stay untouched in your working directory:

git reset
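Here is a minimal sketch of that situation, with hypothetical file names: a file is staged by mistake, and git reset takes it out of the staging area again without deleting it.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"
echo "v1" > keep.txt
git add keep.txt && git commit -q -m "Add keep.txt"

echo "oops" > mistake.txt
git add mistake.txt     # staged by mistake
git reset -q            # unstage everything added since the last commit

# The file is still on disk, but nothing is staged any more
staged="$(git diff --cached --name-only)"
echo "staged files: '$staged'"
ls
```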

This command will show you the commit history:

git log

This is especially interesting if you want to have a look at a previous state of the codebase. You can copy the commit ID from the log and switch to that state with git checkout [sha]. There are some more powerful commands that also take advantage of those IDs.
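The sketch below walks through that idea on a two-commit history: it looks up the ID of the older commit, checks it out to inspect the old file content, and then returns to the branch. File names and variable names are hypothetical.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"

echo "first" > story.txt
git add story.txt && git commit -q -m "Add story"
echo "second" >> story.txt
git commit -q -am "Extend story"

git log --oneline                            # one line per commit, newest first

branch="$(git symbolic-ref --short HEAD)"    # remember which branch we are on
old_sha="$(git rev-parse HEAD~1)"            # ID of the commit before the current one

git checkout -q "$old_sha"                   # files now show the older state
old_content="$(cat story.txt)"
echo "content at $old_sha: $old_content"

git checkout -q "$branch"                    # go back to the latest state
new_last_line="$(tail -n 1 story.txt)"
```

Checking out a commit ID puts you in what git calls a “detached HEAD” state, which is fine for looking around; switch back to a branch before you continue working.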

Branches

The git branch command,

git branch

allows the creation of new branches where you can work. It is a common practice in a modern environment to always work on a separate branch from your main (or, less preferred, master) branch, and then merge into main only when your work is complete. One rule is that each feature should live on its own branch. You can use this when learning, since almost nothing you try out should contaminate the master branch in your repository, so you can experiment without worrying. Have a look at the figure below for a visual explanation of how branching works. Imagine you are moving from left to right, and the project is progressing through different versions, and different features developed at the same time finally need to be combined into one:

Let’s start with creating a new branch:

git branch new-branch

A note on naming conventions. Many engineers would admit that naming things is one of the hardest things to do right in software. While there’s no one correct way to name branches, a good practice is to combine a feature name with a ticket (or issue) number if you are using a project management system, such as Jira or GitHub Projects, to organize the work. So a typical branch might look like new-feature/233.

You still need to switch to this branch after creating it. You can do this with the git checkout command followed by the branch name. If you want to create a new branch and switch to it in one step (instead of running git branch first), you can use the -b flag:

git checkout -b new-branch

Remember, you should be able to see which branch you are on either in your code editor or at the command line. If you are using a framework such as zshell you might even have a nicely formatted prompt! Still, if you are unsure which branch you are currently working on, or what other branches are available, you can use the following command to have a look:

git branch --list

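The current branch is marked with an asterisk (*). If you want to delete a branch, switch to a different branch first and then use the command below. Note that -D deletes the branch even if its work has not been merged yet; the lowercase -d refuses in that case: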

git branch -D new-branch
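The branch commands above can be strung together as in the following sketch, which creates a branch, switches to it, lists the branches and cleans up again. It uses the more cautious -d for deletion, and the scratch repository is an assumption for the example.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"
touch test_file.txt
git add test_file.txt && git commit -q -m "Add a test file"

git branch new-branch          # create the branch, but stay where we are
git checkout -q new-branch     # switch to it
current="$(git symbolic-ref --short HEAD)"
git branch --list              # the asterisk now marks new-branch

git checkout -q -              # "-" means: the branch we were on before
git branch -d new-branch       # safe delete; works because nothing is unmerged
remaining="$(git branch --list new-branch)"
```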

Blended learning: git branching For a nice interactive way to learn how to use git branches, head over to this website.

git merge [branch_name]

This command is used to merge another branch into the one you are currently on. This often happens on a remote repository (e.g. on GitHub or Bitbucket) via a Pull Request, or PR for short. Still, you can also do it locally, and normally the time to do it is when you are done working on your feature on a certain branch and want to add it to the master one.
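A local merge without any conflict can be sketched as follows. The branch name new-feature and the file names are hypothetical, and the name of the starting branch is looked up rather than assumed, since it may be main or master depending on your git settings.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"
echo "base" > base.txt
git add base.txt && git commit -q -m "Add base file"
main_branch="$(git symbolic-ref --short HEAD)"   # main or master

# Do the feature work on its own branch
git checkout -q -b new-feature
echo "feature work" > feature.txt
git add feature.txt && git commit -q -m "Add feature file"

# Switch back and bring the feature in
git checkout -q "$main_branch"
git merge -q new-feature
ls    # feature.txt is now part of the main line as well
```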

One of the most dreaded elements of working in a version controlled environment happens when you are merging branches - the merge conflict. Version control systems such as git are quite smart in doing this, but they also have their limitations. If you are collaborating with a few people on the same exact piece of the code, breaking changes occur that cannot be automatically resolved by git and require a human intervention.

The merging process then stops, and git inserts <<<<<<<, ======= and >>>>>>> markers into the affected files to show which part of the code comes from which branch and where the conflicting lines are. Note that if you are using more modern tools, such as the VSCode editor, you can work on those merge conflicts more easily, since they provide you with a nice GUI to fix them. After fixing the merge conflict, you normally make a commit to conclude the merge.

git push

With the push command you upload your changes to the remote repository (e.g. GitHub), which we’ll get to in the next section. You should always do that at the end of the day, since it also serves as a backup in case something happens to your laptop or hard drive, and, more importantly, your collaborators will be able to see your work and use it in time.

git pull

The pull command downloads any changes to the code from a remote repository. This is usually done at the start of a working day, when you want to see what your colleagues have been working on, and want to make sure you have the most recent version of the code. Note that this step is another common time point which results in merge conflicts.
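To get a feel for push and pull before signing up anywhere, you can let a local “bare” repository play the role of the remote. The sketch below is only an illustration under that assumption: remote.git stands in for GitHub, and alice and bob are two hypothetical collaborators with their own clones.

```shell
work="$(mktemp -d)" && cd "$work"
git init -q --bare remote.git          # stand-in for the remote repository

# Alice clones the (still empty) remote, commits and pushes
git clone -q remote.git alice
cd alice
git config user.name "Alice Example" && git config user.email "alice@example.com"
echo "first draft" > report.txt
git add report.txt && git commit -q -m "Add report"
git push -q origin HEAD                # upload the current branch
cd "$work"

# Bob clones and gets Alice's first commit
git clone -q remote.git bob

# Alice keeps working and pushes again
cd alice
echo "second draft" >> report.txt
git commit -q -am "Extend report"
git push -q origin HEAD
cd "$work"

# Bob pulls to catch up with Alice's latest work
cd bob
git pull -q
latest="$(tail -n 1 report.txt)"
echo "bob now sees: $latest"
```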

git add .
git stash

A common pattern when collaborating through git is wanting to check out another person’s (or your own) branch. If you try to do this while having changed files (“changes in the working tree” is the text you’ll see in the terminal), those changes will be carried over to that branch automatically. They will not be committed, but you have to pay attention if you start working on that branch. For this reason it makes sense to add individual files to your commits, instead of using the typical git add -A or git add . commands.

A harder issue to tackle can happen if you have unfinished work on the same file(s) as the other branch. In this case you’ll get a message asking you to commit or stash your changes before switching branches. This sounds good, but what if you don’t want to commit those changes, e.g. because they were part of exploratory work? In this case you have the git stash command. You can imagine this command to be similar to saving a draft. You should notice that in the code chunk we ran git add first. Adding (“staging”) makes sure that new files git is not tracking yet are included, since by default git stash only saves changes to files git already tracks. Those changes will disappear from your working tree and be stored separately, and you are free to check out the other person’s work. When you are done doing that you can restore your work with the following command:

git stash pop

This will take the most recent stash (there can be several stashes, forming a log of them) and apply it to your working tree, so you can continue from where you left off, without worrying about committing unfinished work.
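The stash round trip can be sketched like this, using a hypothetical notes.txt with an unfinished edit that is put aside and later restored.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"
echo "finished part" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

echo "half-finished idea" >> notes.txt   # uncommitted change to a tracked file
git add .
git stash -q                             # put the change aside

# While stashed, the working tree matches the last commit again
changed_while_stashed="$(git diff --name-only)"
git stash list                           # the stash log has one entry

git stash pop -q                         # bring the change back
restored="$(tail -n 1 notes.txt)"
echo "restored line: $restored"
```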

Not everything should be committed to a git repository. Files which contain sensitive information (such as authentication credentials), virtual environments or large datasets should be ignored.

For this there is a file called .gitignore that you can create in your folder. There you can list all the files and folders that you don’t want added to .git. For more information visit the official documentation.
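As an illustration, the sketch below creates a small .gitignore and then stages everything with git add . to show which files git picks up. The file and folder names (secrets.env, data/, analysis.py) are made up for the example.

```shell
cd "$(mktemp -d)" && git init -q

# One pattern per line: a single file and a whole folder
printf 'secrets.env\ndata/\n' > .gitignore

touch secrets.env analysis.py
mkdir data && touch data/big.csv

git add .                      # ignored paths are skipped silently
tracked="$(git ls-files)"      # everything git has picked up so far
echo "$tracked"
```

Note that the .gitignore file itself gets added, which is what you want: that way your collaborators share the same ignore rules.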

Blended learning: Learning git and GitHub

A very thorough playlist on learning git and GitHub is available below.

GitHub

Working with git can be made much easier in combination with other services. There are tools that provide graphical interfaces that make it easier to collaborate and manage more complex software projects. For this module we’ll be using GitHub, but there are alternatives. We’ll list them below for more information. Those repository hosting platforms often offer a multitude of other features which are useful for software development, such as Kanban boards for project management, popular open-source directory listings, DevOps tools and others, so it makes sense to explore those options.

GitHub is by far the most popular, especially for open-source projects; it has a free plan and is easy to set up. Still, the other tools on this list provide very similar functionality, and you’ll probably encounter them in your career.

The first thing that you need to do is create an account on GitHub. After this, let’s make a new empty repository! You can do this in several different ways, either by going to this URL: https://github.com/new or from the left side of the home page (once you are logged in):

On the screen for creating a new repository there are several options. The first obvious one is giving the repository a name and setting it either to public or private. Since you are still learning the tool, we recommend going for the private option. In this way only people you add will be able to see your work (you can always change this setting later).

Finally, you have several options for setting up the repository itself. You can add a README.md file right away. This is a common best practice (you can of course do this step manually as well). This file will be shown on the page of the repository, and it normally contains instructions and additional information on what the repository is about and how to use it. You can have a look at the repository for this module for an example of a README.md file. The second option allows you to set up which files should not be added to the repository. Check out the subsection on Data Versioning for more on this. The last option allows you to add one of the common software licenses. This is useful if you are familiar with the legal aspects of licensing and want to distribute your code to other people, e.g. by making it open source.

After creating the repository you are greeted with a screen giving you options on how to connect your local code to the remote GitHub location. You can either start from scratch, or add existing code. You can also use the GitHub interface to add code and content (e.g. you might try to use it to make the README.md file).

Branching and pull requests

Any git branch you have locally can be pushed (uploaded) to the remote GitHub repository. You will then be able to select the different branches from the interface, and merge changes between them by creating a “pull request” (PR). While changes can be merged on the command line without creating a PR first, for larger chunks of work (think new features), it’s a best practice to create a PR. Let’s see why. First of all you can create a PR from a new branch in the interface as shown here:

As mentioned, GitHub allows you to see what branches are available in the repository:

Here’s the interface for creating a PR. Notice how you can select the branches between which the merge should happen and add additional context. Here you can also specify to whom this PR is assigned. This is an important element, since one of the main purposes of a PR in a collaborative team is code review. Normally only reviewed work should be allowed to be merged into the main branch.

After the PR is opened, general and specific comments can be added in the next window, since the review process is normally a collaborative, back-and-forth procedure.

The following “Files changed” tab shows you the “diff” for a specific commit and a specific file. You are able to comment at this level if something is not right.

One neat feature here is that you can actually check out the PR locally on your machine. It works as any other branch would. This allows the reviewer to really test the code, and not just read it.

Finally, after you (or someone else) reviews the PR, it can be merged and closed. Then it becomes a part of the git log commit history, for which GitHub also has a good overview section:

Merge conflicts

One of the most painful challenges to figure out when using git is dealing with so-called “merge conflicts.” Normally, git is very capable of combining collaborative work on the same files (this is, after all, one of the motivating factors behind version control in general), but even it sometimes fails. In this case, two commits have changed the same line(s) of code, and a human needs to decide which one is correct. This can happen when you pull changes or when you merge two branches. git status will then report the unmerged files:

$ git status
On branch main
You have unmerged paths.
(fix conflicts and run "git commit")
(use "git merge --abort" to abort the merge)

Unmerged paths:
(use "git add <file>..." to mark resolution)

both modified:   README.MD

When this happens, the files containing the merge conflicts will be modified, and they will look like the following:

$ cat merge.txt
<<<<<<< HEAD
import pandas as pd

data = pd.read_csv("data.csv")
=======
data = pd.read_csv("data.csv", sep="\t")
>>>>>>> new_branch

Those added separators demarcate the lines where there are conflicts (there can be more than one conflict in the same file, and conflicts in several different files), and which changes come from which commit. The way to fix a conflict is to delete the code lines which are not needed, and also to remove the added separator lines (<<<<<<<, ======= and >>>>>>>). Modern IDEs and text editors, such as VSCode, provide additional GUI functionality which helps resolve merge conflicts just by clicking (accepting) the correct code chunks.
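If you would like to see a conflict in a safe place before meeting one in a real project, the sketch below provokes one on purpose and then resolves it. The file settings.txt and its contents are made up for the example, and the starting branch name is looked up rather than assumed.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User" && git config user.email "demo@example.com"
echo "color = blue" > settings.txt
git add settings.txt && git commit -q -m "Add settings"
main_branch="$(git symbolic-ref --short HEAD)"

# Change the line on a new branch ...
git checkout -q -b new_branch
echo "color = green" > settings.txt
git commit -q -am "Use green"

# ... and change the same line differently on the original branch
git checkout -q "$main_branch"
echo "color = red" > settings.txt
git commit -q -am "Use red"

# git cannot pick a winner, so the merge stops
git merge new_branch || echo "merge stopped because of a conflict"
conflict_text="$(cat settings.txt)"
echo "$conflict_text"          # shows the separator lines around both versions

# Resolve: write the content we want, stage it, and conclude the merge
echo "color = green" > settings.txt
git add settings.txt
git commit -q -m "Merge new_branch, keep green"
```

Here the resolution simply overwrites the file with the desired line; in a longer file you would edit only the region between the separators.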

Forking

One of the features that make tools such as GitHub great for collaboration is the ability to “fork” a repository. The purpose here is to enable people to work on code without being explicitly added to the original repository’s team. A typical use case is a large open source project, such as tensorflow. Normally few people would be given collaborator access rights to the tensorflow repository, but a lot of people would like to contribute. This is why a common practice for an initial contribution is to fork the tensorflow repository, meaning that contributors copy the code and all associated git history to their own profile. After this they are able to work independently, and only when they are done do they submit a pull request to the original repository, so that their changes are added there after review.

To fork a repository you can use the button from the interface:

Playground repository

The best way to test everything you have learned so far is with a real world example. For this we have created a special playground repository for the course. It is available at this URL. Here you can try all of git and GitHub’s features without worrying about breaking anything, and you can do this collaboratively with your classmates, much like a real team! Refer to the Exercises section for ideas on what to do.

Data Versioning

A common issue faced by data scientists working with git is what to do with datasets. The most common workflow is to have a folder on the local machine which contains all the dataset(s), often in several formats (.csv, .xlsx and others). Often the data scientist would add and commit this folder, together with the code. This makes sense: after all, if we are collaborating we want to work on the same dataset, right? We would quickly run into issues here though. While it might make sense to store small files in git (for example, we could track changes to a small .csv file), bigger files in a binary format, such as images or video, become an issue. These take a huge amount of space to track, and make pulling and merging in the repository very time consuming and prone to merge conflicts.

It becomes clear that the best practice is not to add data files to a git repository. So how do we share and track them instead? Let’s look at the possible options:

Git Large File Storage

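One excellent option, Git Large File Storage (LFS), is already supported by GitHub and is very reliable to use. Once the Git LFS extension itself is installed on your machine, you set it up for your user account with the following command: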

git lfs install

After this you can track the data files, in this case image data, by using track, together with an expression that you should by now be familiar with:

git lfs track "*.png"

This will make sure that all image files are stored in the repository in the right way, without causing issues. The first time you do this, a .gitattributes file will be created, and you will need to add it to the repository, much like a .gitignore file:

git add .gitattributes

From now on you should be able to add and commit your work without worrying!

git add file.png
git commit -m "Add image data"
git push origin main

Cloud data lake

Of the options that we have, this can be the hardest to set up, but the easiest to maintain. It will require a subscription to one of the cloud providers (more on this in the module on Data Engineering), which costs money (there’s a free tier available for some experimentation). Two common services which allow you to store and version data are AWS S3 and Azure Storage.

Exercises

Git as a tool for individual contributor work. While the main purpose of using git is as a collaboration tool, it becomes useful even if you work alone on a project. It can help you save and track your work. We’ll demonstrate this with an exercise. Follow the next steps:

read through the basics of the markdown language here
create a new repository
create a README.md file
add and commit it
create a bulleted list of your favorite food (one item per commit)
check the log on GitHub to track the changes

For this next exercise we’ll be using the misk-git-playground repository we covered previously in this module. Here are several exercises you can try on it:

Start collaboratively creating a dataset, such as the one shown here (GitHub is able to show .csv datasets in a nice table format in the browser, so you can see your work):
Form small teams of 3 people. One should be writing code, the second one documenting it, and the third one reviewing it. Complete such work and merge it together.
Mimic the work of a typical data science project, as we covered in the branching subsection. Use code from other projects available (or take open-source scripts available on GitHub). Create branches for those features, using a good naming convention, and submit PRs for your colleagues to review.

You learned how to add files one after the other in git with the git add command. How can you apply what you’ve learned in the Command Line Tools section with the power of git and add files with a specific extension, let’s say .csv to git?

Find out what is the difference between a git merge and git rebase.

Try learning a new and more obscure git command, such as cherry-pick. How does it work, and what is it useful for?

For one of your projects, when you start to work on a new feature, instead of committing to master try to work as if you are part of a distributed team. Create a new branch, push to it and submit a pull request for review.

As developers we often try to reduce the amount of typing we do. There’s a way to do this for git. What is the command to replace the git commit command with something shorter, let’s say git cm?

Look through the documentation and find a way to have a less verbose output of git status.

Merge the current branch with a remote one.

Quizzes

What are three reasons for using a version control system?

Is it a good idea to store raw data files in git? Why?

What are the two commands behind git pull?

What is a pull request (PR), and why would you want to do it?

Is it useful to use git if you are working alone on your projects? If so, how?