Chapter 2 Introduction

2.1 Learning Objectives

The practice of data science is dominated by two major languages: R and Python. They are non-redundant, and both offer valuable tools. In this course we will introduce you to both languages and, where there is a clear best choice, discuss which one to use. We’ll begin with R and move on to Python in the next section.

Learning a programming language is like learning any new spoken language. Here, our goal is to:

  • Develop a solid understanding of the basic grammar and vocabulary, and
  • Develop your skills in a goal-oriented way, i.e. understand the question your analysis needs to answer.

Before starting, it is worth considering what we mean by data analysis.

2.2 The Text Editor & R Syntax

RStudio’s built-in text editor offers a flexible and convenient way of managing your projects. This project contains many R script files (*.R), and you can open a new R script from the top menu. Make sure that the lower right corner of the text editor states that it is an R script. After you enter any commands, you can execute them from within the text editor using the keyboard shortcut ctrl + Enter (or command + Enter on a Mac).1

For single and multi-line commands, you don’t need to highlight the command or even have the cursor at the end of the line. So don’t waste time fiddling with your mouse! The entire command will be executed, regardless of where the cursor is positioned. If you do highlight only a portion of a long command, only that segment will be executed. That’s pretty useful for finding errors in long commands.

R input and output are presented in a mono-spaced font, as shown below.

Exercise 2.1 (R Syntax) Can you guess what the following commands will do? Type them into your text editor and execute them to find out.
# R syntax
n <- log2(8) #The logarithm of 8 to base 2
n

The above command computed \(\log_{2}(8)\) and assigned the result to the object n, which it created on-the-fly. The result is associated with an index position. In this case, there is only one answer (3), and it is at position 1.

The following table will help you to make sense of the above example.

Table 2.1: R’s syntax.
Notation              Description
>                     The R command prompt.
n                     An object name.
<-                    The assign operator.
log2()                The function to be called. Its result will be assigned to n.
8                     The function’s argument.
#                     The beginning of the comment.
The logarithm of ...  The comment.
[]                    Refers to the index of the output.
1                     The index number.
3                     The value at position [1].
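
To see where the index notation in Table 2.1 comes from, here is a minimal sketch that prints n and then a slightly longer result. The object m and its input values are made up for illustration.

```r
# Compute log2(8), store it in n, then print n.
# The console prefixes the output with [1], the index of the first value shown.
n <- log2(8)
n

# A result with several values has several index positions.
# m is a hypothetical object used only for this illustration.
m <- log2(c(2, 4, 8, 16))
m

# Square brackets retrieve the value at a given position.
m[3]
```

Here n holds a single value at position 1, while m holds four values, and m[3] picks out the third.
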
Exercise 2.2 (Incomplete commands) So far, so good. But what would happen if you executed the following command? Do you think it will result in an error?
n <- log2(8

The above command doesn’t result in an error because it’s not an incorrect command, it’s an incomplete command. If an incomplete command is executed, the > prompt in the console will turn into a + sign to indicate that further information is needed. In this case, it’s best to click in the console and press the Esc key, correct the command and then re-execute it. Make sure you remove the above incomplete command from your script or you’ll run into problems eventually.2

2.3 The Assign Operator: Creating Objects

We’ve already encountered our first object (n), operator (<-), function (log2()) and argument (8). We’ll keep returning to these concepts throughout the workshop.

We used the assign operator, <-, to assign a value to an object. You can enter <- using the keyboard shortcut Alt + - (or Option + - on a Mac).

Although it is possible, and you will inevitably encounter it, I recommend never using = as the assign operator. It can lead to confusion, especially for beginners, and is therefore typically considered bad style.3

Just for convenience, I’ll use some very simple, one-letter names for objects. However, in practice you’ll find it very useful to name objects in a meaningful way. Do not use names of pre-existing functions, like data or subset or plot!

Assigning a result to an object does not typically produce any output. If we want to see the contents of an object, we can look in RStudio’s environment pane, or, as we did above, just execute the name of the object, n, in the console. When we execute an object name in the console, it’s actually a shortcut for:

print(n)

R looks at the class of the object and decides how best to print it to the screen. We’ll return to classes in a later section.
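
The equivalence between executing a name and calling print() can be checked directly. The sketch below also shows class(), which reports the class that print() relies on, and a common idiom (wrapping an assignment in parentheses) for assigning and printing in one step.

```r
n <- log2(8)    # assignment alone prints nothing

n               # executing the name ...
print(n)        # ... gives the same output as calling print() explicitly

class(n)        # the class R uses to decide how to print n

(n <- log2(8))  # parentheses around an assignment also print the result
```
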

2.4 R Help Pages: Functions

It’s worthwhile familiarizing yourself with R’s help pages. To get the most out of the help pages, you will need to understand what functions and packages are. Functions will be discussed in detail in section ??. However, the help pages will make more sense when you know the following simple definitions.

  • Functions are commands that take specific arguments. In this text, functions are written in mono-spaced font followed by parentheses: name().

  • Packages are collections of functions.

The easiest way to get help on a function is to type its name into the search box in the help pages. Some other useful commands are:

Table 2.2: Accessing R’s help pages for specific functions and topics.
Command                          Outcome
help(topic), ?topic() or ?topic  Opens the help page for the topic() function.
help.search("topic") or ??topic  Searches all help pages for the word “topic.” This is useful if you’re not sure what function to use.
example(topic)                   Executes all the commands contained in the examples sub-section of the topic() help page.

For example, the command ?log opens a help page titled Logarithms and Exponentials, with details on how to use the log() function.4
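
As a sketch of how the commands in Table 2.2 look in practice, here they are applied to the median() function and the search term “logarithm”, both chosen purely for illustration.

```r
# Open the help page for a function you already know by name
help(median)
?median

# Search all installed help pages when you don't know the function name
help.search("logarithm")
??logarithm

# Run the examples from the bottom of the median() help page
example(median)
```
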

Exercise 2.3 (R Syntax) Using the commands in Table 2.2, search for the R function that calculates the standard deviation. Notice that when you run the example, the command prompt changes.

Familiarize yourself with the different sections of the help page for the function that calculates the standard deviation. Use Table 2.2 to find the command to run the example at the bottom of the help page directly in the R console. You should not have to type or copy the example.

Table 2.3: The anatomy of a help page in R. Other sections may be added by the help page’s author. They are typically self-explanatory.
Section      Description
Description  A description of what the function does in simple language.
Usage        An example of how the function is used, showing all its essential and optional arguments.
Arguments    A description of the arguments available, plus the specific data types and structures they accept.
Details      Points of caution and interest to watch out for.
Value        The output returned by the function.
References   Publications which describe the function. Particularly useful for modern methods in statistics.
See Also     Recommendations for similar functions that may be more appropriate for the task at hand.
Examples     Reproducible example code that demonstrates how to use the function.

For optional arguments, the default values are given in the Usage section. If no default value is given, the argument is essential and must be provided when the function is called.
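
The log() function is a convenient illustration. A minimal sketch: args() displays a function’s arguments together with any default values, so you can see that x has no default while base does.

```r
# Show the arguments of log(): x has no default, base defaults to exp(1)
args(log)

# x is essential, so it must always be supplied
log(8)            # natural logarithm, using the default base

# base is optional; supplying it overrides the default
log(8, base = 2)  # same result as log2(8)
```
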

2.5 Packages: Extending R’s Functionality

The core R installation is called the base package. This is actually a collection of several packages, each of which has a collection of functions. For example, in your initial installation you’ll have the utils (i.e. utilities), graphics (for basic plotting), and stats (for basic statistics) packages. In addition to this, there are over 12,500 packages in the official repository, CRAN (the Comprehensive R Archive Network), and over 1,500 bioinformatics-focused packages in the Bioconductor repository. On top of all that, there are thousands more published on GitHub, an online repository for code of all varieties, and many more in-house packages not released to the public. All these packages act as functional add-ons to the base package, kind of like extensions in your web browser. You can install packages directly from CRAN using RStudio’s packages pane. Bioconductor and GitHub packages are beyond the scope of this workshop. Table 2.4 lists the most common functions for working with packages.

Table 2.4: Help using packages. Replace name with the specific package name you are interested in.
Command                   Outcome
install.packages("name")  Installs package name.
library(name) or          Initializes package name.
library()                 Lists all installed packages.
library(help = name)      Provides details, including all functions, contained in name.
search()                  Lists the search path, i.e. the packages currently attached.

Always install all dependencies when you install a package. Installation can also be done through the R console menu. A package only needs to be installed once, but must be initialized every time you start a new R session. After initialization, you will have access to all the functions within that package. You can access a specific function within a package, without explicitly loading the package, by using the double colon operator, ::.
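
A short sketch of this workflow, using a package that ships with R so that it runs without installing anything. The install.packages() line is commented out, and "name" is a placeholder rather than a real package.

```r
# Install once (requires an internet connection); "name" is a placeholder
# install.packages("name")

# Initialize a package for the current session
library(stats)

# List everything currently on the search path
search()

# Call a function without loading its package first, using ::
stats::median(c(1, 3, 10))
```

The stats package is already initialized by default in a normal session, so library(stats) is only there to show the form of the command.
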

We’ll use the tidyverse package throughout this workshop. It’s actually a suite of packages. In a sense, most packages can be thought of as collections of packages, because they depend on other packages. When you install one, you’ll also be installing all the other packages that it depends on.

Exercise 2.4 (Install the tidyverse packages) Install the tidyverse package by using the packages panel in the lower-right pane. Make sure you are installing from the CRAN repository and that the “Install dependencies” option is selected.
Table 2.5: Packages in the tidyverse.
Package    Uses
dplyr      A grammar of data manipulation.
readr      Simplifies the import of rectangular data (txt, csv).
tibble     A modern version of the data frame.
stringr    Simplifies working with strings (characters).
tidyr      Creates tidy data.
forcats    Simplifies working with categorical data.
purrr      A toolkit for iterating functions over elements of various data types.
lubridate  Simplifies working with dates and times.
ggplot2    Creates elegant data visualizations.

Exercise 2.5 (R Syntax) Now that we’ve installed the suite of packages, we can load it.

You only have to install the package once, but it must be initialized each time we start a new R session. Remember that we typed the following command at the top of our script?
library(tidyverse)
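
If you are unsure whether the installation in Exercise 2.4 succeeded, one way to check is sketched below. requireNamespace() returns TRUE or FALSE instead of stopping with an error. The object name has_tidyverse is made up for this example.

```r
# TRUE if the tidyverse package is installed, FALSE otherwise
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  library(tidyverse)  # initialize it for this session
} else {
  message("tidyverse is not installed yet; see Exercise 2.4.")
}
```
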

  1. You can also use ctrl + R (or command + R on a Mac), or use the icons to run the line of code where the cursor is positioned.

  2. Use ↑ and ↓ at the command prompt to scroll through the history of entered commands.

  3. There is a third option, ->, which we’ll encounter in section @ref(dplyr_arrange)

  4. Replace topic with the specific function or search term you need help with.