Chapter 3 Importing Data

Now that we’re familiar with some of the basic concepts of functional and object-oriented programming, let’s work on some real data!

3.1 Functions & Packages for Importing Data

It is likely that most of your data is stored in an Excel file or a tab-delimited table, like the text files provided for the workshop. Use these base package functions to import your data and assign the result to a variable.

Table 3.1: Base package functions for importing data. These functions will handle most of your general input/output needs.
Function Description
read.table(file) Read file in table format into data frame.
read.csv(file) Read comma-separated data.
read.csv2(file) Read semicolon-separated data where the decimal mark is a comma.
read.delim(file) Read tab-delimited data into data frame.
read.delim2(file) Read tab-delimited data where the decimal mark is a comma.
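As a minimal sketch of how the functions in Table 3.1 relate to each other, the example below writes a small tab-delimited file to a temporary location and reads it back in two ways. The file contents and the object names (tmp, example_df, example_df2) are made up for illustration and are not part of the workshop data.

```r
# Create a small tab-delimited file in a temporary location (made-up data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Site\tHeight", "A\t1.5", "B\t2.25"), tmp)

# read.delim() assumes a header row and tab separators by default.
example_df <- read.delim(tmp)

# The general read.table() needs those settings stated explicitly.
example_df2 <- read.table(tmp, header = TRUE, sep = "\t")

# Both calls give a data frame with two rows and two columns.
dim(example_df)
dim(example_df2)
```

read.csv(), read.csv2(), read.delim() and read.delim2() are all convenience wrappers around read.table() with different defaults for the separator and the decimal mark.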

3.1.0.1 stringsAsFactors

This subsection explains some important details. I will probably be more detailed than you need right now, but it’s worth taking an aside to mention it here. Feel free to return once you feel more comfortable with R.

The most relevant detail to consider here will come up again when we discuss data frames in more detail in the Objects chapter. In April 2020, R v4.0 was released, and with it came an oft-requested change to the default value of an argument found in these base package import functions. The argument is stringsAsFactors, which defaults to FALSE in R v4.0 and above, but to TRUE in older versions of R. If you use these functions in older versions of R, columns that contain characters will be treated as categorical variables, i.e. they will be stored as factor class variables and not as straightforward character type variables. We’ll discuss exactly what we mean by class, type and character in the Objects chapter, and a whole chapter is dedicated to Factors. At the moment, it is unlikely to cause concern since you have a new release of R and you’re not working on scripts that someone else authored using an older version of R.

This behavior existed for a good, but outdated, reason and became a long-standing complaint in the R community. Although the community wanted this default changed, the change does create some confusion, since you may inherit old R scripts from colleagues who were aware of the old behavior and relied on factors in their scripts. If you run those scripts using an upgraded version of R, you may encounter problems.
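To make the difference concrete, here is a small sketch that imports the same made-up data twice, once with each setting of stringsAsFactors. The data and the object names (csv_text, old_style, new_style) are invented for illustration. The text argument lets read.csv() read from a character string instead of a file.

```r
# Made-up comma-separated data held in a character string.
csv_text <- "name,weight\nalpha,1.2\nbeta,3.4"

# Mimic the default behavior before R v4.0: character columns become factors.
old_style <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Mimic the default behavior from R v4.0 onward: character columns stay character.
new_style <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(old_style$name)  # "factor"
class(new_style$name)  # "character"
```

Setting stringsAsFactors explicitly, as done here, is one way to make a script behave the same regardless of the R version it runs on.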

An alternative to the base package functions given above, the readr package, part of the tidyverse, can make things easier and faster in some cases. In particular, it doesn’t convert characters to factors when importing, so older and newer scripts will work in the same way.

Table 3.2: readr package functions for importing data.
Function Description
read_csv() Read comma-separated data.
read_csv2() Read semicolon-separated data.
read_tsv() Read tab-delimited data.
read_delim() Read data with any delimiter.
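The following sketch shows two of the functions in Table 3.2 reading the same file. It assumes the readr package is installed (version 2.0 or later for the show_col_types argument, which only silences the column summary message). The temporary file and the object names are made up for illustration.

```r
library(readr)

# Create a small tab-delimited file in a temporary location (made-up data).
tmp <- tempfile(fileext = ".tsv")
writeLines(c("name\tweight", "alpha\t1.2", "beta\t3.4"), tmp)

# read_tsv() is the tab-delimited shortcut.
example_tbl <- read_tsv(tmp, show_col_types = FALSE)

# read_delim() does the same once the delimiter is stated.
example_tbl2 <- read_delim(tmp, delim = "\t", show_col_types = FALSE)

# Character columns stay character, whatever the R version.
class(example_tbl$name)
```

Note that readr functions return a tibble, a variant of the data frame, rather than a plain data frame.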

The rio package (R Input/Output) offers the import() function for generic import of many different file formats. This is a good starting point if you are going to read in an Excel file. Nonetheless, I would encourage you not to begin working in R using Excel files. In my experience, the Excel files that students like to use are often contaminated with summary statistics, multiple tables, plots, colors & lines that encode information, multiple worksheets and inconsistent use of white space. All of these things can be dealt with in R, but they make your life more difficult than it needs to be. Learning R is already a lot of work, so make your life easier by starting with a dataset that is going to be easy to read into R and has not been modified after it was generated.

Exercise 3.1 (Import and Examine) If you set up a new project in RStudio Desktop, then you can move your data files into the data folder on your computer and they will also be in your working directory.

Import the file called martians.txt, found inside the data folder, and save it as an object called martians. See the notes below if you’re having a hard time.

A very common point of confusion is what the argument should be for any of the import functions given above. Many students just do something like:

read_tsv(martians.txt)

This will produce an error saying that the object martians.txt is not found. That makes perfect sense because you gave an argument pointing to some object in your environment, not to a file on your computer (or in this case the cloud). So you need to give the name of the file in quotes (""), and more than that, you need to specify which sub-directory it’s in. Try "data/martians.txt".
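The difference between an object name and a file path can be sketched with a throwaway project folder. Everything here is hypothetical: the temporary folder, the file example.txt and the object names are invented for illustration, and base read.delim() stands in for read_tsv() so that no extra package is needed. The same quoting rule applies to every import function above.

```r
# Build a throwaway project folder containing a "data" sub-directory.
project_dir <- tempfile("project")
dir.create(file.path(project_dir, "data"), recursive = TRUE)
writeLines(c("id\tvalue", "1\t10", "2\t20"),
           file.path(project_dir, "data", "example.txt"))

# Relative paths are resolved from the working directory.
old_wd <- setwd(project_dir)

# Check that the quoted path points at a real file before importing.
file.exists("data/example.txt")

# A quoted path is a character string naming a file: this works.
example <- read.delim("data/example.txt")

# An unquoted name is looked up as an object in the environment: this fails.
result <- try(read.delim(example.txt), silent = TRUE)
inherits(result, "try-error")

# Restore the original working directory.
setwd(old_wd)
```

If file.exists() returns FALSE for your own file, getwd() shows which folder R is treating as the working directory.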

Exercise 3.2 (Examine structure) What type of data is contained in each column? Use some of the functions we introduced in the dataframes section to explore the basic structure of our new object.
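As a hint, here is a sketch of the kind of base functions that are commonly used to inspect a data frame, applied to a small made-up example rather than to the martians data. Whether these are the same functions introduced in the dataframes section is an assumption.

```r
# A small made-up data frame with three different column types.
example_df <- data.frame(site = c("A", "B", "C"),
                         count = c(4L, 7L, 1L),
                         height = c(1.5, 2.25, 3),
                         stringsAsFactors = FALSE)

# Compact overview: dimensions, column names, types and first values.
str(example_df)

# The class of each column, returned as a named character vector.
sapply(example_df, class)

# Number of rows and columns.
dim(example_df)
```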

Let’s begin exploring our data by looking at some basic plots and doing some transformations.

3.2 Saving a data frame to a file

There are several ways to save an R object outside of the environment. If you have a data frame, the most straightforward way is to use the rio package:

Function Description
export(df, file) Save the data frame df as a file.

This will choose the file format according to the file extension. For example, "myData.txt" will result in a tab-separated file.
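A minimal sketch of export() in use, assuming the rio package is installed. The data frame, the file name and the temporary folder are made up for illustration. A .csv extension is used here, so the result is a comma-separated file.

```r
library(rio)

# A small made-up data frame to save.
example_df <- data.frame(name = c("alpha", "beta"),
                         weight = c(1.2, 3.4))

# The .csv extension tells export() to write a comma-separated file.
out_file <- file.path(tempdir(), "myData.csv")
export(example_df, out_file)

# import() uses the same extension to decide how to read the file back.
read_back <- import(out_file)
dim(read_back)
```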

Alternatively, you’ll also see base package versions:

Function Description
write.table(df, file) Save the data frame df as a file.
write.csv(df, file) Save the data frame df as a comma-separated file.
write.csv2(df, file) Save the data frame df as a semicolon-separated file.

Try using the form write.table(df, "data.txt", sep = "\t", row.names = FALSE) to produce tab-delimited files. You can see that export() takes care of a lot of the settings for you, just by giving a proper extension to your file.
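The write.table() form can be tried out on a small made-up data frame, as sketched below. The object names and the temporary file are invented for illustration. The extra quote = FALSE argument is an addition to the form given in the text. It stops write.table() from wrapping character values and column names in double quotes.

```r
# A small made-up data frame to save.
example_df <- data.frame(name = c("alpha", "beta"),
                         weight = c(1.2, 3.4))

out_file <- file.path(tempdir(), "example_data.txt")

# Tab-delimited output, without row names and without quoted strings.
write.table(example_df, out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Inspect the raw lines of the file that was written.
readLines(out_file)

# Read the file back in to confirm the round trip.
read_back <- read.delim(out_file)
```

Leaving row.names at its default of TRUE would add an unnamed first column of row numbers to the file, which is rarely what you want when sharing data.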