Chapter 14 Element 7: Tidy Data with tidyr

Learning Objectives

Tidy data is defined as:

  • One observation per row,
  • One variable per column, and
  • One observational unit per data frame.

Table 14.1: The original, tidy, PlantGrowth data frame (first six rows).
weight group
4.2 ctrl
5.6 ctrl
5.2 ctrl
6.1 ctrl
4.5 ctrl
4.6 ctrl

Table 14.2: A reformatted, non-tidy, PlantGrowth data frame.
trt2 trt1 ctrl
6.3 4.8 4.2
5.1 4.2 5.6
5.5 4.4 5.2
5.5 3.6 6.1
5.4 5.9 4.5
5.3 3.8 4.6
4.9 6.0 5.2
6.2 4.9 4.5
5.8 4.3 5.3
5.3 4.7 5.1

Data should be arranged in a format that makes downstream analysis easier, instead of forcing functions to work on poorly formatted data. We already saw this at the beginning of the workshop with the PlantGrowth data set (table 14.1). This format let us run simple commands, like drawing a box plot. A more typical way to store these data would be like table 14.2, but that would make them much more difficult to work with. Can you imagine how to draw the same box plot if the data were in this format? It’s possible, but not nice!
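As a reminder of why the tidy layout is convenient, here is a minimal sketch using base R's formula interface on the built-in PlantGrowth data set. The object name bp is only illustrative.

```r
# PlantGrowth ships with base R (datasets package): one row per plant,
# with the measurement (weight) and the grouping variable (group) in columns.
head(PlantGrowth)

# Because the data are tidy, a single formula describes the whole plot:
# weight on the y axis, split by group on the x axis.
bp <- boxplot(weight ~ group, data = PlantGrowth)

# boxplot() invisibly returns the statistics it drew, one entry per group.
bp$names
```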

Some statisticians estimate that about three-quarters of their time is spent on “data munging”, that is, getting data cleaned up and tidy so that they can actually analyse it.


To understand tidy data, we’ll consider a generic example with a data frame, PlayData (see table 14.3).

Table 14.3: The PlayData data frame.
type time height width
A 1 10 50
A 2 20 60
B 1 30 70
B 2 40 80

To work with our data in R, we want all variables in their own columns. To achieve this, we will gather our data into a long, tidy form. To understand what this means, we need to realise that there are essentially two different types of variables:

ID Variables are all the possible grouping variables which were measured. These include both independent and dependent categorical variables. ID variables are used to group, i.e. identify, our measurement variables. Our ID variables are type and time.

Measurement Variables are what was measured, here that’s the height and width.

To generate tidy data we use the gather() function from the tidyr package and define our ID variables as a vector of unquoted variable names.
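A minimal sketch of this step, assuming PlayData is rebuilt by hand from table 14.3 and that the result is stored under the illustrative name PlayData_t. The ID variables are excluded from gathering with a minus sign, and the names key and value for the two new columns are chosen to match the result table that follows.

```r
library(tidyr)

# Rebuild table 14.3 by hand (assumed construction, not from the original text).
PlayData <- data.frame(
  type   = c("A", "A", "B", "B"),
  time   = c(1, 2, 1, 2),
  height = c(10, 20, 30, 40),
  width  = c(50, 60, 70, 80),
  stringsAsFactors = FALSE
)

# Gather every column except the ID variables (type, time).
# The old column headers go into `key`, the measurements into `value`.
PlayData_t <- gather(PlayData, key, value, -c(type, time))
PlayData_t
```

In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), but they remain available and behave as described here.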

type time key value
A 1 height 10
A 2 height 20
B 1 height 30
B 2 height 40
A 1 width 50
A 2 width 60
B 1 width 70
B 2 width 80

This converts the remaining column headers into a third ID variable, key, and produces the tidy data frame shown above. Now we can re-arrange our data with spread(), naming the column that supplies the new column headers and the column that supplies the values. For example, spreading key returns us to our original data frame:
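A minimal sketch of the round trip back to the wide layout, assuming spread() from tidyr is used and that the tidy data frame is typed in directly so the block runs on its own. PlayData_t and PlayData_wide are illustrative names.

```r
library(tidyr)

# Tidy form of PlayData, entered by hand (assumed construction).
PlayData_t <- data.frame(
  type  = rep(c("A", "A", "B", "B"), times = 2),
  time  = rep(c(1, 2), times = 4),
  key   = rep(c("height", "width"), each = 4),
  value = seq(10, 80, by = 10),
  stringsAsFactors = FALSE
)

# Spread: each distinct entry in `key` becomes a column, filled from `value`.
PlayData_wide <- spread(PlayData_t, key, value)
PlayData_wide
```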

type time height width
A 1 10 50
A 2 20 60
B 1 30 70
B 2 40 80

Likewise, we can spread our data so that each category of time becomes a separate variable, as in the following table.
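The same call, sketched with time as the spreading column. The data frame is again entered by hand so the block is self-contained, and by_time is an illustrative name.

```r
library(tidyr)

# Tidy form of PlayData, entered by hand (assumed construction).
PlayData_t <- data.frame(
  type  = rep(c("A", "A", "B", "B"), times = 2),
  time  = rep(c(1, 2), times = 4),
  key   = rep(c("height", "width"), each = 4),
  value = seq(10, 80, by = 10),
  stringsAsFactors = FALSE
)

# Each distinct time point (1, 2) becomes its own column of values.
by_time <- spread(PlayData_t, time, value)
by_time

# The new column names are not syntactic, so they need backticks.
by_time$`1`
```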

type key 1 2
A height 10 20
A width 50 60
B height 30 40
B width 70 80

Or so that each category of type becomes a separate variable:
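A sketch of this variant, with type as the spreading column. As before, the tidy data frame is typed in by hand and by_type is an illustrative name.

```r
library(tidyr)

# Tidy form of PlayData, entered by hand (assumed construction).
PlayData_t <- data.frame(
  type  = rep(c("A", "A", "B", "B"), times = 2),
  time  = rep(c(1, 2), times = 4),
  key   = rep(c("height", "width"), each = 4),
  value = seq(10, 80, by = 10),
  stringsAsFactors = FALSE
)

# Each distinct type (A, B) becomes its own column of values.
by_type <- spread(PlayData_t, type, value)
by_type
```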

time key A B
1 height 10 30
1 width 50 70
2 height 20 40
2 width 60 80

These three transformations are straightforward when our starting point is tidy data! But the real power of tidy data becomes apparent when we group our data according to a factor variable. This allows us to apply not only transformation functions, but aggregation functions as well. For this we turn to the dplyr package.
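As a small preview of grouping and aggregation, here is a hedged sketch that assumes dplyr's group_by() and summarise() are the functions meant. The grouping choice (type and key), the summary column avg and the name PlayData_means are illustrative, not taken from the text.

```r
library(tidyr)
library(dplyr)

# Rebuild table 14.3 by hand (assumed construction) and make it tidy.
PlayData <- data.frame(
  type   = c("A", "A", "B", "B"),
  time   = c(1, 2, 1, 2),
  height = c(10, 20, 30, 40),
  width  = c(50, 60, 70, 80),
  stringsAsFactors = FALSE
)
PlayData_t <- gather(PlayData, key, value, -c(type, time))

# Group by two ID variables, then collapse each group to a single number:
# the mean of each measurement for each type, averaged over time.
PlayData_means <- PlayData_t %>%
  group_by(type, key) %>%
  summarise(avg = mean(value))
PlayData_means
```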