Chapter 14 Element 7: Tidy Data with tidyr
Learning Objectives
Tidy data is defined as:
- One observation per row,
- One variable per column, and
- One observational unit per data frame.
Table 14.1: PlantGrowth in tidy format (first six rows).

weight | group |
---|---|
4.2 | ctrl |
5.6 | ctrl |
5.2 | ctrl |
6.1 | ctrl |
4.5 | ctrl |
4.6 | ctrl |

Table 14.2: PlantGrowth in wide format, one column per group.

trt2 | trt1 | ctrl |
---|---|---|
6.3 | 4.8 | 4.2 |
5.1 | 4.2 | 5.6 |
5.5 | 4.4 | 5.2 |
5.5 | 3.6 | 6.1 |
5.4 | 5.9 | 4.5 |
5.3 | 3.8 | 4.6 |
4.9 | 6.0 | 5.2 |
6.2 | 4.9 | 4.5 |
5.8 | 4.3 | 5.3 |
5.3 | 4.7 | 5.1 |
Data should be arranged in a format that makes downstream analysis easier, instead of forcing functions to work on poorly formatted data. We already saw this at the beginning of the workshop with the PlantGrowth data set (table 14.1). That format let us run simple commands, such as drawing a box plot, directly. A more typical way to store these data would be like table 14.2, but that would make them much more difficult to work with. Can you imagine how to draw the same box plot if the data were in this format? It’s possible, but not nice!
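As a minimal sketch of the two layouts, the snippet below draws the box plot from the tidy data and then builds a wide version for comparison. The object name PlantGrowth.wide and the use of base R's unstack() are assumptions made for illustration; they are not taken from the workshop material.

```r
# PlantGrowth ships with R in the datasets package: 30 plants, one per row.
head(PlantGrowth)

# Tidy format (table 14.1): weight and group each have their own column,
# so the formula interface can map them to the plot directly.
boxplot(weight ~ group, data = PlantGrowth)

# Wide format (table 14.2): one column per group.
# PlantGrowth.wide is a hypothetical name used only in this sketch.
PlantGrowth.wide <- unstack(PlantGrowth)
head(PlantGrowth.wide)
```

In the wide version the group labels live in the column headers instead of in a variable, so they can no longer be referred to by name in a formula.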
Some statisticians estimate that about three-quarters of their time is spent on “data munging”, that is, getting data cleaned up and tidy so that they can actually analyse it.
To understand tidy data, we’ll consider a generic example with a data frame, PlayData (see figure 14.3).
# Create a new play dataset to work on:
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))
type | time | height | width |
---|---|---|---|
A | 1 | 10 | 50 |
A | 2 | 20 | 60 |
B | 1 | 30 | 70 |
B | 2 | 40 | 80 |
To work with our data in R, we want all variables in their own columns. To achieve this, we will gather our data into a long, tidy form. To understand what this means, we need to realise that there are essentially two different types of variables:
- ID variables are all the possible grouping variables which were measured. These include both independent and dependent categorical variables. ID variables are used to group, i.e. identify, our measurement variables. Our ID variables are type and time.
- Measurement variables are what was measured; here that’s height and width.
To generate tidy data, we use the gather() function from the tidyr package and define our ID variables as a vector of unquoted variable names.
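The call itself might look like the sketch below. The result name PlayData.t is an assumption, as is the choice to mark the ID variables with a minus sign so that every other column is gathered. Note that in recent tidyr releases gather() is superseded by pivot_longer(), although it still works.

```r
library(tidyr)

# Re-create the play dataset so this sketch runs on its own.
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))

# Gather every column except the ID variables (type, time).
# "key" receives the old column headers, "value" the measurements.
# PlayData.t is a hypothetical name used only in this sketch.
PlayData.t <- gather(PlayData, key, value, -c(type, time))
PlayData.t
```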
type | time | key | value |
---|---|---|---|
A | 1 | height | 10 |
A | 2 | height | 20 |
B | 1 | height | 30 |
B | 2 | height | 40 |
A | 1 | width | 50 |
A | 2 | width | 60 |
B | 1 | width | 70 |
B | 2 | width | 80 |
This converts the remaining column headers into a third ID variable, key, and produces the tidy data frame shown above. Now we can re-arrange our data with spread(), naming the column that supplies the new column headers and the column that supplies the values. For example, to return to our original data frame, we can use:
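A sketch of that round trip, assuming the tidy data frame is stored under the hypothetical name PlayData.t and was built with gather() as described earlier:

```r
library(tidyr)

# Re-create the play dataset and its tidy form so this sketch runs on its own.
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))
PlayData.t <- gather(PlayData, key, value, -c(type, time))

# Spread the key column back out: its categories become column headers
# and the value column fills the cells.
PlayData.orig <- spread(PlayData.t, key, value)
PlayData.orig
```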
type | time | height | width |
---|---|---|---|
A | 1 | 10 | 50 |
A | 2 | 20 | 60 |
B | 1 | 30 | 70 |
B | 2 | 40 | 80 |
Likewise, we can spread our data so that each category of time is now a separate variable, as in the table below.
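One way to do this, again assuming the hypothetical tidy data frame PlayData.t, is to name time as the column to spread:

```r
library(tidyr)

# Re-create the play dataset and its tidy form so this sketch runs on its own.
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))
PlayData.t <- gather(PlayData, key, value, -c(type, time))

# Each category of time (1 and 2) becomes its own column.
PlayData.time <- spread(PlayData.t, time, value)
PlayData.time
```

The new columns are named 1 and 2, so they need backticks or double brackets when referenced, for example PlayData.time[["1"]].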
type | key | 1 | 2 |
---|---|---|---|
A | height | 10 | 20 |
A | width | 50 | 60 |
B | height | 30 | 40 |
B | width | 70 | 80 |
Or so that each category of type becomes a separate variable:
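The same pattern applies here, with type named as the column to spread. As before, PlayData.t and PlayData.type are hypothetical names for this sketch.

```r
library(tidyr)

# Re-create the play dataset and its tidy form so this sketch runs on its own.
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))
PlayData.t <- gather(PlayData, key, value, -c(type, time))

# Each category of type (A and B) becomes its own column.
PlayData.type <- spread(PlayData.t, type, value)
PlayData.type
```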
time | key | A | B |
---|---|---|---|
1 | height | 10 | 30 |
1 | width | 50 | 70 |
2 | height | 20 | 40 |
2 | width | 60 | 80 |
The three transformation scenarios above are straightforward if our starting point is tidy data! But the power of tidy data becomes apparent when we group our data according to a factor variable. This allows us to apply not only transformation functions, but aggregation functions as well. For this we turn to the dplyr package.
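As a brief preview, and only as a sketch, grouping the tidy play data and aggregating within each group could look like this. The choice of mean() and the names PlayData.t, PlayData.avg and avg are assumptions made for illustration.

```r
library(tidyr)
library(dplyr)

# Re-create the play dataset and its tidy form so this sketch runs on its own.
PlayData <- data.frame(type   = rep(c("A", "B"), each = 2),
                       time   = 1:2,
                       height = seq(10, 40, 10),
                       width  = seq(50, 80, 10))
PlayData.t <- gather(PlayData, key, value, -c(type, time))

# Group by two ID variables, then aggregate the values within each group.
PlayData.avg <- PlayData.t %>%
  group_by(type, key) %>%
  summarise(avg = mean(value))
PlayData.avg
```

Because key is an ordinary column in the tidy data, height and width are summarised in a single call instead of one call per measurement.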