Chapter 14 Element 7: Tidy Data with tidyr

Learning Objectives

Tidy data is defined as:

One observation per row,
One variable per column, and
One observational unit per data frame.

Table 14.1: The original, tidy, PlantGrowth data frame (first six rows shown).
weight	group
4.2	ctrl
5.6	ctrl
5.2	ctrl
6.1	ctrl
4.5	ctrl
4.6	ctrl

Table 14.2: A reformatted, non-tidy, PlantGrowth data frame.
trt2	trt1	ctrl
6.3	4.8	4.2
5.1	4.2	5.6
5.5	4.4	5.2
5.5	3.6	6.1
5.4	5.9	4.5
5.3	3.8	4.6
4.9	6.0	5.2
6.2	4.9	4.5
5.8	4.3	5.3
5.3	4.7	5.1

Data should be arranged in a format that makes downstream analysis easier, rather than forcing functions to work on poorly formatted data. We already saw this at the beginning of the workshop with the PlantGrowth data set (table 14.1), whose format let us run simple commands such as drawing a box plot. A more typical way to store these data is shown in table 14.2, but that layout is much more difficult to work with. Can you imagine how to draw the same box plot if the data were in this format? It’s possible, but not nice!
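
As a preview of the tools introduced later in this chapter, the wide layout of table 14.2 can be brought back to the tidy layout of table 14.1 before plotting. The following is a minimal sketch, assuming the tidyr package is installed. It rebuilds the wide table from R's built-in PlantGrowth data set, and the names PlantWide and PlantTidy are made up for this illustration.

```r
library(tidyr)

# Rebuild the wide layout of table 14.2 from the built-in PlantGrowth data.
# PlantWide is an illustrative name, not part of the workshop material.
PlantWide <- data.frame(
  trt2 = PlantGrowth$weight[PlantGrowth$group == "trt2"],
  trt1 = PlantGrowth$weight[PlantGrowth$group == "trt1"],
  ctrl = PlantGrowth$weight[PlantGrowth$group == "ctrl"]
)

# Gather all three columns: the old column names go into group,
# the measurements go into weight.
PlantTidy <- gather(PlantWide, group, weight)

# With one variable per column, the formula interface works again.
boxplot(weight ~ group, data = PlantTidy)
```

The tidy version has 30 rows, one per plant, so the plotting call is the same as for the original PlantGrowth data frame.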

Some statisticians estimate that about three-quarters of their time is spent on “data munging”, that is, getting data cleaned up and tidy so that they can actually analyse it.

14 Tidy Data with the `tidyr` Package

To understand tidy data, we’ll consider a generic example with a data frame, PlayData (see table 14.3).

# Create a new play dataset to work on:
PlayData <- data.frame(type = rep(c("A", "B"), each = 2),
                       time = 1:2,
                       height = seq(10, 40, 10),
                       width = seq(50, 80, 10))

Table 14.3: The PlayData data frame.
type	time	height	width
A	1	10	50
A	2	20	60
B	1	30	70
B	2	40	80

To work with our data in R, we want all variables in their own columns. To achieve this, we will gather our data into a long, tidy form. To understand what this means, we need to realise that there are essentially two different types of variables:

ID Variables are all the possible grouping variables which were measured. These include both independent and dependent categorical variables. ID variables are used to group, i.e. identify, our measurement variables. Our ID variables are type and time.

Measurement Variables are what was measured, here that’s the height and width.
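
The split between the two kinds of variables can be written down explicitly. This is only an illustrative sketch in base R: the helper names id_vars and measure_vars are assumptions introduced here, and PlayData is recreated so the snippet runs on its own.

```r
# Same PlayData as defined above
PlayData <- data.frame(type = rep(c("A", "B"), each = 2),
                       time = 1:2,
                       height = seq(10, 40, 10),
                       width = seq(50, 80, 10))

# ID variables: the columns that identify each measurement
id_vars <- c("type", "time")

# Measurement variables: every remaining column
measure_vars <- setdiff(names(PlayData), id_vars)
measure_vars
```

Here measure_vars holds the column names height and width, which are exactly the columns that end up gathered in the long format.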

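To generate tidy data we use the gather() function from the tidyr package. We first name the two new columns, key and value, and then say which columns to gather. Here we do that by excluding our ID variables, given as a vector of unquoted variable names preceded by a minus sign.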

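library(tidyr) # provides gather() and spread()
# Gather all columns except the ID variables type and time:
PlayData %>%
  gather(key, value, -c(type, time)) -> PlayData.t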

type	time	key	value
A	1	height	10
A	2	height	20
B	1	height	30
B	2	height	40
A	1	width	50
A	2	width	60
B	1	width	70
B	2	width	80

This converts the remaining column headers into a third ID variable, key, stores their values in value, and produces the tidy data frame shown above. Now we can re-arrange our data by spreading a key column back out into separate columns with spread(). For example, to return to our original data frame, we can use:

PlayData.t %>%
  spread(key, value)

type	time	height	width
A	1	10	50
A	2	20	60
B	1	30	70
B	2	40	80

Likewise, we can spread our data so that each category of time is now a separate variable, as shown in the output below.

PlayData.t %>%
  spread(time, value)

type	key	1	2
A	height	10	20
A	width	50	60
B	height	30	40
B	width	70	80

Or so that each category of type is a separate variable:

PlayData.t %>%
  spread(type, value)

time	key	A	B
1	height	10	30
1	width	50	70
2	height	20	40
2	width	60	80
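
In tidyr 1.0.0 and later, gather() and spread() still work but are marked as superseded in favour of pivot_longer() and pivot_wider(). The sketch below repeats the reshaping steps above with the newer functions. The object names PlayLong, PlayWide and PlayByTime are made up for this illustration, and PlayData is recreated so the snippet runs on its own.

```r
library(tidyr)

# Same PlayData as defined above
PlayData <- data.frame(type = rep(c("A", "B"), each = 2),
                       time = 1:2,
                       height = seq(10, 40, 10),
                       width = seq(50, 80, 10))

# pivot_longer() plays the role of gather(): the measurement columns
# are named directly instead of excluding the ID variables.
PlayLong <- pivot_longer(PlayData,
                         cols = c(height, width),
                         names_to = "key",
                         values_to = "value")

# pivot_wider() plays the role of spread().
# Back to the original layout, one column per measurement variable:
PlayWide <- pivot_wider(PlayLong, names_from = key, values_from = value)

# One column per category of time:
PlayByTime <- pivot_wider(PlayLong, names_from = time, values_from = value)
```

Both functions return tibbles rather than plain data frames, and pivot_longer() orders the rows differently from gather(), so the printed output will not look identical to the tables above even though the content is the same.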

The three transformation function scenarios are straightforward if our starting point is tidy data! But the power of tidy data becomes apparent when grouping our data according to a factor variable. This allows us to apply not only transformation functions, but aggregation functions as well. For this we turn to the dplyr package.
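
To make the difference between the two kinds of function concrete, here is a minimal sketch of grouped operations on the tidy PlayData.t, assuming the dplyr package is installed. The names PlayMeans, PlayScaled, avg and rel are chosen only for this illustration.

```r
library(tidyr)
library(dplyr)

# Recreate the tidy data frame from above
PlayData <- data.frame(type = rep(c("A", "B"), each = 2),
                       time = 1:2,
                       height = seq(10, 40, 10),
                       width = seq(50, 80, 10))
PlayData.t <- gather(PlayData, key, value, -c(type, time))

# Aggregation: each group collapses to a single row holding its mean
PlayMeans <- PlayData.t %>%
  group_by(type, key) %>%
  summarise(avg = mean(value)) %>%
  ungroup()

# Transformation: every row is kept, with each value expressed
# relative to the mean of its own group
PlayScaled <- PlayData.t %>%
  group_by(type, key) %>%
  mutate(rel = value / mean(value)) %>%
  ungroup()
```

PlayMeans has one row per combination of type and key (four rows), whereas PlayScaled keeps all eight rows of PlayData.t and gains one extra column.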
