2 Tidy Data with tidyr

2.1 Learning Objectives

Recall that tidy data is defined as a data frame that has:

  • One observation per row,
  • One variable per column, and
  • One observational unit per data frame.

Consider the following tables.

First, a long, tidy data frame:

Table 2.1: The long, tidy data frame.
weight group
4.17 ctrl
5.58 ctrl
5.18 ctrl
6.11 ctrl
4.50 ctrl
4.61 ctrl

Second, a wide, messy data frame:

Table 2.2: A wide, messy data frame.
ctrl trt1 trt2
4.17 4.81 6.31
5.58 4.17 5.12
5.18 4.41 5.54
6.11 3.59 5.50
4.50 5.87 5.37
4.61 3.83 5.29
5.17 6.03 4.92
4.53 4.89 6.15
5.33 4.32 5.80
5.14 4.69 5.26

Data should be arranged in a format that makes downstream analysis easier, instead of forcing functions to work on poorly formatted data. We can see this with the PlantGrowth data set (table 2.1). The long, tidy format makes it easy to draw plots, define linear models and even calculate group-wise descriptive statistics. A more typical way to store this data would be like table 2.2, but that format is much more difficult to work with. Can you imagine how to draw the same box plot if your data were in this format? It’s possible, but not nice!
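As a minimal sketch of that convenience, assuming the built-in PlantGrowth data frame from base R's datasets package (which has the same weight and group columns as table 2.1), a box plot needs only a one-line formula:

```r
# PlantGrowth ships with base R (datasets package) and is already long and tidy:
# one measurement column (weight) and one grouping column (group).
str(PlantGrowth)

# The formula reads "weight described by group".
boxplot(weight ~ group, data = PlantGrowth)
```

The same formula interface does not apply directly to the wide layout, because there the grouping information lives in the column names rather than in a column.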

Some statisticians, bioinformaticians and data scientists estimate that about three-quarters of their time is spent on “data munging,” that is, getting data cleaned-up and tidy so that they can actually analyse it.

Exercise 2.1 Read in the wide format of the PlantGrowth data set using the commands given below:
library(tidyverse)
PG_wide <- read_tsv("data/PlantGrowth_Wide.txt")
glimpse(PG_wide)
#> Rows: 10
#> Columns: 3
#> $ ctrl <dbl> 4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.5…
#> $ trt1 <dbl> 4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.8…
#> $ trt2 <dbl> 6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.1…

Notice that we used readr::read_tsv() instead of utils::read.delim(). The reading functions from the readr package (loaded as part of the tidyverse) are a bit more convenient than their base R counterparts, and their names contain an _ instead of a ., which is a common feature of tidyverse syntax.

This has one key consequence here. The data frame is not just a data frame, but it’s also a tibble:

class(PG_wide)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
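For comparison, here is a minimal sketch of what the base reader returns for the same input. It assumes a small temporary tab-separated file written on the fly (a stand-in, not the course's data file); the names tmp, base_df and tidy_df are illustrative only.

```r
library(readr)

# Write a tiny tab-separated file to a temporary location (illustrative data only).
tmp <- tempfile(fileext = ".txt")
writeLines(c("ctrl\ttrt1\ttrt2",
             "4.17\t4.81\t6.31",
             "5.58\t4.17\t5.12"), tmp)

base_df <- read.delim(tmp)                        # base R reader
tidy_df <- read_tsv(tmp, show_col_types = FALSE)  # readr reader

class(base_df)  # a plain data frame
class(tidy_df)  # a tibble, which is still a data frame
```

Both objects hold the same values; only the class, and therefore the printing and subsetting behaviour, differs.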

Thus, when we print it to the screen, we’ll only get the first 10 lines by default, which makes it convenient for prototyping functions and collections of functions.

PG_wide
#> # A tibble: 10 × 3
#>     ctrl  trt1  trt2
#>    <dbl> <dbl> <dbl>
#>  1  4.17  4.81  6.31
#>  2  5.58  4.17  5.12
#>  3  5.18  4.41  5.54
#>  4  6.11  3.59  5.5 
#>  5  4.5   5.87  5.37
#>  6  4.61  3.83  5.29
#>  7  5.17  6.03  4.92
#>  8  4.53  4.89  6.15
#>  9  5.33  4.32  5.8 
#> 10  5.14  4.69  5.26

Here, that isn’t so important since we only have 10 rows. In general, though, the printed output is nicer: it shows only as many columns as fit on your screen, along with the type of each column, which is pretty handy.

Another consequence is that tibbles stay data frames no matter what. That means that:

PG_wide[,3]
#> # A tibble: 10 × 1
#>     trt2
#>    <dbl>
#>  1  6.31
#>  2  5.12
#>  3  5.54
#>  4  5.5 
#>  5  5.37
#>  6  5.29
#>  7  4.92
#>  8  6.15
#>  9  5.8 
#> 10  5.26

is the same as

PG_wide[3]
#> # A tibble: 10 × 1
#>     trt2
#>    <dbl>
#>  1  6.31
#>  2  5.12
#>  3  5.54
#>  4  5.5 
#>  5  5.37
#>  6  5.29
#>  7  4.92
#>  8  6.15
#>  9  5.8 
#> 10  5.26

If we just had a regular data frame in base R, this wouldn’t be the case: PG_wide[,3] would result in a vector, whereas PG_wide[3] would result in a data frame. Base R switches between data frames and vectors depending on how you subset.
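The difference can be checked directly. The sketch below uses a small made-up data frame (values copied from the first rows of PG_wide) rather than the real file, and converts it with tibble::as_tibble(); the names df and tb are illustrative.

```r
library(tibble)

df <- data.frame(ctrl = c(4.17, 5.58, 5.18), trt2 = c(6.31, 5.12, 5.54))
tb <- as_tibble(df)

# Base data frame: the result type depends on the subsetting style.
is.data.frame(df[, 2])  # FALSE, dropped to a numeric vector
is.data.frame(df[2])    # TRUE

# Tibble: both styles return a tibble.
is.data.frame(tb[, 2])  # TRUE
is.data.frame(tb[2])    # TRUE

# Being explicit works for either class:
df[, 2, drop = FALSE]   # keep the data frame
tb[[2]]                 # extract the column as a vector
```

When a vector is what you actually want, [[ ]] or $ states that intent clearly for both tibbles and base data frames.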

Now that we have some small play data, let’s try to complete the following exercises using PG_wide. Give each exercise an honest effort. You’ll find that they are possible, but quite a bit of work to accomplish.

Exercise 2.2 Draw a jittered dot plot of the weight described by each treatment type. The result should look like below.
Exercise 2.3 Calculate the group-wise mean and standard deviation for all 10 observations in each treatment group. Save the results as a data frame. The results should look like below.
group avg stdev
ctrl 5.032 0.5830914
trt1 4.661 0.7936757
trt2 5.526 0.4425733
Exercise 2.4 Perform a z-score transformation within each treatment type. Try using the built-in function scale(...), where ... is the vector to apply a z-transformation to. Your results should look like below.
ctrl trt1 trt2
-1.4783275 0.18773411 1.77145804
0.9398184 -0.61864059 -0.91736220
0.2538196 -0.31625008 0.03163318
1.8487668 -1.34941766 -0.05874733
-0.9123784 1.52329220 -0.35248400
-0.7237288 -1.04702715 -0.53324502
0.2366696 1.72488588 -1.36926476
-0.8609285 0.28853095 1.40993599
0.5110691 -0.42964652 0.61910651
0.1852197 0.03653885 -0.60103041
Exercise 2.5 Calculate a 1-way ANOVA of each plant’s weight described by its treatment. Try using anova(lm(y ~ x, data = ...)). The results should look like the table below:
Table 2.3: The results of a one-way ANOVA on the plant growth data set.
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.76634 1.8831700 4.846088 0.01591
Residuals 27 10.49209 0.3885959 NA NA

2.2 The pipe operator - %>%

Before we get into tidyverse functions, let’s cover a fundamental piece of syntax: %>%, aka the pipe operator. The pipe operator is a shorthand for calling functions: it passes the object on its left as the first argument of the function on its right, such that:

f(x) is equivalent to x %>% f() (When reading the commands aloud, say “and then” for the pipe.)

This is not specific to dplyr functions. It can be used with any R functions:

aa <- 1:10

# The usual way
mean(aa)
#> [1] 5.5

# or with the pipe operator
aa %>% 
  mean()
#> [1] 5.5

The advantage here is that we can string together a long series of functions that would be very difficult to read as nested function calls, which we’ll see in a minute as our functions get more complex.
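A short sketch of that readability difference, reusing the aa vector from above. The three functions chained here (mean(), sqrt(), round()) are chosen only for illustration, and magrittr is loaded explicitly so the block runs without the full tidyverse.

```r
library(magrittr)  # provides %>%; also attached by library(tidyverse)

aa <- 1:10

# Nested calls are read from the inside out.
nested <- round(sqrt(mean(aa)), 2)

# Piped calls are read top to bottom: take aa, and then mean, and then sqrt, and then round.
piped <- aa %>%
  mean() %>%
  sqrt() %>%
  round(2)

identical(nested, piped)
#> [1] TRUE
```

Extra arguments, such as the 2 passed to round(), go after the piped value, which always fills the first argument.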

2.3 Tidy Data with the tidyr Package

To understand tidy data, let’s get PG_wide into a better format.

To work with our data in R, we want all variables in their own columns. To achieve this, we will gather our data into a long, tidy form. To understand what this means, we need to realize that there are essentially two different types of variables:

ID Variables are all the possible grouping variables which were measured. These include both independent and dependent categorical variables. ID variables are used to group, i.e. identify, our measurement variables. Here, the only ID variable is the treatment group (ctrl, trt1 or trt2).

Measurement Variables are what was measured, here that’s the weight.

2.4 Pivot to longer

To get tidy data we’ll use the flexible tidyr::pivot_longer() function. This function takes four arguments:

  1. The wide data frame to make long (i.e. tidy).
  2. The ID (specified with -) or MEASURE variables, given to cols.
  3. The name of the output column for the KEY, given to names_to.
  4. The name of the output column for the VALUE, given to values_to.

Here, all columns are MEASURE columns, so we use everything() to specify that.

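# The old column names (ctrl, trt1, trt2) become the values of "group";
# the cell values become the values of "weight".
PG_long <- PG_wide %>% 
  pivot_longer(cols = everything(), names_to = "group", values_to = "weight")
group weight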
ctrl 4.17
trt1 4.81
trt2 6.31
ctrl 5.58
trt1 4.17
trt2 5.12
ctrl 5.18
trt1 4.41
trt2 5.54
ctrl 6.11
trt1 3.59
trt2 5.50
ctrl 4.50
trt1 5.87
trt2 5.37
ctrl 4.61
trt1 3.83
trt2 5.29
ctrl 5.17
trt1 6.03
trt2 4.92
ctrl 4.53
trt1 4.89
trt2 6.15
ctrl 5.33
trt1 4.32
trt2 5.80
ctrl 5.14
trt1 4.69
trt2 5.26
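PG_wide has no ID column, so the - notation mentioned in the argument list is not needed here. As a hedged sketch of how it would be used, assume a hypothetical wide table PG_id with an extra plant column that identifies each row. Neither PG_id nor the plant column is part of the PlantGrowth file; both are made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data with one ID column (plant) and three MEASURE columns.
PG_id <- tibble(
  plant = 1:2,
  ctrl  = c(4.17, 5.58),
  trt1  = c(4.81, 4.17),
  trt2  = c(6.31, 5.12)
)

# -plant means "pivot every column except plant"; plant is repeated for each new row.
PG_id_long <- pivot_longer(PG_id, cols = -plant,
                           names_to = "group", values_to = "weight")
PG_id_long
```

Each of the two original rows becomes three rows, one per treatment column, and the plant value is carried along so every weight can still be traced back to its source row.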

Now that you have long, tidy data, revisit the exercises from above:

Exercise 2.6 Draw a jittered dot plot of the weight described by each treatment type. The result should look like below.
Exercise 2.7 Calculate the group-wise mean and standard deviation for all 10 observations in each treatment group. Save the results as a data frame. The results should look like below.
Exercise 2.8 Perform a z-score transformation within each treatment type. Your results should look like below.
Exercise 2.9 Calculate a 1-way ANOVA of each plant’s weight described by its treatment. The results should look like the table below: