3 The Grammar of Data Analysis with dplyr

3.1 Learning Objectives

The tidyverse ecosystem is very data frame-centric and often relies on tibbles. dplyr is a core component of the tidyverse and is comprised of three main components:

  1. The pipe operator %>%,
  2. The group_by() adverb, and
  3. The five verbs of dplyr:
    1. filter(), (Section 3.4.1),
    2. arrange(), (Section 3.4.2),
    3. select(), (Section 3.4.3) ,
    4. mutate(), (Section 3.4.4) ,
    5. summarise(), (Section 3.4.4)

3.2 The grammar of data analysis, dplyr

Moving forward we’ll use the built-in data frame called PlantGrowth instead of the PG_long data frame we created in the last chapter. You’ll recognize that it’s the same data frame.

Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is

  • Split into sub-groups defined by a categorical (aka factor) variable. A function is then
  • Applied to each individual sub-set, and the results are
  • Combined into a new data-set or added onto the original data-set.

There are many ways to perform split-apply-combine operations in R, so this is nothing new.

The apply family of functions are powerful are useful base package functions. Typical base package functions include apply, tapply(), sapply(), lapply(), mapply(), by(), and aggregrate().

However, they proved difficult to master for new-comers since the class for the input and output of each function was different, and not obvious. e.g. Can you identify what the type and structure of this output is:

tapply(PlantGrowth$weight, PlantGrowth$group, mean)
#>  ctrl  trt1  trt2 
#> 5.032 4.661 5.526

For a newbie, that is no easy task! Recall, one of the biggest problems you’ll encounter is having data in the wrong class!

There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform.

The many packages which have moved R into this direction are called the tidyverse.

The dplyr package takes the analogy of the program language closer to spoken language by referring to the data frames as nouns and the actions performed as verbs. You’ll also encounter the punctuation %>%, an adverb group_by() and a pronoun .data. The above command is written as below. It’s even more flexible since we can easily ass more functions to summarise.

PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight),
            stdev = sd(weight))
#> # A tibble: 3 × 3
#>   group   avg stdev
#>   <fct> <dbl> <dbl>
#> 1 ctrl   5.03 0.583
#> 2 trt1   4.66 0.794
#> 3 trt2   5.53 0.443

3.3 Tibbles

In this section we’ll use a larger file. It’s a collection of proteins and some attributes from an experiment to quantify their abundance in various cell lines. Let’s get started by loading the package and reading in a text file called protein.txt. You can find it using the command below.

library(tidyverse) # load all core tidyverse packages
protein <- read_tsv("/data/Protein.txt")

The first thing we’d want to do with our data frame is convert it to a tibble. Actually we don’t need to since we used tidyr functions.

class(protein)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Recall, when we call:

protein
#> # A tibble: 1,223 × 15
#>    Description     Uniprot   Peptides MW.kDa Sequence.Length
#>    <chr>           <chr>        <dbl>  <dbl>           <dbl>
#>  1 >ENSEMBL:ENSBT… <NA>            14  193.             1742
#>  2 Tubulin alpha-… TBA1B_MO…       10   50.2             451
#>  3 >ENSEMBL:ENSBT… <NA>            13  164.             1477
#>  4 >ENSEMBL:ENSBT… <NA>             3   65.2             578
#>  5 >P01030 SWISS-… <NA>            14  193.             1741
#>  6 >P01966 SWISS-… <NA>             3   15.2             142
#>  7 >P02769 SWISS-… <NA>             7   69.3             607
#>  8 >P12763 SWISS-… <NA>             3   38.4             359
#>  9 >P15636 SWISS-… <NA>             1   68.1             653
#> 10 >P34955 SWISS-… <NA>             9   46.1             416
#> # … with 1,213 more rows, and 10 more variables:
#> #   Ratio.M.L <dbl>, Ratio.M.L.Sig <dbl>, Ratio.H.L <dbl>,
#> #   Ratio.H.L.Sig <dbl>, Ratio.H.M <dbl>,
#> #   Ratio.H.M.Sig <dbl>, Intensity.L <dbl>,
#> #   Intensity.M <dbl>, Intensity.H <dbl>, Contaminant <chr>

it will print nicer. Another really useful dplyr function is glimpse():

glimpse(protein)
#> Rows: 1,223
#> Columns: 15
#> $ Description     <chr> ">ENSEMBL:ENSBTAP00000007350 (Bos …
#> $ Uniprot         <chr> NA, "TBA1B_MOUSE;O89052_MOUSE;Q99K…
#> $ Peptides        <dbl> 14, 10, 13, 3, 14, 3, 7, 3, 1, 9, …
#> $ MW.kDa          <dbl> 192.990, 50.151, 164.340, 65.190, …
#> $ Sequence.Length <dbl> 1742, 451, 1477, 578, 1741, 142, 6…
#> $ Ratio.M.L       <dbl> 0.065122, 0.519120, 0.103550, 0.07…
#> $ Ratio.M.L.Sig   <dbl> 6.4800e-08, 3.4958e-01, 2.7800e-05…
#> $ Ratio.H.L       <dbl> 0.023451, 0.285370, 0.049841, 0.04…
#> $ Ratio.H.L.Sig   <dbl> 2.0200e-08, 2.8918e-01, 7.3900e-05…
#> $ Ratio.H.M       <dbl> 0.35098, 0.49988, 0.44963, 0.47136…
#> $ Ratio.H.M.Sig   <dbl> 0.01815200, 0.25055000, 0.19706000…
#> $ Intensity.L     <dbl> 91981000, 30801000, 35614000, 9823…
#> $ Intensity.M     <dbl> 9215400, 18667000, 8236200, 182610…
#> $ Intensity.H     <dbl> 5263600, 10136000, 4904200, 120500…
#> $ Contaminant     <chr> "+", "+", "+", "+", "+", "+", "+",…

This is like str(), but it provides a more human-friendly output. Those are nice new functions, but let’s move onto the real purpose of dplyr.

There are three main components to understanding dplyr1.

  • The pipe operator
  • The five verbs plus the helper functions, and
  • An adverb

3.4 The Five Verbs

There are five verbs which we can use to act upon our noun:

Table 3.1: The five dplyr verbs.
Operates on Verb Description Section
Observations filter() Filter observations given specific criteria (see subset()). Section 3.4.1
Observations arrange() Rearrange observations given specific criteria (see order()). Section 3.4.2
Variables select() Select variables meeting specific criteria. Section 3.4.3
Variables mutate() Apply transformation functions on selected variables. Section 3.4.4
Variables summarise() Apply aggregation functions on selected variables. Section 3.4.4

Let’s take a look at these functions in the context of operations we already understand

3.4.1 filter()

Recall that we can use [] to filter a data frame.

protein[protein$Contaminant != "+", ]

The above [] call is equivalent to filter(), except it the handling of NA values. Here, filter() is just the dplyr version:


protein %>%
    filter(Contaminant != "+")

The advantage is that it’s easy to combine many logical expressions with an & using a comma:


protein %>%
    filter(Contaminant != "+", Ratio.M.L.Sig < 0.05)

3.4.2 arrange()

Previously, we extracted the Uniprot IDs having the top 20 M/L ratios. First we had to get the index of the order and then apply that to the data frame. Here, we can use the arrange() function:2

protein %>%
    arrange(desc(Ratio.M.L)) %>%
    select(Uniprot) %>%
    .[1:20,] -> topML

Some alternatives for the last line in the above commands:

slice(1:20) -> topML

or

head(20) -> topML

The result is a one-column data frame. Remember, dplyr is very data frame centric.

glimpse(topML)
#> Rows: 20
#> Columns: 1
#> $ Uniprot <chr> "ANXA2_MOUSE;Q3UCD3_MOUSE;Q542G9_MOUSE;Q99…

3.4.3 select()

select() is used for choosing specific columns.


protein %>%
    select(Ratio.M.L, Ratio.H.L, Ratio.H.M)
#> # A tibble: 1,223 × 3
#>    Ratio.M.L Ratio.H.L Ratio.H.M
#>        <dbl>     <dbl>     <dbl>
#>  1    0.0651    0.0235     0.351
#>  2    0.519     0.285      0.500
#>  3    0.104     0.0498     0.450
#>  4    0.0728    0.0405     0.471
#>  5   NA        NA         NA    
#>  6    0.174     0.166      0.997
#>  7    0.0976    0.0793     0.661
#>  8    0.187     0.0374     0.249
#>  9   NA        NA         NA    
#> 10    0.0656    0.0305     0.421
#> # … with 1,213 more rows

# recall indexing:
# foo.df[1:2] # first two columns
# foo.df[-3] # all except the 3rd

This is all fine and good, but we also have a number of helper functions that we can use to access columns.

Table 3.2: The dplyr helper functions.
Helper function Description
starts_with(x) Names starts with x.
ends_with(x) Names ends in x.
contains(x) Selects variable names containing x.
matches(x) Selects all variable names matching the regex x.
num_range("x", 1:5) Selects x1 to x5.
one_of("x", "y", "z") Selects variables in a character vector.
everything() Selects all variables.

protein %>%
    select(starts_with("Rat"))
#> # A tibble: 1,223 × 6
#>    Ratio.M.L Ratio.M.L.Sig Ratio.H.L Ratio.H.L.Sig Ratio.H.M
#>        <dbl>         <dbl>     <dbl>         <dbl>     <dbl>
#>  1    0.0651  0.0000000648    0.0235  0.0000000202     0.351
#>  2    0.519   0.350           0.285   0.289            0.500
#>  3    0.104   0.0000278       0.0498  0.0000739        0.450
#>  4    0.0728  0.0000516       0.0405  0.0000222        0.471
#>  5   NA      NA              NA      NA               NA    
#>  6    0.174   0.0128          0.166   0.110            0.997
#>  7    0.0976  0.000014        0.0793  0.000705         0.661
#>  8    0.187   0.00418         0.0374  0.00000167       0.249
#>  9   NA      NA              NA      NA               NA    
#> 10    0.0656  0.00000016      0.0305  0.000000179      0.421
#> # … with 1,213 more rows, and 1 more variable:
#> #   Ratio.H.M.Sig <dbl>

Remember, we can also use the - notation (like we did in []) to not select specific columns.3


protein %>%
    select(starts_with("Rat"), -ends_with("Sig"))
#> # A tibble: 1,223 × 3
#>    Ratio.M.L Ratio.H.L Ratio.H.M
#>        <dbl>     <dbl>     <dbl>
#>  1    0.0651    0.0235     0.351
#>  2    0.519     0.285      0.500
#>  3    0.104     0.0498     0.450
#>  4    0.0728    0.0405     0.471
#>  5   NA        NA         NA    
#>  6    0.174     0.166      0.997
#>  7    0.0976    0.0793     0.661
#>  8    0.187     0.0374     0.249
#>  9   NA        NA         NA    
#> 10    0.0656    0.0305     0.421
#> # … with 1,213 more rows

OK, so now that we have our columns, we can pass them along to some functions:


protein %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    log2()
#> # A tibble: 1,223 × 3
#>    Ratio.M.L Ratio.H.L Ratio.H.M
#>        <dbl>     <dbl>     <dbl>
#>  1    -3.94      -5.41  -1.51   
#>  2    -0.946     -1.81  -1.00   
#>  3    -3.27      -4.33  -1.15   
#>  4    -3.78      -4.63  -1.09   
#>  5    NA         NA     NA      
#>  6    -2.52      -2.59  -0.00464
#>  7    -3.36      -3.66  -0.598  
#>  8    -2.42      -4.74  -2.00   
#>  9    NA         NA     NA      
#> 10    -3.93      -5.04  -1.25   
#> # … with 1,213 more rows

In this very simplistic view, we need to attach the original data frame back to the \(log_2\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr function bind_cols(). This is a dplyr version of cbind().

protein %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    log2() %>%
    bind_cols(protein)
#> New names:
#> * Ratio.M.L -> Ratio.M.L...1
#> * Ratio.H.L -> Ratio.H.L...2
#> * Ratio.H.M -> Ratio.H.M...3
#> * Ratio.M.L -> Ratio.M.L...9
#> * Ratio.H.L -> Ratio.H.L...11
#> * ...
#> # A tibble: 1,223 × 18
#>    Ratio.M.L...1 Ratio.H.L...2 Ratio.H.M...3 Description    
#>            <dbl>         <dbl>         <dbl> <chr>          
#>  1        -3.94          -5.41      -1.51    >ENSEMBL:ENSBT…
#>  2        -0.946         -1.81      -1.00    Tubulin alpha-…
#>  3        -3.27          -4.33      -1.15    >ENSEMBL:ENSBT…
#>  4        -3.78          -4.63      -1.09    >ENSEMBL:ENSBT…
#>  5        NA             NA         NA       >P01030 SWISS-…
#>  6        -2.52          -2.59      -0.00464 >P01966 SWISS-…
#>  7        -3.36          -3.66      -0.598   >P02769 SWISS-…
#>  8        -2.42          -4.74      -2.00    >P12763 SWISS-…
#>  9        NA             NA         NA       >P15636 SWISS-…
#> 10        -3.93          -5.04      -1.25    >P34955 SWISS-…
#> # … with 1,213 more rows, and 14 more variables:
#> #   Uniprot <chr>, Peptides <dbl>, MW.kDa <dbl>,
#> #   Sequence.Length <dbl>, Ratio.M.L...9 <dbl>,
#> #   Ratio.M.L.Sig <dbl>, Ratio.H.L...11 <dbl>,
#> #   Ratio.H.L.Sig <dbl>, Ratio.H.M...13 <dbl>,
#> #   Ratio.H.M.Sig <dbl>, Intensity.L <dbl>,
#> #   Intensity.M <dbl>, Intensity.H <dbl>, …

There are even better ways, which we’ll see in the next section.

3.4.4 mutate() & summarise()

Remember, we have two basic types of functions we want to apply to our data:

Some common normalizations (transformations) include z-scores and scaling on a min-max range [0,1]:


PlantGrowth %>%
  mutate(Z_score = (weight - mean(weight))/sd(weight),
         min.max = (weight - min(weight)) / (max(weight) - min(weight)))
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 30 × 4
#>    weight group Z_score min.max
#>     <dbl> <fct>   <dbl>   <dbl>
#>  1   4.17 ctrl  -1.29     0.213
#>  2   5.58 ctrl   0.723    0.732
#>  3   5.18 ctrl   0.153    0.585
#>  4   6.11 ctrl   1.48     0.926
#>  5   4.5  ctrl  -0.817    0.335
#>  6   4.61 ctrl  -0.660    0.375
#>  7   5.17 ctrl   0.138    0.581
#>  8   4.53 ctrl  -0.774    0.346
#>  9   5.33 ctrl   0.367    0.640
#> 10   5.14 ctrl   0.0956   0.570
#> # … with 20 more rows

Descriptive statistics are aggregation functions:

PlantGrowth %>%
  summarise(averge = mean(weight),
            st.dev = sd(weight))
#>   averge    st.dev
#> 1  5.073 0.7011918

If we see what we did previously with the \(log_{2}\) ratios, you may decide we need something more elegant. For example in protein, we want to transform (i.e. mutate()) 3 columns using \(log_{2}\) and another 3 columns using \(log_{10}\). For starters, we can do this:


protein %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    mutate(Ratio.M.L_Log2 = log2(Ratio.M.L),
           Ratio.H.L_Log2 = log2(Ratio.H.L),
           Ratio.H.M_Log2 = log2(Ratio.H.M))
#> # A tibble: 1,223 × 6
#>    Ratio.M.L Ratio.H.L Ratio.H.M Ratio.M.L_Log2
#>        <dbl>     <dbl>     <dbl>          <dbl>
#>  1    0.0651    0.0235     0.351         -3.94 
#>  2    0.519     0.285      0.500         -0.946
#>  3    0.104     0.0498     0.450         -3.27 
#>  4    0.0728    0.0405     0.471         -3.78 
#>  5   NA        NA         NA             NA    
#>  6    0.174     0.166      0.997         -2.52 
#>  7    0.0976    0.0793     0.661         -3.36 
#>  8    0.187     0.0374     0.249         -2.42 
#>  9   NA        NA         NA             NA    
#> 10    0.0656    0.0305     0.421         -3.93 
#> # … with 1,213 more rows, and 2 more variables:
#> #   Ratio.H.L_Log2 <dbl>, Ratio.H.M_Log2 <dbl>

But you can imagine that that’s pretty tedious. So to help automate things, we have a series of helper functions.

Table 3.3: Variations of summarise() and mutate(). For each variant, replace ... with either mutate, summarise, or summarize.
Function variant Description Section
*_if() Conditional mapping Section 3.4.5
*_at() Specify columns Section 3.4.6
*_all() All variables Not discussed
*_each() Each, renaming Not discussed

3.4.5 summarise_if(), mutate_if()

_if variants mutate or summarise given a conditional equation.


PlantGrowth %>%
  summarise_if(is.numeric, mean)
#>   weight
#> 1  5.073

3.4.6 summarise_at(), mutate_at()

summarise_at() allows us to bypass select() by using the helper functions with the vars() function.4 For example, we can get the mean of specific columns we are interested in:


protein %>%
  summarise_at(vars(Peptides, MW.kDa, Sequence.Length), mean)

Or, we can perform transformations on specific columns:


protein %>%
  mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2)

This is what we’ll do here, because we can mutate in-situ:


protein %>%
    filter(Contaminant != "+") %>%
    mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2) %>%
    mutate_at(vars(starts_with("Int")), funs(log10)) %>%
    mutate(Intensity.H.L = Intensity.H + Intensity.L,
           Intensity.H.M = Intensity.H + Intensity.M,
           Intensity.M.L = Intensity.M + Intensity.L)

3.5 group_by()

So far we’ve seen transformations and aggregation functions on the entire data set, but what of the split component in split-apply-combine? The final piece of the puzzle is the group_by() function, which tells R how to group data.

PlantGrowth %>% 
  group_by(group) %>% 
  summarise(avg = mean(weight))
#> # A tibble: 3 × 2
#>   group   avg
#>   <fct> <dbl>
#> 1 ctrl   5.03
#> 2 trt1   4.66
#> 3 trt2   5.53

3.6 The glue package

The glue package is useful for creating strings as combinations of characters and computed values and is a bit easier that using paste().


library(tidyverse)
iris %>% 
  gather(key, value, -Species) %>% 
  group_by(Species, key) %>% 
  summarise(avg = mean(value)) -> iris_summary
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.

plant <- "versicolor"
metric <- "Petal.Length"

iris_summary$avg[iris_summary$Species == plant & iris_summary$key == metric]
#> [1] 4.26

library(glue)
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
glue('My average {metric} of {plant} is {iris_summary$avg[iris_summary$Species == plant & iris_summary$key == metric]}cm.')
#> My average Petal.Length of versicolor is 4.26cm.
library(dplyr)
head(iris) %>%
  mutate(description = glue("This {Species} has a petal length of {Petal.Length}"))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
#>                             description
#> 1 This setosa has a petal length of 1.4
#> 2 This setosa has a petal length of 1.4
#> 3 This setosa has a petal length of 1.3
#> 4 This setosa has a petal length of 1.5
#> 5 This setosa has a petal length of 1.4
#> 6 This setosa has a petal length of 1.7