Chapter 14 Element 7: Tidyverse – dplyr

14.1 Learning Objectives

The dplyr ecosystem is composed of three main components:

The pipe operator %>%,
The five verbs of dplyr:
- filter(),
- arrange(),
- select(),
- mutate(),
- summarise(), and
group_by()

14.2 Split-Apply-Combine with the dplyr Package

Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is

Split into sub-groups defined by a categorical (aka factor) variable. A function is then
Applied to each individual sub-set, and the results are
Combined into a new data-set or added onto the original data-set.

There are many ways to perform split-apply-combine operations in R. Actually, we already saw this in action on the first day of the workshop (4).

PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight),
            stdev = sd(weight))

# # A tibble: 3 x 3
#   group   avg stdev
#   <fct> <dbl> <dbl>
# 1 trt2   5.53 0.443
# 2 trt1   4.66 0.794
# 3 ctrl   5.03 0.583

The apply family of functions are powerful are useful base package functions. However, they proved difficult to master for new-comers since the class for the input and output of each function was different, and not obvious. There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform.

The dplyr package takes the analogy of the program language closer to spoken language by referring to the data frames as nouns and the actions performed as verbs.

14.3 Tibbles

Let’s get started by loading the package. We’ll see some familiar operations done in a new way, so let’s re-import the protein.txt file, just to be sure we’re all on the same page.

library(tidyverse) # or only dply with library(dplyr)
protein_df <- read.delim("protein.txt", stringsAsFactors = FALSE)

The first thing we want to do with our data frame is convert it to a tibble. Actually we’re not changing anything about the data frame, it’s still a data frame, but we’re adding a new attribute, so that

class(protein_df)

# [1] "data.frame"

attributes(protein_df)$class

# [1] "data.frame"

# Convert data frame to a tibble (tbl):
protein_df <- tbl_df(protein_df)

just adds some attributes to the object:⁴⁶

class(protein_df)

# [1] "tbl_df"     "tbl"        "data.frame"

The only reason we do this is so that when we call

protein_df

# # A tibble: 1,223 x 15
#    Description Uniprot Peptides MW.kDa Sequence.Length Ratio.M.L
#    <chr>       <chr>      <int>  <dbl>           <int>     <dbl>
#  1 >ENSEMBL:E… ""            14  193.             1742    0.0651
#  2 Tubulin al… "TBA1B…       10   50.2             451    0.519 
#  3 >ENSEMBL:E… ""            13  164.             1477    0.104 
#  4 >ENSEMBL:E… ""             3   65.2             578    0.0728
#  5 >P01030 SW… ""            14  193.             1741   NA     
#  6 >P01966 SW… ""             3   15.2             142    0.174 
#  7 >P02769 SW… ""             7   69.3             607    0.0976
#  8 >P12763 SW… ""             3   38.4             359    0.187 
#  9 >P15636 SW… ""             1   68.1             653   NA     
# 10 >P34955 SW… ""             9   46.1             416    0.0656
# # … with 1,213 more rows, and 9 more variables:
# #   Ratio.M.L.Sig <dbl>, Ratio.H.L <dbl>, Ratio.H.L.Sig <dbl>,
# #   Ratio.H.M <dbl>, Ratio.H.M.Sig <dbl>, Intensity.L <int>,
# #   Intensity.M <int>, Intensity.H <int>, Contaminant <chr>

it will print nicer. Another really useful dplyr function is glimpse():

glimpse(protein_df)

# Rows: 1,223
# Columns: 15
# $ Description     <chr> ">ENSEMBL:ENSBTAP00000007350 (Bos taurus)…
# $ Uniprot         <chr> "", "TBA1B_MOUSE;O89052_MOUSE;Q99KA2_MOUS…
# $ Peptides        <int> 14, 10, 13, 3, 14, 3, 7, 3, 1, 9, 9, 1, 7…
# $ MW.kDa          <dbl> 193, 50, 164, 65, 193, 15, 69, 38, 68, 46…
# $ Sequence.Length <int> 1742, 451, 1477, 578, 1741, 142, 607, 359…
# $ Ratio.M.L       <dbl> 0.065, 0.519, 0.104, 0.073, NA, 0.174, 0.…
# $ Ratio.M.L.Sig   <dbl> 6.5e-08, 3.5e-01, 2.8e-05, 5.2e-05, NA, 1…
# $ Ratio.H.L       <dbl> 0.023, 0.285, 0.050, 0.040, NA, 0.166, 0.…
# $ Ratio.H.L.Sig   <dbl> 2.0e-08, 2.9e-01, 7.4e-05, 2.2e-05, NA, 1…
# $ Ratio.H.M       <dbl> 0.35, 0.50, 0.45, 0.47, NA, 1.00, 0.66, 0…
# $ Ratio.H.M.Sig   <dbl> 0.01815, 0.25055, 0.19706, 0.26863, NA, 0…
# $ Intensity.L     <int> 91981000, 30801000, 35614000, 9823100, 19…
# $ Intensity.M     <int> 9215400, 18667000, 8236200, 1826100, 9003…
# $ Intensity.H     <int> 5263600, 10136000, 4904200, 1205000, 1443…
# $ Contaminant     <chr> "+", "+", "+", "+", "+", "+", "+", "+", "…

This is like str(), but it provides a more human-friendly output. Those are nice new functions, but let’s move onto the real purpose of dplyr.

There are three main components to understanding dplyr.⁴⁷

The pipe operator
The five verbs plus the helper functions
An adverb

14.4 The pipe operator - %>%

The pipe operator is a shorthand for calling functions, such that:

function(x) == x \%>\% function()⁴⁸

This is not specific to dplyr functions. It can be used with any R functions:

aa <- 1:10

# The usual way
mean(aa)

# [1] 5.5

# or with the pipe operator
aa %>% 
  mean()

# [1] 5.5

The advantage here is that we can string together a long series of functions that would be very difficult to read in the regular R nomenclature, which we’ll see in a minute as our functions get more complex.

14.5 The Five Verbs

There are five verbs which we can use to act upon our noun:

Table 14.1: The five `dplyr` verbs.
Verb	Description	Section
Operating on observations
`filter()`	Filter observations given specific criteria (see `subset()`).	Section 14.5.1
`arrange()`	Rearrange observations given specific criteria (see `order()`).	Section 14.5.2
Operating on variables
`select()`	Select variables meeting specific criteria.	Section 14.5.3
`mutate()`	Apply transformation functions on selected variables.	Section 14.5.4
`summarise()`	Apply aggregation functions on selected variables.	Section 14.5.4

Let’s take a look at these functions in the context of operations we already understand

14.5.1 filter()

Recall that we can use subset() or [] to filter a data frame.

subset(protein_df, Contaminant != "+")
protein_df[protein_df$Contaminant != "+", ]

The above functions are equivalent. Here, filter() is just the dplyr version:

protein_df %>%
    filter(Contaminant != "+")

The advantage is that it’s easy to combine many logical expressions with an & using a comma:

protein_df %>%
    filter(Contaminant != "+", Ratio.M.L.Sig < 0.05)

14.5.2 arrange()

Previously, we extracted the Uniprot IDs having the top 20 M/L ratios. First we had to get the index of the order and then apply that to the data frame. Here, we can use the arrange() function:⁴⁹

protein_df %>%
    arrange(desc(Ratio.M.L)) %>%
    select(Uniprot) %>%
    .[1:20,] -> topML

Some alternatives for the last line in the above commands:

slice(1:20) -> topML

head(20) -> topML

The result is a one-column data frame. Remember, dplyr is very data frame centric.

glimpse(topML)

# Rows: 20
# Columns: 1
# $ Uniprot <chr> "ANXA2_MOUSE;Q3UCD3_MOUSE;Q542G9_MOUSE;Q99KH3_MOU…

14.5.3 select()

select() is used for choosing specific columns.

protein_df %>%
    select(Ratio.M.L, Ratio.H.L, Ratio.H.M)

# # A tibble: 1,223 x 3
#    Ratio.M.L Ratio.H.L Ratio.H.M
#        <dbl>     <dbl>     <dbl>
#  1    0.0651    0.0235     0.351
#  2    0.519     0.285      0.500
#  3    0.104     0.0498     0.450
#  4    0.0728    0.0405     0.471
#  5   NA        NA         NA    
#  6    0.174     0.166      0.997
#  7    0.0976    0.0793     0.661
#  8    0.187     0.0374     0.249
#  9   NA        NA         NA    
# 10    0.0656    0.0305     0.421
# # … with 1,213 more rows

# recall indexing:
foo_df[1:2] # first two columns

#   healthy    tissue
# 1    TRUE     Liver
# 2   FALSE     Brain
# 3   FALSE    Testes
# 4    TRUE    Muscle
# 5    TRUE Intestine
# 6   FALSE     Heart

foo_df[-3] # all except the 3rd

#   healthy    tissue
# 1    TRUE     Liver
# 2   FALSE     Brain
# 3   FALSE    Testes
# 4    TRUE    Muscle
# 5    TRUE Intestine
# 6   FALSE     Heart

This is all fine and good, but we also have a number of helper functions that we can use to access columns.

Table 14.2: The `dplyr` helper functions.
Helper function	Description
`starts_with(x)`	Names starts with x.
`ends_with(x)`	Names ends in x.
`contains(x)`	Selects variable names containing x.
`matches(x)`	Selects all variable names matching the regex x.
`num_range("x", 1:5)`	Selects x1 to x5.
`one_of("x", "y", "z")`	Selects variables in a character vector.
`everything()`	Selects all variables.

protein_df %>%
    select(starts_with("Rat"))

# # A tibble: 1,223 x 6
#    Ratio.M.L Ratio.M.L.Sig Ratio.H.L Ratio.H.L.Sig Ratio.H.M
#        <dbl>         <dbl>     <dbl>         <dbl>     <dbl>
#  1    0.0651  0.0000000648    0.0235  0.0000000202     0.351
#  2    0.519   0.350           0.285   0.289            0.500
#  3    0.104   0.0000278       0.0498  0.0000739        0.450
#  4    0.0728  0.0000516       0.0405  0.0000222        0.471
#  5   NA      NA              NA      NA               NA    
#  6    0.174   0.0128          0.166   0.110            0.997
#  7    0.0976  0.000014        0.0793  0.000705         0.661
#  8    0.187   0.00418         0.0374  0.00000167       0.249
#  9   NA      NA              NA      NA               NA    
# 10    0.0656  0.00000016      0.0305  0.000000179      0.421
# # … with 1,213 more rows, and 1 more variable: Ratio.H.M.Sig <dbl>

Remember, we can also use the - notation (like we did in []) to not select specific columns.⁵⁰

protein_df %>%
    select(starts_with("Rat"), -ends_with("Sig"))

# # A tibble: 1,223 x 3
#    Ratio.M.L Ratio.H.L Ratio.H.M
#        <dbl>     <dbl>     <dbl>
#  1    0.0651    0.0235     0.351
#  2    0.519     0.285      0.500
#  3    0.104     0.0498     0.450
#  4    0.0728    0.0405     0.471
#  5   NA        NA         NA    
#  6    0.174     0.166      0.997
#  7    0.0976    0.0793     0.661
#  8    0.187     0.0374     0.249
#  9   NA        NA         NA    
# 10    0.0656    0.0305     0.421
# # … with 1,213 more rows

OK, so now that we have our columns, we can pass them along to some functions:

protein_df %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    log2()

# # A tibble: 1,223 x 3
#    Ratio.M.L Ratio.H.L Ratio.H.M
#        <dbl>     <dbl>     <dbl>
#  1    -3.94      -5.41  -1.51   
#  2    -0.946     -1.81  -1.00   
#  3    -3.27      -4.33  -1.15   
#  4    -3.78      -4.63  -1.09   
#  5    NA         NA     NA      
#  6    -2.52      -2.59  -0.00464
#  7    -3.36      -3.66  -0.598  
#  8    -2.42      -4.74  -2.00   
#  9    NA         NA     NA      
# 10    -3.93      -5.04  -1.25   
# # … with 1,213 more rows

In this very simplistic view, we need to attach the original data frame back to the \(log_2\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr function bind\_cols(). This is a dplyr version of cbind().

protein_df %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    log2() %>%
    bind_cols(protein_df)

# # A tibble: 1,223 x 18
#    Ratio.M.L...1 Ratio.H.L...2 Ratio.H.M...3 Description Uniprot
#            <dbl>         <dbl>         <dbl> <chr>       <chr>  
#  1        -3.94          -5.41      -1.51    >ENSEMBL:E… ""     
#  2        -0.946         -1.81      -1.00    Tubulin al… "TBA1B…
#  3        -3.27          -4.33      -1.15    >ENSEMBL:E… ""     
#  4        -3.78          -4.63      -1.09    >ENSEMBL:E… ""     
#  5        NA             NA         NA       >P01030 SW… ""     
#  6        -2.52          -2.59      -0.00464 >P01966 SW… ""     
#  7        -3.36          -3.66      -0.598   >P02769 SW… ""     
#  8        -2.42          -4.74      -2.00    >P12763 SW… ""     
#  9        NA             NA         NA       >P15636 SW… ""     
# 10        -3.93          -5.04      -1.25    >P34955 SW… ""     
# # … with 1,213 more rows, and 13 more variables: Peptides <int>,
# #   MW.kDa <dbl>, Sequence.Length <int>, Ratio.M.L...9 <dbl>,
# #   Ratio.M.L.Sig <dbl>, Ratio.H.L...11 <dbl>, Ratio.H.L.Sig <dbl>,
# #   Ratio.H.M...13 <dbl>, Ratio.H.M.Sig <dbl>, Intensity.L <int>,
# #   Intensity.M <int>, Intensity.H <int>, Contaminant <chr>

There are even better ways, which we’ll see in the next section.

14.5.4 mutate() & summarise()

Remember, we have two basic types of functions we want to apply to our data:

Transformation using mutate(), and Aggregations using summarise()

Some common normalisations (transformations) include z-scores and scaling on a min-max range [0,1]:

PlantGrowth %>%
  mutate(Z_score = (weight - mean(weight))/sd(weight),
         min.max = (weight - min(weight)) / (max(weight) - min(weight)))

# # A tibble: 30 x 4
#    weight group Z_score min.max
#     <dbl> <fct>   <dbl>   <dbl>
#  1   4.17 ctrl  -1.29     0.213
#  2   5.58 ctrl   0.723    0.732
#  3   5.18 ctrl   0.153    0.585
#  4   6.11 ctrl   1.48     0.926
#  5   4.5  ctrl  -0.817    0.335
#  6   4.61 ctrl  -0.660    0.375
#  7   5.17 ctrl   0.138    0.581
#  8   4.53 ctrl  -0.774    0.346
#  9   5.33 ctrl   0.367    0.640
# 10   5.14 ctrl   0.0956   0.570
# # … with 20 more rows

Descriptive statistics are aggregation functions:

PlantGrowth %>%
  summarise(averge = mean(weight),
            st.dev = sd(weight))

# # A tibble: 1 x 2
#   averge st.dev
#    <dbl>  <dbl>
# 1   5.07  0.701

If we see what we did previously with the \(log_{2}\) ratios, you may decide we need something more elegant. For example in protein_df, we want to transform (i.e. mutate()) 3 columns using \(log_{2}\) and another 3 columns using \(log_{10}\). For starters, we can do this:

protein_df %>%
    select(starts_with("Rat"), -ends_with("Sig")) %>%
    mutate(Ratio.M.L_Log2 = log2(Ratio.M.L),
           Ratio.H.L_Log2 = log2(Ratio.H.L),
           Ratio.H.M_Log2 = log2(Ratio.H.M))

# # A tibble: 1,223 x 6
#    Ratio.M.L Ratio.H.L Ratio.H.M Ratio.M.L_Log2 Ratio.H.L_Log2
#        <dbl>     <dbl>     <dbl>          <dbl>          <dbl>
#  1    0.0651    0.0235     0.351         -3.94           -5.41
#  2    0.519     0.285      0.500         -0.946          -1.81
#  3    0.104     0.0498     0.450         -3.27           -4.33
#  4    0.0728    0.0405     0.471         -3.78           -4.63
#  5   NA        NA         NA             NA              NA   
#  6    0.174     0.166      0.997         -2.52           -2.59
#  7    0.0976    0.0793     0.661         -3.36           -3.66
#  8    0.187     0.0374     0.249         -2.42           -4.74
#  9   NA        NA         NA             NA              NA   
# 10    0.0656    0.0305     0.421         -3.93           -5.04
# # … with 1,213 more rows, and 1 more variable: Ratio.H.M_Log2 <dbl>

But you can imagine that that’s pretty tedious. So to help automate things, we have a series of helper functions.

Table 14.3: Variations of `summarise()` and `mutate()`. For each variant, replace `...` with either `mutate`, `summarise`, or `summarize`.
Function variant	Description	Section
`*_if()`	Conditional mapping	Section 14.5.5
`*_at()`	Specify columns	Section 14.5.6
`*_all()`	All variables	Not discussed
`*_each()`	Each, renaming	Not discussed

14.5.5 summarise_if(), mutate_if()

_if variants mutate or summarise given a conditional equation.

PlantGrowth %>%
  summarise_if(is.numeric, mean)

# # A tibble: 1 x 1
#   weight
#    <dbl>
# 1   5.07

Z.score <- function(x) {(x - mean(x))/sd(x)}
PlantGrowth %>%
  mutate_if(is.numeric, Z.score)

# # A tibble: 30 x 2
#     weight group
#      <dbl> <fct>
#  1 -1.29   ctrl 
#  2  0.723  ctrl 
#  3  0.153  ctrl 
#  4  1.48   ctrl 
#  5 -0.817  ctrl 
#  6 -0.660  ctrl 
#  7  0.138  ctrl 
#  8 -0.774  ctrl 
#  9  0.367  ctrl 
# 10  0.0956 ctrl 
# # … with 20 more rows

14.5.6 summarise_at(), mutate_at()

summarise\_at() allows us to bypass select() by using the helper functions with the vars() function.⁵¹ For example, we can get the mean of specific columns we are interested in:

protein_df %>%
  summarise_at(vars(Peptides, MW.kDa, Sequence.Length), mean)

Or, we can perform transformations on specific columns:

protein_df %>%
  mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2)

This is what we’ll do here, because we can mutate in-situ:

protein_df %>%
    filter(Contaminant != "+") %>%
    mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2) %>%
    mutate_at(vars(starts_with("Int")), funs(log10)) %>%
    mutate(Intensity.H.L = Intensity.H + Intensity.L,
           Intensity.H.M = Intensity.H + Intensity.M,
           Intensity.M.L = Intensity.M + Intensity.L)

14.6 group_by()

So far we’ve seen transformations and aggregation functions on the entire data set, but what of the split component in split-apply-combine? The final piece of the puzzle is the group\_by() function, which tells R how to group data. Let’s return to the play data set from earlier and see if we can accomplish our last three aggregation scenarios:

# Scenario 1: mean of height & width at each type and time
PlayData_t %>%
    group_by(type, time) %>%
    summarise(avg = mean(value))

# # A tibble: 4 x 3
# # Groups:   type [2]
#   type   time   avg
#   <chr> <dbl> <dbl>
# 1 A         1    30
# 2 A         2    40
# 3 B         1    50
# 4 B         2    60

# Scenario 2: mean of time 1 & 2 for each type:
PlayData_t %>%
    group_by(type, key) %>%
    summarise(avg = mean(value))

# # A tibble: 4 x 3
# # Groups:   type [2]
#   type  key      avg
#   <chr> <chr>  <dbl>
# 1 A     height    15
# 2 A     width     55
# 3 B     height    35
# 4 B     width     75

# Scenario 3: mean of A & B at each time point:
PlayData_t %>%
    group_by(time, key) %>%
    summarise(avg = mean(value))

# # A tibble: 4 x 3
# # Groups:   time [2]
#    time key      avg
#   <dbl> <chr>  <dbl>
# 1     1 height    20
# 2     1 width     60
# 3     2 height    30
# 4     2 width     70

See page ?? for a discussion of attributes.↩︎
dplyr is the data-frame centric cousin of an earlier package called plyr. The name plyr brings to mind the word plier, a hand-held tool used to hold, compress and transform all variety of materials.↩︎
When speaking the commands say then.↩︎
Notice the use of the third possible assign operator here, ->, which allows us to readl dplyr commands like grammatically correct sentences: subject %>% verb -> object.↩︎
The - notation is like how we use != in logical expressions in in [] to not select specific observations.↩︎
We also have the possibility to specify more than one function using the funs() function, which we’ll see in the case study in the next chapter.↩︎