3 The Grammar of Data Analysis with dplyr
3.1 Learning Objectives
The tidyverse
ecosystem is very data frame-centric and often relies on tibbles. dplyr
is a core component of the tidyverse
and is comprised of three main components:
3.2 The grammar of data analysis, dplyr
Moving forward we’ll use the built-in data frame called PlantGrowth
instead of the PG_long
data frame we created in the last chapter. You’ll recognize that it’s the same data frame.
Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is
- Split into sub-groups defined by a categorical (aka factor) variable. A function is then
- Applied to each individual sub-set, and the results are
- Combined into a new data-set or added onto the original data-set.
There are many ways to perform split-apply-combine operations in R, so this is nothing new.
The apply family of functions are powerful are useful base
package functions. Typical base
package functions include apply
, tapply()
, sapply()
, lapply()
, mapply()
, by()
, and aggregrate()
.
However, they proved difficult to master for new-comers since the class for the input and output of each function was different, and not obvious. e.g. Can you identify what the type and structure of this output is:
tapply(PlantGrowth$weight, PlantGrowth$group, mean)
#> ctrl trt1 trt2
#> 5.032 4.661 5.526
For a newbie, that is no easy task! Recall, one of the biggest problems you’ll encounter is having data in the wrong class!
There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr
package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform.
The many packages which have moved R into this direction are called the tidyverse.
The dplyr
package takes the analogy of the program language closer to spoken language by referring to the data frames as nouns and the actions performed as verbs. You’ll also encounter the punctuation %>%
, an adverb group_by()
and a pronoun .data
. The above command is written as below. It’s even more flexible since we can easily ass more functions to summarise.
3.3 Tibbles
In this section we’ll use a larger file. It’s a collection of proteins and some attributes from an experiment to quantify their abundance in various cell lines. Let’s get started by loading the package and reading in a text file called protein.txt
. You can find it using the command below.
The first thing we’d want to do with our data frame is convert it to a tibble. Actually we don’t need to since we used tidyr
functions.
class(protein)
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Recall, when we call:
protein
#> # A tibble: 1,223 × 15
#> Description Uniprot Peptides MW.kDa Sequence.Length
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 >ENSEMBL:ENSBT… <NA> 14 193. 1742
#> 2 Tubulin alpha-… TBA1B_MO… 10 50.2 451
#> 3 >ENSEMBL:ENSBT… <NA> 13 164. 1477
#> 4 >ENSEMBL:ENSBT… <NA> 3 65.2 578
#> 5 >P01030 SWISS-… <NA> 14 193. 1741
#> 6 >P01966 SWISS-… <NA> 3 15.2 142
#> 7 >P02769 SWISS-… <NA> 7 69.3 607
#> 8 >P12763 SWISS-… <NA> 3 38.4 359
#> 9 >P15636 SWISS-… <NA> 1 68.1 653
#> 10 >P34955 SWISS-… <NA> 9 46.1 416
#> # … with 1,213 more rows, and 10 more variables:
#> # Ratio.M.L <dbl>, Ratio.M.L.Sig <dbl>, Ratio.H.L <dbl>,
#> # Ratio.H.L.Sig <dbl>, Ratio.H.M <dbl>,
#> # Ratio.H.M.Sig <dbl>, Intensity.L <dbl>,
#> # Intensity.M <dbl>, Intensity.H <dbl>, Contaminant <chr>
it will print nicer. Another really useful dplyr
function is glimpse()
:
glimpse(protein)
#> Rows: 1,223
#> Columns: 15
#> $ Description <chr> ">ENSEMBL:ENSBTAP00000007350 (Bos …
#> $ Uniprot <chr> NA, "TBA1B_MOUSE;O89052_MOUSE;Q99K…
#> $ Peptides <dbl> 14, 10, 13, 3, 14, 3, 7, 3, 1, 9, …
#> $ MW.kDa <dbl> 192.990, 50.151, 164.340, 65.190, …
#> $ Sequence.Length <dbl> 1742, 451, 1477, 578, 1741, 142, 6…
#> $ Ratio.M.L <dbl> 0.065122, 0.519120, 0.103550, 0.07…
#> $ Ratio.M.L.Sig <dbl> 6.4800e-08, 3.4958e-01, 2.7800e-05…
#> $ Ratio.H.L <dbl> 0.023451, 0.285370, 0.049841, 0.04…
#> $ Ratio.H.L.Sig <dbl> 2.0200e-08, 2.8918e-01, 7.3900e-05…
#> $ Ratio.H.M <dbl> 0.35098, 0.49988, 0.44963, 0.47136…
#> $ Ratio.H.M.Sig <dbl> 0.01815200, 0.25055000, 0.19706000…
#> $ Intensity.L <dbl> 91981000, 30801000, 35614000, 9823…
#> $ Intensity.M <dbl> 9215400, 18667000, 8236200, 182610…
#> $ Intensity.H <dbl> 5263600, 10136000, 4904200, 120500…
#> $ Contaminant <chr> "+", "+", "+", "+", "+", "+", "+",…
This is like str()
, but it provides a more human-friendly output. Those are nice new functions, but let’s move onto the real purpose of dplyr
.
There are three main components to understanding dplyr
1.
- The pipe operator
- The five verbs plus the helper functions, and
- An adverb
3.4 The Five Verbs
There are five verbs which we can use to act upon our noun:
Operates on | Verb | Description | Section |
---|---|---|---|
Observations | filter() |
Filter observations given specific criteria (see subset() ). |
Section 3.4.1 |
Observations | arrange() |
Rearrange observations given specific criteria (see order() ). |
Section 3.4.2 |
Variables | select() |
Select variables meeting specific criteria. | Section 3.4.3 |
Variables | mutate() |
Apply transformation functions on selected variables. | Section 3.4.4 |
Variables | summarise() |
Apply aggregation functions on selected variables. | Section 3.4.4 |
Let’s take a look at these functions in the context of operations we already understand
3.4.1 filter()
Recall that we can use []
to filter a data frame.
protein[protein$Contaminant != "+", ]
The above []
call is equivalent to filter()
, except it the handling of NA values. Here, filter()
is just the dplyr
version:
protein %>%
filter(Contaminant != "+")
The advantage is that it’s easy to combine many logical expressions with an & using a comma:
protein %>%
filter(Contaminant != "+", Ratio.M.L.Sig < 0.05)
3.4.2 arrange()
Previously, we extracted the Uniprot IDs having the top 20 M/L ratios. First we had to get the index of the order and then apply that to the data frame. Here, we can use the arrange()
function:2
protein %>%
arrange(desc(Ratio.M.L)) %>%
select(Uniprot) %>%
.[1:20,] -> topML
Some alternatives for the last line in the above commands:
slice(1:20) -> topML
or
head(20) -> topML
The result is a one-column data frame. Remember, dplyr
is very data frame centric.
glimpse(topML)
#> Rows: 20
#> Columns: 1
#> $ Uniprot <chr> "ANXA2_MOUSE;Q3UCD3_MOUSE;Q542G9_MOUSE;Q99…
3.4.3 select()
select()
is used for choosing specific columns.
protein %>%
select(Ratio.M.L, Ratio.H.L, Ratio.H.M)
#> # A tibble: 1,223 × 3
#> Ratio.M.L Ratio.H.L Ratio.H.M
#> <dbl> <dbl> <dbl>
#> 1 0.0651 0.0235 0.351
#> 2 0.519 0.285 0.500
#> 3 0.104 0.0498 0.450
#> 4 0.0728 0.0405 0.471
#> 5 NA NA NA
#> 6 0.174 0.166 0.997
#> 7 0.0976 0.0793 0.661
#> 8 0.187 0.0374 0.249
#> 9 NA NA NA
#> 10 0.0656 0.0305 0.421
#> # … with 1,213 more rows
# recall indexing:
# foo.df[1:2] # first two columns
# foo.df[-3] # all except the 3rd
This is all fine and good, but we also have a number of helper functions that we can use to access columns.
Helper function | Description |
---|---|
starts_with(x) |
Names starts with x. |
ends_with(x) |
Names ends in x. |
contains(x) |
Selects variable names containing x. |
matches(x) |
Selects all variable names matching the regex x. |
num_range("x", 1:5) |
Selects x1 to x5. |
one_of("x", "y", "z") |
Selects variables in a character vector. |
everything() |
Selects all variables. |
protein %>%
select(starts_with("Rat"))
#> # A tibble: 1,223 × 6
#> Ratio.M.L Ratio.M.L.Sig Ratio.H.L Ratio.H.L.Sig Ratio.H.M
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0651 0.0000000648 0.0235 0.0000000202 0.351
#> 2 0.519 0.350 0.285 0.289 0.500
#> 3 0.104 0.0000278 0.0498 0.0000739 0.450
#> 4 0.0728 0.0000516 0.0405 0.0000222 0.471
#> 5 NA NA NA NA NA
#> 6 0.174 0.0128 0.166 0.110 0.997
#> 7 0.0976 0.000014 0.0793 0.000705 0.661
#> 8 0.187 0.00418 0.0374 0.00000167 0.249
#> 9 NA NA NA NA NA
#> 10 0.0656 0.00000016 0.0305 0.000000179 0.421
#> # … with 1,213 more rows, and 1 more variable:
#> # Ratio.H.M.Sig <dbl>
Remember, we can also use the -
notation (like we did in []
) to not select specific columns.3
protein %>%
select(starts_with("Rat"), -ends_with("Sig"))
#> # A tibble: 1,223 × 3
#> Ratio.M.L Ratio.H.L Ratio.H.M
#> <dbl> <dbl> <dbl>
#> 1 0.0651 0.0235 0.351
#> 2 0.519 0.285 0.500
#> 3 0.104 0.0498 0.450
#> 4 0.0728 0.0405 0.471
#> 5 NA NA NA
#> 6 0.174 0.166 0.997
#> 7 0.0976 0.0793 0.661
#> 8 0.187 0.0374 0.249
#> 9 NA NA NA
#> 10 0.0656 0.0305 0.421
#> # … with 1,213 more rows
OK, so now that we have our columns, we can pass them along to some functions:
protein %>%
select(starts_with("Rat"), -ends_with("Sig")) %>%
log2()
#> # A tibble: 1,223 × 3
#> Ratio.M.L Ratio.H.L Ratio.H.M
#> <dbl> <dbl> <dbl>
#> 1 -3.94 -5.41 -1.51
#> 2 -0.946 -1.81 -1.00
#> 3 -3.27 -4.33 -1.15
#> 4 -3.78 -4.63 -1.09
#> 5 NA NA NA
#> 6 -2.52 -2.59 -0.00464
#> 7 -3.36 -3.66 -0.598
#> 8 -2.42 -4.74 -2.00
#> 9 NA NA NA
#> 10 -3.93 -5.04 -1.25
#> # … with 1,213 more rows
In this very simplistic view, we need to attach the original data frame back to the \(log_2\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr
function bind_cols()
. This is a dplyr
version of cbind()
.
protein %>%
select(starts_with("Rat"), -ends_with("Sig")) %>%
log2() %>%
bind_cols(protein)
#> New names:
#> * Ratio.M.L -> Ratio.M.L...1
#> * Ratio.H.L -> Ratio.H.L...2
#> * Ratio.H.M -> Ratio.H.M...3
#> * Ratio.M.L -> Ratio.M.L...9
#> * Ratio.H.L -> Ratio.H.L...11
#> * ...
#> # A tibble: 1,223 × 18
#> Ratio.M.L...1 Ratio.H.L...2 Ratio.H.M...3 Description
#> <dbl> <dbl> <dbl> <chr>
#> 1 -3.94 -5.41 -1.51 >ENSEMBL:ENSBT…
#> 2 -0.946 -1.81 -1.00 Tubulin alpha-…
#> 3 -3.27 -4.33 -1.15 >ENSEMBL:ENSBT…
#> 4 -3.78 -4.63 -1.09 >ENSEMBL:ENSBT…
#> 5 NA NA NA >P01030 SWISS-…
#> 6 -2.52 -2.59 -0.00464 >P01966 SWISS-…
#> 7 -3.36 -3.66 -0.598 >P02769 SWISS-…
#> 8 -2.42 -4.74 -2.00 >P12763 SWISS-…
#> 9 NA NA NA >P15636 SWISS-…
#> 10 -3.93 -5.04 -1.25 >P34955 SWISS-…
#> # … with 1,213 more rows, and 14 more variables:
#> # Uniprot <chr>, Peptides <dbl>, MW.kDa <dbl>,
#> # Sequence.Length <dbl>, Ratio.M.L...9 <dbl>,
#> # Ratio.M.L.Sig <dbl>, Ratio.H.L...11 <dbl>,
#> # Ratio.H.L.Sig <dbl>, Ratio.H.M...13 <dbl>,
#> # Ratio.H.M.Sig <dbl>, Intensity.L <dbl>,
#> # Intensity.M <dbl>, Intensity.H <dbl>, …
There are even better ways, which we’ll see in the next section.
3.4.4 mutate() & summarise()
Remember, we have two basic types of functions we want to apply to our data:
-
Transformation using
mutate()
, and -
Aggregations using
summarise()
Some common normalizations (transformations) include z-scores and scaling on a min-max range [0,1]:
PlantGrowth %>%
mutate(Z_score = (weight - mean(weight))/sd(weight),
min.max = (weight - min(weight)) / (max(weight) - min(weight)))
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 30 × 4
#> weight group Z_score min.max
#> <dbl> <fct> <dbl> <dbl>
#> 1 4.17 ctrl -1.29 0.213
#> 2 5.58 ctrl 0.723 0.732
#> 3 5.18 ctrl 0.153 0.585
#> 4 6.11 ctrl 1.48 0.926
#> 5 4.5 ctrl -0.817 0.335
#> 6 4.61 ctrl -0.660 0.375
#> 7 5.17 ctrl 0.138 0.581
#> 8 4.53 ctrl -0.774 0.346
#> 9 5.33 ctrl 0.367 0.640
#> 10 5.14 ctrl 0.0956 0.570
#> # … with 20 more rows
Descriptive statistics are aggregation functions:
PlantGrowth %>%
summarise(averge = mean(weight),
st.dev = sd(weight))
#> averge st.dev
#> 1 5.073 0.7011918
If we see what we did previously with the \(log_{2}\) ratios, you may decide we need something more elegant. For example in protein
, we want to transform (i.e. mutate()
) 3 columns using \(log_{2}\) and another 3 columns using \(log_{10}\). For starters, we can do this:
protein %>%
select(starts_with("Rat"), -ends_with("Sig")) %>%
mutate(Ratio.M.L_Log2 = log2(Ratio.M.L),
Ratio.H.L_Log2 = log2(Ratio.H.L),
Ratio.H.M_Log2 = log2(Ratio.H.M))
#> # A tibble: 1,223 × 6
#> Ratio.M.L Ratio.H.L Ratio.H.M Ratio.M.L_Log2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0651 0.0235 0.351 -3.94
#> 2 0.519 0.285 0.500 -0.946
#> 3 0.104 0.0498 0.450 -3.27
#> 4 0.0728 0.0405 0.471 -3.78
#> 5 NA NA NA NA
#> 6 0.174 0.166 0.997 -2.52
#> 7 0.0976 0.0793 0.661 -3.36
#> 8 0.187 0.0374 0.249 -2.42
#> 9 NA NA NA NA
#> 10 0.0656 0.0305 0.421 -3.93
#> # … with 1,213 more rows, and 2 more variables:
#> # Ratio.H.L_Log2 <dbl>, Ratio.H.M_Log2 <dbl>
But you can imagine that that’s pretty tedious. So to help automate things, we have a series of helper functions.
Function variant | Description | Section |
---|---|---|
*_if() |
Conditional mapping | Section 3.4.5 |
*_at() |
Specify columns | Section 3.4.6 |
*_all() |
All variables | Not discussed |
*_each() |
Each, renaming | Not discussed |
3.4.5 summarise_if(), mutate_if()
_if
variants mutate or summarise given a conditional equation.
PlantGrowth %>%
summarise_if(is.numeric, mean)
#> weight
#> 1 5.073
3.4.6 summarise_at(), mutate_at()
summarise_at()
allows us to bypass select()
by using the helper functions with the vars()
function.4 For example, we can get the mean of specific columns we are interested in:
protein %>%
summarise_at(vars(Peptides, MW.kDa, Sequence.Length), mean)
Or, we can perform transformations on specific columns:
protein %>%
mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2)
This is what we’ll do here, because we can mutate in-situ:
protein %>%
filter(Contaminant != "+") %>%
mutate_at(vars(starts_with("Rat"), -ends_with("Sig")), log2) %>%
mutate_at(vars(starts_with("Int")), funs(log10)) %>%
mutate(Intensity.H.L = Intensity.H + Intensity.L,
Intensity.H.M = Intensity.H + Intensity.M,
Intensity.M.L = Intensity.M + Intensity.L)
3.5 group_by()
So far we’ve seen transformations and aggregation functions on the entire data set, but what of the split component in split-apply-combine? The final piece of the puzzle is the group_by()
function, which tells R how to group data.
PlantGrowth %>%
group_by(group) %>%
summarise(avg = mean(weight))
#> # A tibble: 3 × 2
#> group avg
#> <fct> <dbl>
#> 1 ctrl 5.03
#> 2 trt1 4.66
#> 3 trt2 5.53
3.6 The glue package
The glue package is useful for creating strings as combinations of characters and computed values and is a bit easier that using paste()
.
library(tidyverse)
iris %>%
gather(key, value, -Species) %>%
group_by(Species, key) %>%
summarise(avg = mean(value)) -> iris_summary
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
plant <- "versicolor"
metric <- "Petal.Length"
iris_summary$avg[iris_summary$Species == plant & iris_summary$key == metric]
#> [1] 4.26
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
glue('My average {metric} of {plant} is {iris_summary$avg[iris_summary$Species == plant & iris_summary$key == metric]}cm.')
#> My average Petal.Length of versicolor is 4.26cm.
library(dplyr)
head(iris) %>%
mutate(description = glue("This {Species} has a petal length of {Petal.Length}"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> description
#> 1 This setosa has a petal length of 1.4
#> 2 This setosa has a petal length of 1.4
#> 3 This setosa has a petal length of 1.3
#> 4 This setosa has a petal length of 1.5
#> 5 This setosa has a petal length of 1.4
#> 6 This setosa has a petal length of 1.7