Chapter 14 Element 7: Tidyverse – dplyr
14.1 Learning Objectives
The dplyr
ecosystem is composed of three main components:
- The pipe operator
%>%
, - The five verbs of
dplyr
:filter()
,arrange()
,select()
,mutate()
,summarise()
, and
group_by()
14.2 Split-Apply-Combine with the dplyr Package
Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is
Split
into sub-groups defined by a categorical (aka factor) variable. A function is thenApplied
to each individual sub-set, and the results areCombined
into a new data-set or added onto the original data-set.
There are many ways to perform split-apply-combine operations in R. Actually, we already saw this in action on the first day of the workshop (4).
# # A tibble: 3 x 3
# group avg stdev
# <fct> <dbl> <dbl>
# 1 trt2 5.53 0.443
# 2 trt1 4.66 0.794
# 3 ctrl 5.03 0.583
The apply family of functions are powerful are useful base
package functions.
base
package functions include apply
, tapply()
, sapply()
, lapply()
, mapply()
, by()
, and aggregrate()
.
dplyr
package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform.
The dplyr
package takes the analogy of the program language closer to spoken language by referring to the data frames as nouns and the actions performed as verbs.
14.3 Tibbles
Let’s get started by loading the package. We’ll see some familiar operations done in a new way, so let’s re-import the protein.txt file, just to be sure we’re all on the same page.
library(tidyverse) # or only dply with library(dplyr)
protein_df <- read.delim("protein.txt", stringsAsFactors = FALSE)
The first thing we want to do with our data frame is convert it to a tibble. Actually we’re not changing anything about the data frame, it’s still a data frame, but we’re adding a new attribute, so that
# [1] "data.frame"
# [1] "data.frame"
just adds some attributes to the object:46
# [1] "tbl_df" "tbl" "data.frame"
The only reason we do this is so that when we call
# # A tibble: 1,223 x 15
# Description Uniprot Peptides MW.kDa Sequence.Length Ratio.M.L
# <chr> <chr> <int> <dbl> <int> <dbl>
# 1 >ENSEMBL:E… "" 14 193. 1742 0.0651
# 2 Tubulin al… "TBA1B… 10 50.2 451 0.519
# 3 >ENSEMBL:E… "" 13 164. 1477 0.104
# 4 >ENSEMBL:E… "" 3 65.2 578 0.0728
# 5 >P01030 SW… "" 14 193. 1741 NA
# 6 >P01966 SW… "" 3 15.2 142 0.174
# 7 >P02769 SW… "" 7 69.3 607 0.0976
# 8 >P12763 SW… "" 3 38.4 359 0.187
# 9 >P15636 SW… "" 1 68.1 653 NA
# 10 >P34955 SW… "" 9 46.1 416 0.0656
# # … with 1,213 more rows, and 9 more variables:
# # Ratio.M.L.Sig <dbl>, Ratio.H.L <dbl>, Ratio.H.L.Sig <dbl>,
# # Ratio.H.M <dbl>, Ratio.H.M.Sig <dbl>, Intensity.L <int>,
# # Intensity.M <int>, Intensity.H <int>, Contaminant <chr>
it will print nicer. Another really useful dplyr
function is glimpse()
:
# Rows: 1,223
# Columns: 15
# $ Description <chr> ">ENSEMBL:ENSBTAP00000007350 (Bos taurus)…
# $ Uniprot <chr> "", "TBA1B_MOUSE;O89052_MOUSE;Q99KA2_MOUS…
# $ Peptides <int> 14, 10, 13, 3, 14, 3, 7, 3, 1, 9, 9, 1, 7…
# $ MW.kDa <dbl> 193, 50, 164, 65, 193, 15, 69, 38, 68, 46…
# $ Sequence.Length <int> 1742, 451, 1477, 578, 1741, 142, 607, 359…
# $ Ratio.M.L <dbl> 0.065, 0.519, 0.104, 0.073, NA, 0.174, 0.…
# $ Ratio.M.L.Sig <dbl> 6.5e-08, 3.5e-01, 2.8e-05, 5.2e-05, NA, 1…
# $ Ratio.H.L <dbl> 0.023, 0.285, 0.050, 0.040, NA, 0.166, 0.…
# $ Ratio.H.L.Sig <dbl> 2.0e-08, 2.9e-01, 7.4e-05, 2.2e-05, NA, 1…
# $ Ratio.H.M <dbl> 0.35, 0.50, 0.45, 0.47, NA, 1.00, 0.66, 0…
# $ Ratio.H.M.Sig <dbl> 0.01815, 0.25055, 0.19706, 0.26863, NA, 0…
# $ Intensity.L <int> 91981000, 30801000, 35614000, 9823100, 19…
# $ Intensity.M <int> 9215400, 18667000, 8236200, 1826100, 9003…
# $ Intensity.H <int> 5263600, 10136000, 4904200, 1205000, 1443…
# $ Contaminant <chr> "+", "+", "+", "+", "+", "+", "+", "+", "…
This is like str()
, but it provides a more human-friendly output. Those are nice new functions, but let’s move onto the real purpose of dplyr
.
There are three main components to understanding dplyr
.47
- The pipe operator
- The five verbs plus the helper functions
- An adverb
14.4 The pipe operator - %>%
The pipe operator is a shorthand for calling functions, such that:
function(x) == x \%>\% function()
48
This is not specific to dplyr
functions. It can be used with any R
functions:
# [1] 5.5
# [1] 5.5
The advantage here is that we can string together a long series of functions that would be very difficult to read in the regular R
nomenclature, which we’ll see in a minute as our functions get more complex.
14.5 The Five Verbs
There are five verbs which we can use to act upon our noun:
Verb | Description | Section |
---|---|---|
Operating on observations | ||
filter() |
Filter observations given specific criteria (see subset() ). |
Section 14.5.1 |
arrange() |
Rearrange observations given specific criteria (see order() ). |
Section 14.5.2 |
Operating on variables | ||
select() |
Select variables meeting specific criteria. | Section 14.5.3 |
mutate() |
Apply transformation functions on selected variables. | Section 14.5.4 |
summarise() |
Apply aggregation functions on selected variables. | Section 14.5.4 |
Let’s take a look at these functions in the context of operations we already understand
14.5.1 filter()
Recall that we can use subset()
or []
to filter a data frame.
The above functions are equivalent. Here, filter()
is just the dplyr
version:
The advantage is that it’s easy to combine many logical expressions with an & using a comma:
14.5.2 arrange()
Previously, we extracted the Uniprot IDs having the top 20 M/L ratios. First we had to get the index of the order and then apply that to the data frame. Here, we can use the arrange()
function:49
Some alternatives for the last line in the above commands:
slice(1:20) -> topML
or
head(20) -> topML
The result is a one-column data frame. Remember, dplyr
is very data frame centric.
# Rows: 20
# Columns: 1
# $ Uniprot <chr> "ANXA2_MOUSE;Q3UCD3_MOUSE;Q542G9_MOUSE;Q99KH3_MOU…
14.5.3 select()
select()
is used for choosing specific columns.
# # A tibble: 1,223 x 3
# Ratio.M.L Ratio.H.L Ratio.H.M
# <dbl> <dbl> <dbl>
# 1 0.0651 0.0235 0.351
# 2 0.519 0.285 0.500
# 3 0.104 0.0498 0.450
# 4 0.0728 0.0405 0.471
# 5 NA NA NA
# 6 0.174 0.166 0.997
# 7 0.0976 0.0793 0.661
# 8 0.187 0.0374 0.249
# 9 NA NA NA
# 10 0.0656 0.0305 0.421
# # … with 1,213 more rows
# healthy tissue
# 1 TRUE Liver
# 2 FALSE Brain
# 3 FALSE Testes
# 4 TRUE Muscle
# 5 TRUE Intestine
# 6 FALSE Heart
# healthy tissue
# 1 TRUE Liver
# 2 FALSE Brain
# 3 FALSE Testes
# 4 TRUE Muscle
# 5 TRUE Intestine
# 6 FALSE Heart
This is all fine and good, but we also have a number of helper functions that we can use to access columns.
Helper function | Description |
---|---|
starts_with(x) |
Names starts with x. |
ends_with(x) |
Names ends in x. |
contains(x) |
Selects variable names containing x. |
matches(x) |
Selects all variable names matching the regex x. |
num_range("x", 1:5) |
Selects x1 to x5. |
one_of("x", "y", "z") |
Selects variables in a character vector. |
everything() |
Selects all variables. |
# # A tibble: 1,223 x 6
# Ratio.M.L Ratio.M.L.Sig Ratio.H.L Ratio.H.L.Sig Ratio.H.M
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.0651 0.0000000648 0.0235 0.0000000202 0.351
# 2 0.519 0.350 0.285 0.289 0.500
# 3 0.104 0.0000278 0.0498 0.0000739 0.450
# 4 0.0728 0.0000516 0.0405 0.0000222 0.471
# 5 NA NA NA NA NA
# 6 0.174 0.0128 0.166 0.110 0.997
# 7 0.0976 0.000014 0.0793 0.000705 0.661
# 8 0.187 0.00418 0.0374 0.00000167 0.249
# 9 NA NA NA NA NA
# 10 0.0656 0.00000016 0.0305 0.000000179 0.421
# # … with 1,213 more rows, and 1 more variable: Ratio.H.M.Sig <dbl>
Remember, we can also use the -
notation (like we did in []
) to not select specific columns.50
# # A tibble: 1,223 x 3
# Ratio.M.L Ratio.H.L Ratio.H.M
# <dbl> <dbl> <dbl>
# 1 0.0651 0.0235 0.351
# 2 0.519 0.285 0.500
# 3 0.104 0.0498 0.450
# 4 0.0728 0.0405 0.471
# 5 NA NA NA
# 6 0.174 0.166 0.997
# 7 0.0976 0.0793 0.661
# 8 0.187 0.0374 0.249
# 9 NA NA NA
# 10 0.0656 0.0305 0.421
# # … with 1,213 more rows
OK, so now that we have our columns, we can pass them along to some functions:
# # A tibble: 1,223 x 3
# Ratio.M.L Ratio.H.L Ratio.H.M
# <dbl> <dbl> <dbl>
# 1 -3.94 -5.41 -1.51
# 2 -0.946 -1.81 -1.00
# 3 -3.27 -4.33 -1.15
# 4 -3.78 -4.63 -1.09
# 5 NA NA NA
# 6 -2.52 -2.59 -0.00464
# 7 -3.36 -3.66 -0.598
# 8 -2.42 -4.74 -2.00
# 9 NA NA NA
# 10 -3.93 -5.04 -1.25
# # … with 1,213 more rows
In this very simplistic view, we need to attach the original data frame back to the \(log_2\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr
function bind\_cols()
. This is a dplyr
version of cbind()
.
# # A tibble: 1,223 x 18
# Ratio.M.L...1 Ratio.H.L...2 Ratio.H.M...3 Description Uniprot
# <dbl> <dbl> <dbl> <chr> <chr>
# 1 -3.94 -5.41 -1.51 >ENSEMBL:E… ""
# 2 -0.946 -1.81 -1.00 Tubulin al… "TBA1B…
# 3 -3.27 -4.33 -1.15 >ENSEMBL:E… ""
# 4 -3.78 -4.63 -1.09 >ENSEMBL:E… ""
# 5 NA NA NA >P01030 SW… ""
# 6 -2.52 -2.59 -0.00464 >P01966 SW… ""
# 7 -3.36 -3.66 -0.598 >P02769 SW… ""
# 8 -2.42 -4.74 -2.00 >P12763 SW… ""
# 9 NA NA NA >P15636 SW… ""
# 10 -3.93 -5.04 -1.25 >P34955 SW… ""
# # … with 1,213 more rows, and 13 more variables: Peptides <int>,
# # MW.kDa <dbl>, Sequence.Length <int>, Ratio.M.L...9 <dbl>,
# # Ratio.M.L.Sig <dbl>, Ratio.H.L...11 <dbl>, Ratio.H.L.Sig <dbl>,
# # Ratio.H.M...13 <dbl>, Ratio.H.M.Sig <dbl>, Intensity.L <int>,
# # Intensity.M <int>, Intensity.H <int>, Contaminant <chr>
There are even better ways, which we’ll see in the next section.
14.5.4 mutate() & summarise()
Remember, we have two basic types of functions we want to apply to our data:
Transformation using mutate()
, and
Aggregations using summarise()
Some common normalisations (transformations) include z-scores and scaling on a min-max range [0,1]:
PlantGrowth %>%
mutate(Z_score = (weight - mean(weight))/sd(weight),
min.max = (weight - min(weight)) / (max(weight) - min(weight)))
# # A tibble: 30 x 4
# weight group Z_score min.max
# <dbl> <fct> <dbl> <dbl>
# 1 4.17 ctrl -1.29 0.213
# 2 5.58 ctrl 0.723 0.732
# 3 5.18 ctrl 0.153 0.585
# 4 6.11 ctrl 1.48 0.926
# 5 4.5 ctrl -0.817 0.335
# 6 4.61 ctrl -0.660 0.375
# 7 5.17 ctrl 0.138 0.581
# 8 4.53 ctrl -0.774 0.346
# 9 5.33 ctrl 0.367 0.640
# 10 5.14 ctrl 0.0956 0.570
# # … with 20 more rows
Descriptive statistics are aggregation functions:
# # A tibble: 1 x 2
# averge st.dev
# <dbl> <dbl>
# 1 5.07 0.701
If we see what we did previously with the \(log_{2}\) ratios, you may decide we need something more elegant. For example in protein_df
, we want to transform (i.e. mutate()
) 3 columns using \(log_{2}\) and another 3 columns using \(log_{10}\). For starters, we can do this:
protein_df %>%
select(starts_with("Rat"), -ends_with("Sig")) %>%
mutate(Ratio.M.L_Log2 = log2(Ratio.M.L),
Ratio.H.L_Log2 = log2(Ratio.H.L),
Ratio.H.M_Log2 = log2(Ratio.H.M))
# # A tibble: 1,223 x 6
# Ratio.M.L Ratio.H.L Ratio.H.M Ratio.M.L_Log2 Ratio.H.L_Log2
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.0651 0.0235 0.351 -3.94 -5.41
# 2 0.519 0.285 0.500 -0.946 -1.81
# 3 0.104 0.0498 0.450 -3.27 -4.33
# 4 0.0728 0.0405 0.471 -3.78 -4.63
# 5 NA NA NA NA NA
# 6 0.174 0.166 0.997 -2.52 -2.59
# 7 0.0976 0.0793 0.661 -3.36 -3.66
# 8 0.187 0.0374 0.249 -2.42 -4.74
# 9 NA NA NA NA NA
# 10 0.0656 0.0305 0.421 -3.93 -5.04
# # … with 1,213 more rows, and 1 more variable: Ratio.H.M_Log2 <dbl>
But you can imagine that that’s pretty tedious. So to help automate things, we have a series of helper functions.
Function variant | Description | Section |
---|---|---|
*_if() |
Conditional mapping | Section 14.5.5 |
*_at() |
Specify columns | Section 14.5.6 |
*_all() |
All variables | Not discussed |
*_each() |
Each, renaming | Not discussed |
14.5.5 summarise_if(), mutate_if()
_if
variants mutate or summarise given a conditional equation.
# # A tibble: 1 x 1
# weight
# <dbl>
# 1 5.07
# # A tibble: 30 x 2
# weight group
# <dbl> <fct>
# 1 -1.29 ctrl
# 2 0.723 ctrl
# 3 0.153 ctrl
# 4 1.48 ctrl
# 5 -0.817 ctrl
# 6 -0.660 ctrl
# 7 0.138 ctrl
# 8 -0.774 ctrl
# 9 0.367 ctrl
# 10 0.0956 ctrl
# # … with 20 more rows
14.5.6 summarise_at(), mutate_at()
summarise\_at()
allows us to bypass select()
by using the helper functions with the vars()
function.51 For example, we can get the mean of specific columns we are interested in:
Or, we can perform transformations on specific columns:
This is what we’ll do here, because we can mutate in-situ:
14.6 group_by()
So far we’ve seen transformations and aggregation functions on the entire data set, but what of the split component in split-apply-combine? The final piece of the puzzle is the group\_by()
function, which tells R how to group data. Let’s return to the play data set from earlier and see if we can accomplish our last three aggregation scenarios:
# Scenario 1: mean of height & width at each type and time
PlayData_t %>%
group_by(type, time) %>%
summarise(avg = mean(value))
# # A tibble: 4 x 3
# # Groups: type [2]
# type time avg
# <chr> <dbl> <dbl>
# 1 A 1 30
# 2 A 2 40
# 3 B 1 50
# 4 B 2 60
# Scenario 2: mean of time 1 & 2 for each type:
PlayData_t %>%
group_by(type, key) %>%
summarise(avg = mean(value))
# # A tibble: 4 x 3
# # Groups: type [2]
# type key avg
# <chr> <chr> <dbl>
# 1 A height 15
# 2 A width 55
# 3 B height 35
# 4 B width 75
# Scenario 3: mean of A & B at each time point:
PlayData_t %>%
group_by(time, key) %>%
summarise(avg = mean(value))
# # A tibble: 4 x 3
# # Groups: time [2]
# time key avg
# <dbl> <chr> <dbl>
# 1 1 height 20
# 2 1 width 60
# 3 2 height 30
# 4 2 width 70
dplyr
is the data-frame centric cousin of an earlier package calledplyr
. The nameplyr
brings to mind the word plier, a hand-held tool used to hold, compress and transform all variety of materials.↩︎When speaking the commands say then.↩︎
Notice the use of the third possible assign operator here,
->
, which allows us to readldplyr
commands like grammatically correct sentences: subject %>% verb -> object.↩︎The
-
notation is like how we use!=
in logical expressions in in[]
to not select specific observations.↩︎We also have the possibility to specify more than one function using the
funs()
function, which we’ll see in the case study in the next chapter.↩︎