13 Tidyverse
13.1 Learning Objectives
The dplyr ecosystem is composed of three main components:
- The pipe operator %>%, previously discussed
- The five verbs of dplyr:
  - filter()
  - arrange()
  - select()
  - mutate()
  - summarise()
- group_by(), previously discussed
13.2 Split-Apply-Combine with the dplyr Package
Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data set is
- Split into sub-groups defined by a categorical (aka factor) variable. A function is then
- Applied to each individual sub-set, and the results are
- Combined into a new data set or added onto the original data set.
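The three steps can be spelled out one at a time in base R. This is only a minimal sketch using the built-in PlantGrowth data; the object names pieces, stats and result are made up for illustration.

```r
# Split: one numeric vector of weights per level of `group`
pieces <- split(PlantGrowth$weight, PlantGrowth$group)

# Apply: compute the same statistics on every piece
stats <- lapply(pieces, function(w) data.frame(avg = mean(w), stdev = sd(w)))

# Combine: stack the per-group results into one data frame
result <- do.call(rbind, stats)
result
```

Each step produces an intermediate object of a different class (a list of vectors, a list of data frames, a data frame), which is part of what makes this style harder to follow.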
There are many ways to perform split-apply-combine operations in R. Actually, we already saw this in action on the first day of the workshop (4).
PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight),
            stdev = sd(weight))
#> # A tibble: 3 × 3
#> group avg stdev
#> <fct> <dbl> <dbl>
#> 1 trt2 5.53 0.443
#> 2 trt1 4.66 0.794
#> 3 ctrl 5.03 0.583
The apply family of functions are powerful and useful base package functions. Typical examples include apply(), tapply(), sapply(), lapply(), mapply(), by(), and aggregate(). However, they proved difficult for newcomers to master, since the class of the input and output differs from function to function and is not obvious. Recall, one of the biggest problems you’ll encounter is having data in the wrong class! There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform. The many packages which have moved R in this direction are called the tidyverse.
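To make the point about differing classes concrete, here is a small sketch that computes the same group means three ways in base R. The object names are illustrative only.

```r
# Three base R routes to the group means of PlantGrowth
m_tapply    <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
m_sapply    <- sapply(split(PlantGrowth$weight, PlantGrowth$group), mean)
m_aggregate <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)

# Same numbers, three different containers
class(m_tapply)     # a one-dimensional array
class(m_sapply)     # a named numeric vector
class(m_aggregate)  # a data frame
```

With dplyr, by contrast, a data frame goes in and a data frame comes out at every step.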
The dplyr package brings the programming language closer to spoken language by referring to data frames as nouns and to the actions performed on them as verbs.
There are three main components to understanding dplyr:
- The pipe operator, as discussed previously
- The five verbs plus the helper functions
- An adverb
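As a rough illustration of how the three components fit together, the sketch below writes the same operation once as a nested call and once with the pipe. It assumes dplyr is installed; nested and piped are illustrative names.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- summarise(group_by(PlantGrowth, group), avg = mean(weight))

# Piped call: noun first, then the adverb (group_by) and the verb (summarise)
piped <- PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight))

# Both routes give the same result
all.equal(nested, piped)
```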
13.3 The Five Verbs
There are five verbs which we can use to act upon our noun:
Verb | Operates on | Description | Section |
---|---|---|---|
filter() | Observations | Filter observations given specific criteria. | Section 13.3.1 |
arrange() | Observations | Rearrange observations given specific criteria. | Section 13.3.2 |
select() | Variables | Select variables meeting specific criteria. | Section 13.3.3 |
mutate() | Variables | Apply transformation functions on selected variables. | Section 13.3.4 |
summarise() | Variables | Apply aggregation functions on selected variables. | Section 13.3.4 |
Let’s take a look at these functions in the context of operations we already understand.
13.3.1 filter()
Recall that we can use filter() or [] to filter a data frame.
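A minimal sketch of both approaches, using the built-in PlantGrowth data as a stand-in (the data set and the object names are choices made here for illustration):

```r
library(dplyr)

# Base R: logical indexing on the rows
ctrl_base <- PlantGrowth[PlantGrowth$group == "ctrl", ]

# dplyr: the same rows with filter()
ctrl_dplyr <- PlantGrowth %>%
  filter(group == "ctrl")
```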
The above functions are equivalent.
The advantage of filter() is that it’s easy to combine many logical expressions with a comma:
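For instance, a sketch with two conditions on the built-in PlantGrowth data (the threshold of 5 is arbitrary and the object names are illustrative):

```r
library(dplyr)

# Comma-separated conditions are combined with a logical AND
with_comma <- PlantGrowth %>%
  filter(group == "trt1", weight > 5)

# The base R equivalent needs an explicit & and repeats the data frame name
with_amp <- PlantGrowth[PlantGrowth$group == "trt1" & PlantGrowth$weight > 5, ]
```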
13.3.2 arrange()
We can imagine various definitions of top values, such as:
- Top values defined by a certain cut-off, e.g. values above 35 (use a logical expression with the value)
- Top values according to a quantile, e.g. a fraction of the total (use a logical expression and define the cut-off dynamically)
- Top n values, e.g. the highest 5 values (arrange() and take the head(), or use top_n())
We discussed the first two cases already. Here, we’ll use the arrange() function:
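The original command is not reproduced here, so the following is only a sketch of what such a pipeline might look like. It assumes the diamonds data set from ggplot2 and that "top" means highest price; the final slice_head() step is likewise a choice made for this example.

```r
library(dplyr)
data(diamonds, package = "ggplot2")  # assumes ggplot2 is installed

diamonds %>%
  arrange(-price) %>%            # sort from most to least expensive
  select(cut) %>%                # keep only the cut column
  slice_head(n = 20) -> top20cut # take the first 20 rows
```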
Alternatively, we can do this more explicitly:
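One plausible reading of "more explicitly" is to spell out the descending sort with desc() instead of a minus sign. This is an assumption, as is the use of the ggplot2 diamonds data and the price column.

```r
library(dplyr)
data(diamonds, package = "ggplot2")  # assumes ggplot2 is installed

diamonds %>%
  arrange(desc(price)) %>%       # desc() states the sort direction by name
  select(cut) %>%
  slice_head(n = 20) -> top20cut
```

Unlike the minus sign, desc() also works on non-numeric columns such as factors and character vectors.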
Some alternatives for the last line in the above commands:
slice(1:20) -> top20cut
or
head(20) -> top20cut
The result is a one-column data frame. Remember, dplyr is very data frame-centric.
glimpse(top20cut)
#> Rows: 20
#> Columns: 1
#> $ cut <ord> Premium, Very Good, Ideal, Ideal, Very Good, P…
13.3.3 select()
select() is used for choosing specific columns.
# recall indexing:
foo_df[1:2] # first two columns
#> # A tibble: 6 × 2
#> healthy tissue
#> <lgl> <chr>
#> 1 TRUE Liver
#> 2 FALSE Brain
#> 3 FALSE Testes
#> 4 TRUE Muscle
#> 5 TRUE Intestine
#> 6 FALSE Heart
foo_df[-3] # all except the 3rd
#> # A tibble: 6 × 2
#> healthy tissue
#> <lgl> <chr>
#> 1 TRUE Liver
#> 2 FALSE Brain
#> 3 FALSE Testes
#> 4 TRUE Muscle
#> 5 TRUE Intestine
#> 6 FALSE Heart
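The same selections can be written with select(). The sketch below rebuilds a stand-in for foo_df: the healthy and tissue columns follow the printed output above, while the third column (called quantity here) and its values are hypothetical.

```r
library(dplyr)

# Stand-in for foo_df; the third column is made up for illustration
foo_df <- tibble(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"),
  quantity = c(1, 2, 3, 4, 5, 6)
)

foo_df %>% select(1:2)              # first two columns, by position
foo_df %>% select(healthy, tissue)  # the same columns, by name
foo_df %>% select(-3)               # all except the 3rd
```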
This is all fine and good, but we also have a number of helper functions that we can use to access columns.
Helper function | Description |
---|---|
starts_with(x) | Selects variable names starting with x. |
ends_with(x) | Selects variable names ending in x. |
contains(x) | Selects variable names containing x. |
matches(x) | Selects all variable names matching the regex x. |
num_range("x", 1:5) | Selects x1 to x5. |
one_of("x", "y", "z") | Selects variables in a character vector. |
everything() | Selects all variables. |
diamonds %>%
  select(starts_with("c"))
Remember, we can also use the - notation (like we did in []) to exclude specific columns.
diamonds %>%
  select(-starts_with("c"))
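The remaining helpers work the same way. Here is a sketch on a small made-up tibble (toy and its column names are hypothetical, chosen only to exercise each helper):

```r
library(dplyr)

# Hypothetical data whose column names exercise the helpers
toy <- tibble(
  id        = 1:3,
  x1        = 1:3,
  x2        = 4:6,
  x3        = 7:9,
  score_raw = c(0.1, 0.2, 0.3),
  score_log = log(c(0.1, 0.2, 0.3))
)

toy %>% select(ends_with("_log"))           # score_log
toy %>% select(contains("score"))           # score_raw, score_log
toy %>% select(matches("^x[0-9]$"))         # x1, x2, x3
toy %>% select(num_range("x", 1:2))         # x1, x2
toy %>% select(score_log, everything())     # score_log first, then the rest
toy %>% select(all_of(c("id", "x1")))       # columns named in a character vector
```

In recent versions of dplyr, one_of() is superseded by all_of() and any_of(), which is why the last line uses all_of().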
OK, so now that we have our columns, we can pass them along to some functions:
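The original example is not shown here; as a stand-in, this sketch selects two numeric columns from the built-in mtcars data and applies a log2 transformation. The data set, the choice of columns and the name logged are assumptions made for illustration.

```r
library(dplyr)

mtcars %>%
  select(starts_with("d")) %>%  # disp and drat, both numeric
  log2() -> logged              # log2 is applied to every selected column

head(logged)
```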
In this very simplistic view, we need to attach the original data frame back to the \(\log_2\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr function bind_cols(). This is a dplyr version of cbind().
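A sketch of that approach, again using mtcars as a stand-in data set. The _log2 suffix and the object names are choices made here so that the new columns do not clash with the originals.

```r
library(dplyr)

# Transform the selected columns and give them distinct names
logged <- mtcars %>%
  select(starts_with("d")) %>%
  log2() %>%
  rename_with(~ paste0(.x, "_log2"))

# Attach the transformed columns to the original data, matching rows by position
combined <- bind_cols(mtcars, logged)
names(combined)
```

Because bind_cols() matches rows purely by position, both inputs must have the same number of rows in the same order.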
There are even better ways, which we’ll see in the next section.
13.3.4 mutate() & summarise()
Remember, we have two basic types of functions we want to apply to our data:
- Transformations using mutate()
- Aggregations using summarise()
Some common normalisations (transformations) include z-scores and scaling on a min-max range [0,1]:
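A minimal sketch of both normalisations with mutate(), applied to the weight column of the built-in PlantGrowth data (the column names z and minmax are illustrative):

```r
library(dplyr)

PlantGrowth %>%
  mutate(
    # z-score: centre on the mean, scale by the standard deviation
    z      = (weight - mean(weight)) / sd(weight),
    # min-max scaling: map the smallest value to 0 and the largest to 1
    minmax = (weight - min(weight)) / (max(weight) - min(weight))
  ) -> scaled

head(scaled)
```

A transformation returns one value per input value, so scaled has the same number of rows as PlantGrowth.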
Descriptive statistics are aggregation functions:
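As a sketch, several descriptive statistics for the weight column of PlantGrowth can be collected in one summarise() call (the output column names are illustrative):

```r
library(dplyr)

PlantGrowth %>%
  summarise(
    n     = n(),             # number of observations
    avg   = mean(weight),
    med   = median(weight),
    stdev = sd(weight)
  ) -> overall

overall
```

An aggregation collapses many values into one, so without group_by() the result is a single row.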