Chapter 13 2.1.6 Tidyverse
13.1 Learning Objectives
The dplyr ecosystem is composed of three main components:
- The pipe operator %>%, previously discussed
- The five verbs of dplyr: filter(), arrange(), select(), mutate(), and summarise()
- group_by(), previously discussed
13.2 Split-Apply-Combine with the dplyr Package
Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is Split into sub-groups defined by a categorical (aka factor) variable. A function is then Applied to each individual sub-set, and the results are Combined into a new data-set or added onto the original data-set.
There are many ways to perform split-apply-combine operations in R. Actually, we already saw this in action on the first day of the workshop (4).
PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight),
            stdev = sd(weight))
## # A tibble: 3 × 3
## group avg stdev
## <fct> <dbl> <dbl>
## 1 trt2 5.53 0.443
## 2 trt1 4.66 0.794
## 3 ctrl 5.03 0.583
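The summary above combines the results into a new, smaller data frame. The other option mentioned earlier, adding the results back onto the original data, can be sketched with a grouped mutate(). This is a minimal illustration, not the only way to do it; the names plant_centered, group_avg and centered are made up for this example.

```r
library(dplyr)

# Split by group, apply mean() within each group, and combine the
# result back onto the original rows (same number of rows in and out).
plant_centered <- PlantGrowth %>%
  group_by(group) %>%
  mutate(group_avg = mean(weight),        # group mean repeated on every row
         centered  = weight - group_avg)  # deviation from the group mean
plant_centered <- ungroup(plant_centered)  # drop the grouping when done

head(plant_centered)
```

Every row of PlantGrowth is kept here, with the group-level value attached to each observation, whereas summarise() returns one row per group.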
The apply family of functions are powerful and useful base package functions. Typical examples include apply(), tapply(), sapply(), lapply(), mapply(), by(), and aggregate(). However, they proved difficult for newcomers to master, since the class of the input and output differs from function to function and is not obvious. Recall, one of the biggest problems you’ll encounter is having data in the wrong class! There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr package has now come to dominate split-apply-combine tasks, largely because the commands are intuitive and syntactically uniform. The many packages which have moved R in this direction are collectively called the tidyverse.
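To make the point about classes concrete, here is a small sketch that computes the same group means three ways and checks what comes back. The object names (base_means, agg_means, dplyr_means) are invented for this illustration.

```r
library(dplyr)

# Base R: tapply() takes vectors and returns a named one-dimensional array
base_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
class(base_means)

# Base R: aggregate() takes a formula and returns a data frame
agg_means <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
class(agg_means)

# dplyr: data frame in, data frame (tibble) out
dplyr_means <- PlantGrowth %>%
  group_by(group) %>%
  summarise(weight = mean(weight))
class(dplyr_means)
```

The numbers are identical in all three cases. Only the container differs, and that difference is what tends to trip up later steps.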
The dplyr package brings the programming language closer to spoken language by treating data frames as nouns and the actions performed on them as verbs.
There are three main components to understanding dplyr:[44]
- The pipe operator, as discussed previously
- The five verbs plus the helper functions
- An adverb
13.3 The Five Verbs
There are five verbs which we can use to act upon our noun:
Verb | Operates on | Description | Section |
---|---|---|---|
filter() | Observations | Filter observations given specific criteria. | Section 13.3.1 |
arrange() | Observations | Rearrange observations given specific criteria. | Section 13.3.2 |
select() | Variables | Select variables meeting specific criteria. | Section 13.3.3 |
mutate() | Variables | Apply transformation functions on selected variables. | Section 13.3.4 |
summarise() | Variables | Apply aggregation functions on selected variables. | Section 13.3.4 |
Let’s take a look at these functions in the context of operations we already understand.
13.3.1 filter()
Recall that we can use filter() or [] to filter a data frame.
diamonds %>%
  filter(clarity == "VVS1")
# is the same as
diamonds[diamonds$clarity == "VVS1", ]
The two commands above are equivalent.
The advantage of filter() is that it’s easy to combine many logical expressions with a comma, which acts as a logical AND:
diamonds %>%
  filter(clarity == "VVS1", price < 1000)
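As a quick check of how the comma behaves, the sketch below compares the comma form with an explicit & and with base [] indexing. It assumes the ggplot2 package is installed, since that is where the diamonds data set lives. The object names are made up for the example.

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

# Conditions separated by a comma ...
with_comma <- diamonds %>%
  filter(clarity == "VVS1", price < 1000)

# ... are combined exactly like an explicit logical AND
with_and <- diamonds %>%
  filter(clarity == "VVS1" & price < 1000)

# Base R equivalent: note how often the data frame name is repeated
base_version <- diamonds[diamonds$clarity == "VVS1" & diamonds$price < 1000, ]

identical(with_comma, with_and)
```

One difference to keep in mind: filter() drops rows where the condition evaluates to NA, whereas [] returns a row of NAs for them. The diamonds data set has no missing values, so the two agree here.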
13.3.2 arrange()
We can imagine various definitions of top values, such as:
- Top values defined by a certain cut-off, e.g. values above 35 (use a logical expression with the value)
- Top values according to a quantile, e.g. a given fraction of the total (use a logical expression and define the cut-off dynamically)
- Top n values, e.g. the highest 5 values (use arrange() and take the head(), or use top_n())
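The first two definitions can be sketched with filter() alone. The cut-off of 15000 and the 0.9 quantile below are arbitrary values chosen for illustration, and the code assumes ggplot2 is installed so that diamonds is available.

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

# Case 1: a fixed cut-off (15000 is an arbitrary illustrative value)
above_cutoff <- diamonds %>%
  filter(price > 15000)

# Case 2: a cut-off computed from the data itself,
# here everything above the 90th percentile of price
top_decile <- diamonds %>%
  filter(price > quantile(price, 0.9))

nrow(above_cutoff)
nrow(top_decile)
```

In the second case the threshold moves with the data, so the same line keeps working if the data frame is filtered or updated beforehand.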
We discussed the first two cases already. Here, we’ll handle the third case, first with top_n() and then with arrange():[45]
diamonds %>%
  top_n(20, price) %>%
  select(cut) -> top20cut
Alternatively, we can do this more explicitly:
diamonds %>%
  arrange(desc(price)) %>%
  select(cut) %>%
  .[1:20,] -> top20cut
Some alternatives for the last line in the above commands:
slice(1:20) -> top20cut
or
head(20) -> top20cut
The result is a one-column data frame. Remember, dplyr is very data frame-centric.
glimpse(top20cut)
## Rows: 20
## Columns: 1
## $ cut <ord> Premium, Very Good, Ideal, Ideal, Very Good, Premium, Premium, Pre…
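Two details are worth knowing when choosing between these approaches. top_n() keeps ties and does not sort its result, and since dplyr 1.0.0 it has been superseded by slice_max(). The sketch below compares the explicit arrange() route with slice_max(). It assumes dplyr 1.0.0 or later and an installed ggplot2, and the object names are invented for the example.

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

# Explicit: sort descending, then keep the first 20 rows
top20_sorted <- diamonds %>%
  arrange(desc(price)) %>%
  head(20)

# slice_max(): with_ties = FALSE guarantees exactly n rows
top20_slice <- diamonds %>%
  slice_max(price, n = 20, with_ties = FALSE)

# Both contain the same 20 highest prices, in descending order
identical(top20_sorted$price, top20_slice$price)
```

With the default with_ties = TRUE, slice_max() can return more than n rows when several observations share the n-th largest value, as top_n() does.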
13.3.3 select()
select() is used for choosing specific columns.
diamonds %>%
  filter(clarity == "VVS1", price < 1000) %>%
  select(cut, carat, price)
# recall indexing:
foo_df[1:2] # first two columns
## # A tibble: 6 × 2
## healthy tissue
## <lgl> <chr>
## 1 TRUE Liver
## 2 FALSE Brain
## 3 FALSE Testes
## 4 TRUE Muscle
## 5 TRUE Intestine
## 6 FALSE Heart
foo_df[-3] # all except the 3rd
## # A tibble: 6 × 2
## healthy tissue
## <lgl> <chr>
## 1 TRUE Liver
## 2 FALSE Brain
## 3 FALSE Testes
## 4 TRUE Muscle
## 5 TRUE Intestine
## 6 FALSE Heart
This is all fine and good, but we also have a number of helper functions that we can use to access columns.
Helper function | Description |
---|---|
starts_with(x) | Selects variable names starting with x. |
ends_with(x) | Selects variable names ending in x. |
contains(x) | Selects variable names containing x. |
matches(x) | Selects all variable names matching the regex x. |
num_range("x", 1:5) | Selects x1 to x5. |
one_of("x", "y", "z") | Selects variables named in a character vector. |
everything() | Selects all variables. |
diamonds %>%
  select(starts_with("c"))
Remember, we can also use the - notation (like we did in []) to not select specific columns.[46]
diamonds %>%
  select(-starts_with("c"))
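A few more of the helpers from the table, sketched on the same data. This assumes ggplot2 is installed (for diamonds) and a reasonably recent dplyr. The object names are hypothetical. all_of() is not in the table above: it is the current replacement for one_of(), which has been superseded.

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

# Columns whose names end in "e"
ends_e <- diamonds %>% select(ends_with("e"))

# Columns whose names contain "ar"
has_ar <- diamonds %>% select(contains("ar"))

# Columns matching a regular expression: exactly x, y or z
xyz <- diamonds %>% select(matches("^[xyz]$"))

# Columns named in a character vector
wanted <- c("cut", "price")
by_name <- diamonds %>% select(all_of(wanted))

# Move price to the front and keep everything else after it
price_first <- diamonds %>% select(price, everything())

names(ends_e)
names(has_ar)
names(xyz)
```

everything() is handy for reordering: columns that were already selected are not duplicated, so price appears only once in price_first.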
OK, so now that we have our columns, we can pass them along to some functions:
diamonds %>%
  select(price, carat) %>%
  log10()
In this very simplistic view, we need to attach the original data frame back to the \(\log_{10}\) transformed columns. This is doable, but there are better ways. For example, we can use the dplyr function bind_cols(). This is a dplyr version of cbind().
diamonds %>%
  select(price, carat) %>%
  log10() %>%
  bind_cols(diamonds)
There are even better ways, which we’ll see in the next section.
13.3.4 mutate() & summarise()
Remember, we have two basic types of functions we want to apply to our data:
- Transformations using mutate()
- Aggregations using summarise()
Some common normalisations (transformations) include log transformations, z-scores and scaling onto a min-max range [0,1]. For example, a log10 transformation:
diamonds %>%
  mutate(price_log10 = log10(price),
         carat_log10 = log10(carat))
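The z-score and min-max normalisations follow the same pattern. A minimal sketch, assuming ggplot2 is installed for the diamonds data; the new column names price_z and price_minmax and the object diamonds_scaled are made up for this example:

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

diamonds_scaled <- diamonds %>%
  mutate(
    # z-score: centre on the mean, scale by the standard deviation
    price_z = (price - mean(price)) / sd(price),
    # min-max: rescale so the smallest value is 0 and the largest is 1
    price_minmax = (price - min(price)) / (max(price) - min(price))
  )

diamonds_scaled %>%
  select(price, price_z, price_minmax) %>%
  head()
```

Because mutate() returns the whole data frame with the new columns attached, there is no need to bind the transformed columns back onto the original by hand.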
Descriptive statistics are aggregation functions:
diamonds %>%
  group_by(color) %>%
  summarise(ave = mean(price))
## # A tibble: 7 × 2
## color ave
## <ord> <dbl>
## 1 D 3170.
## 2 E 3077.
## 3 F 3725.
## 4 G 3999.
## 5 H 4487.
## 6 I 5092.
## 7 J 5324.
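summarise() can compute several aggregations in one call, and n() counts the rows in each group. The sketch below is one possible extension of the example above, again assuming ggplot2 is installed; price_by_color and the column names n, med and stdev are hypothetical.

```r
library(dplyr)

# Assumption: ggplot2 is installed; it provides the diamonds data set.
data(diamonds, package = "ggplot2")

price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(n     = n(),            # number of diamonds per colour
            ave   = mean(price),    # same average as computed above
            med   = median(price),
            stdev = sd(price))

price_by_color
```

Reporting the group size next to the mean makes it easier to judge how much weight each average deserves.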
[44] dplyr is the data-frame centric cousin of an earlier package called plyr. The name plyr brings to mind the word plyer, a hand-held tool used to hold, compress and transform all variety of materials.
[45] Notice the use of the third possible assign operator here, ->, which allows us to read dplyr commands like grammatically correct sentences: subject %>% verb -> object.
[46] The - notation is like how we use != in logical expressions in [] to not select specific observations.