Chapter 13 2.1.6 Tidyverse

13.1 Learning Objectives

The dplyr ecosystem is composed of three main components:

  • The pipe operator %>%, previously discussed
  • The five verbs of dplyr:
    • filter(),
    • arrange(),
    • select(),
    • mutate(),
    • summarise(), and
  • group_by(), previously discussed

13.2 Split-Apply-Combine with the dplyr Package

Split-apply-combine refers to a series of actions that are often repeated in data analysis and statistics. A data-set is

  • Split into sub-groups defined by a categorical (aka factor) variable. A function is then
  • Applied to each individual sub-set, and the results are
  • Combined into a new data-set or added onto the original data-set.

There are many ways to perform split-apply-combine operations in R. Actually, we already saw this in action on the first day of the workshop (4).

PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight),
            stdev = sd(weight))
## # A tibble: 3 × 3
##   group   avg stdev
##   <fct> <dbl> <dbl>
## 1 ctrl   5.03 0.583
## 2 trt1   4.66 0.794
## 3 trt2   5.53 0.443

The apply family of functions are powerful and useful base package functions. Typical examples include apply(), tapply(), sapply(), lapply(), mapply(), by(), and aggregate(). However, they proved difficult for newcomers to master, since the class of the input and output differs from function to function and is not obvious. Recall that one of the biggest problems you’ll encounter is having data in the wrong class! There has been a major effort within the R community to develop easier tools for performing these tasks. The dplyr package has now come to dominate split-apply-combine tasks, largely because its commands are intuitive and syntactically uniform. The many packages that have moved R in this direction are collectively called the tidyverse.
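To make the class problem concrete, here is a small sketch that computes the same group means on PlantGrowth with several base functions. The object names (groups, avg_list, and so on) are made up for this illustration; only base R is assumed.

```r
# Split-apply-combine on PlantGrowth using base R only.
# Each approach gives the same three means, but in a different class.

# Split: a list with one numeric vector per group
groups <- split(PlantGrowth$weight, PlantGrowth$group)

# Apply with lapply(): the result is a list
avg_list <- lapply(groups, mean)

# Apply with sapply(): the result is simplified to a named numeric vector
avg_vec <- sapply(groups, mean)

# tapply() splits and applies in one step: the result is a 1-d array
avg_tapply <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# aggregate() splits, applies and combines: the result is a data frame
avg_df <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)

# Compare the classes of the four results
class(avg_list)
class(avg_vec)
class(avg_tapply)
class(avg_df)
```

The dplyr version shown earlier always starts from a data frame and returns a data frame, which is what makes chaining the steps so much easier.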

The dplyr package brings the programming language closer to spoken language by referring to data frames as nouns and to the actions performed on them as verbs.

There are three main components to understanding dplyr.[1]

  • The pipe operator, as discussed previously
  • The five verbs plus the helper functions
  • An adverb, group_by()

13.3 The Five Verbs

There are five verbs which we can use to act upon our noun:

Table 13.1: The five dplyr verbs.

| Verb        | Operates on  | Description                                            | Section        |
|-------------|--------------|--------------------------------------------------------|----------------|
| filter()    | Observations | Filter observations given specific criteria.           | Section 13.3.1 |
| arrange()   | Observations | Rearrange observations given specific criteria.        | Section 13.3.2 |
| select()    | Variables    | Select variables meeting specific criteria.            | Section 13.3.3 |
| mutate()    | Variables    | Apply transformation functions on selected variables.  | Section 13.3.4 |
| summarise() | Variables    | Apply aggregation functions on selected variables.     | Section 13.3.4 |

Let’s take a look at these functions in the context of operations we already understand.

13.3.1 filter()

Recall that we can use filter() or [] to filter a data frame.

diamonds %>% 
  filter(clarity == "VVS1")

# is the same as
diamonds[diamonds$clarity == "VVS1", ]

The two commands above are equivalent.

The advantage of filter() is that it’s easy to combine many logical expressions: conditions separated by a comma must all be true for a row to be kept.

diamonds %>% 
  filter(clarity == "VVS1", price < 1000)
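As a quick check of how the comma behaves, the sketch below compares it with an explicit & and contrasts both with |. It assumes dplyr and ggplot2 (for the diamonds data set) are installed; the object names are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Conditions separated by a comma are combined with a logical AND ...
and_comma <- diamonds %>%
  filter(clarity == "VVS1", price < 1000)

# ... so this explicit version with & keeps the same rows
and_amp <- diamonds %>%
  filter(clarity == "VVS1" & price < 1000)

all.equal(and_comma, and_amp)

# A logical OR has no comma shortcut and must be written with |
either <- diamonds %>%
  filter(clarity == "VVS1" | price < 1000)

# The OR version keeps at least as many rows as the AND version
nrow(and_comma)
nrow(either)
```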

13.3.2 arrange()

We can imagine various definitions of top values, such as:

  • Top values defined by a certain cut-off, e.g. values above 35 (use a logical expression with the value)
  • Top values according to a quantile, e.g. a given fraction of the total (use a logical expression and define the cut-off dynamically)
  • Top n number of values, e.g. the highest 5 values (arrange() and take the head(), or use top_n())

We discussed the first two cases already. Here, we’ll look at the third case, first with top_n() and then with arrange():[2]

diamonds %>%
  top_n(20, price) %>% 
  select(cut) -> top20cut

Alternatively, we can do this more explicitly:

diamonds %>%
    arrange(desc(price)) %>%
    select(cut) %>%
    .[1:20,] -> top20cut

Some alternatives for the last line in the above commands:

slice(1:20) -> top20cut

or

head(20)  -> top20cut

The result is a one-column data frame. Remember, dplyr is very data frame-centric.

glimpse(top20cut)
## Rows: 20
## Columns: 1
## $ cut <ord> Premium, Very Good, Ideal, Ideal, Very Good, Premium, Premium, Pre…
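In dplyr 1.0.0 and later, top_n() is documented as superseded by slice_max(). A minimal sketch of the same task, assuming a recent dplyr and ggplot2 for the diamonds data set; top20cut_max is a name made up for this example:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# slice_max() keeps the rows with the largest values of a variable.
# with_ties = FALSE guarantees exactly n rows even if prices are tied.
diamonds %>%
  slice_max(price, n = 20, with_ties = FALSE) %>%
  select(cut) -> top20cut_max

glimpse(top20cut_max)
```

Unlike top_n(), slice_max() returns the rows sorted by the chosen variable, so it behaves more like the explicit arrange() version above.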

13.3.3 select()

select() is used for choosing specific columns.

diamonds %>%
    filter(clarity == "VVS1", price < 1000) %>% 
    select(cut, carat, price)
# recall indexing:
foo_df[1:2] # first two columns
## # A tibble: 6 × 2
##   healthy tissue   
##   <lgl>   <chr>    
## 1 TRUE    Liver    
## 2 FALSE   Brain    
## 3 FALSE   Testes   
## 4 TRUE    Muscle   
## 5 TRUE    Intestine
## 6 FALSE   Heart
foo_df[-3] # all except the 3rd
## # A tibble: 6 × 2
##   healthy tissue   
##   <lgl>   <chr>    
## 1 TRUE    Liver    
## 2 FALSE   Brain    
## 3 FALSE   Testes   
## 4 TRUE    Muscle   
## 5 TRUE    Intestine
## 6 FALSE   Heart

This is all fine and good, but we also have a number of helper functions that we can use to access columns.

Table 13.2: The dplyr helper functions.

| Helper function       | Description                                       |
|-----------------------|---------------------------------------------------|
| starts_with(x)        | Selects variable names starting with x.           |
| ends_with(x)          | Selects variable names ending with x.             |
| contains(x)           | Selects variable names containing x.              |
| matches(x)            | Selects all variable names matching the regex x.  |
| num_range("x", 1:5)   | Selects x1 to x5.                                 |
| one_of("x", "y", "z") | Selects variables in a character vector.          |
| everything()          | Selects all variables.                            |

diamonds %>%
    select(starts_with("c"))

Remember, we can also use the - notation (like we did in []) to not select specific columns.[3]

diamonds %>%
    select(-starts_with("c"))
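A few of the other helper functions from the table can be tried on diamonds in the same way. This is only a sketch, assuming dplyr and ggplot2 are installed; the object names are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Columns whose names end in "e": table and price
ends_e <- diamonds %>%
  select(ends_with("e"))

# Columns whose names contain "ar": carat and clarity
has_ar <- diamonds %>%
  select(contains("ar"))

# Columns whose names match a regular expression: exactly x, y or z
xyz_cols <- diamonds %>%
  select(matches("^[xyz]$"))

names(ends_e)
names(has_ar)
names(xyz_cols)
```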

OK, so now that we have our columns, we can pass them along to some functions:

diamonds %>%
    select(price, carat) %>%
    log10()

Taken at face value, this leaves us with only the \(\log_{10}\)-transformed columns, which we would then need to attach back to the original data frame. This is doable, but there are better ways. For example, we can use the dplyr function bind_cols(), a dplyr version of cbind().

diamonds %>%
    select(price, carat) %>%
    log10() %>%
    bind_cols(diamonds)

There are even better ways, which we’ll see in the next section.

13.3.4 mutate() & summarise()

Remember, we have two basic types of functions we want to apply to our data:

  • Transformation using mutate()
  • Aggregations using summarise()

Common transformations include log transformations, z-scores, and min-max scaling to the range [0, 1]. Here we apply a \(\log_{10}\) transformation:

diamonds %>%
  mutate(price_log10 = log10(price),
         carat_log10 = log10(carat))
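The z-score and min-max scaling mentioned above can be written in the same way. A minimal sketch, assuming dplyr and ggplot2 are installed; the new column names price_z and price_minmax are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_scaled <- diamonds %>%
  mutate(
    # z-score: centre on the mean, then divide by the standard deviation
    price_z = (price - mean(price)) / sd(price),
    # min-max scaling: map the smallest value to 0 and the largest to 1
    price_minmax = (price - min(price)) / (max(price) - min(price))
  )

# Check the range of the min-max scaled column
range(diamonds_scaled$price_minmax)
```

Because mutate() adds the new columns to the existing data frame, there is no need to re-attach anything with bind_cols().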

Descriptive statistics are aggregation functions:

diamonds %>%
  group_by(color) %>% 
  summarise(ave = mean(price))
## # A tibble: 7 × 2
##   color   ave
##   <ord> <dbl>
## 1 D     3170.
## 2 E     3077.
## 3 F     3725.
## 4 G     3999.
## 5 H     4487.
## 6 I     5092.
## 7 J     5324.
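group_by() also works with mutate(). In that case the results are not collapsed to one row per group, but added onto the original data set, which is the second variant of the combine step described at the start of this chapter. A small sketch on PlantGrowth, assuming dplyr is installed; the column and object names are made up for this example.

```r
library(dplyr)

plant_centered <- PlantGrowth %>%
  group_by(group) %>%
  # mean(weight) is computed separately within each group
  mutate(group_avg = mean(weight),
         weight_centered = weight - group_avg) %>%
  # remove the grouping so later steps act on the whole data frame
  ungroup()

# All original rows are kept, with two new columns added
glimpse(plant_centered)
```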



  1. dplyr is the data frame-centric cousin of an earlier package called plyr. The name plyr brings to mind the word pliers, a hand-held tool used to hold, compress and transform all variety of materials.

  2. Notice the use of the third possible assignment operator here, ->, which allows us to read dplyr commands like grammatically correct sentences: subject %>% verb -> object.

  3. The - notation is like how we use != in logical expressions in [] to not select specific observations.