9 Control Structures & Reiterations

9.1 Learning Objectives

By the end of this chapter you should:

Understand the two main types of control structures: conditional statements and reiterations
Be able to manage multiple files in an efficient way

9.2 Control Structures

Control structures are a core feature of every programming and scripting language. They allow you to repeat or execute a set of commands given the outcome of a specific condition (e.g. a logical expression). Control structures are used when you assemble a series of commands into a script. We will take a look at two of the most common structures and give a practical example with a data-set provided for the workshop.

for loops will repeat commands for a fixed number of times.⁶

if statements will execute commands based on the outcome of a logical test.

9.3 for loops

for loops allow you to repeat a set of commands a given number of times. In the generic example below, i takes each value of 1:3 in turn, moving on to the next value at the end of each pass through the loop:


for (i in 1:3) {
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3

NB. We need to use print() inside a for loop to explicitly print a value to the screen.
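To see why print() matters, here is a small sketch (not part of the workshop material) that captures what each loop writes to the console. capture.output() is a base R function; the object names silent and shown are placeholders chosen for this illustration.

```r
# Without print(): the value of i is evaluated but never displayed
silent <- capture.output(
  for (i in 1:3) {
    i
  }
)

# With print(): each value is written to the console
shown <- capture.output(
  for (i in 1:3) {
    print(i)
  }
)

length(silent)  # no lines of output were produced
length(shown)   # one line of output per iteration
```

At the top level of a script or the console, typing i on its own displays its value automatically. Inside a loop that automatic printing does not happen, which is why the explicit call is needed.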

Take the following built-in character vector:

LETTERS
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
#> [15] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

In the generic example below, i is used as the index position for LETTERS:


for (i in 1:3) {
  print(LETTERS[i])
}
#> [1] "A"
#> [1] "B"
#> [1] "C"

Contrast the two examples above: in the first, i was used as the value itself; in the second, it was used as the index position. We can combine both uses in a single loop:

for (i in 1:3) {
  print(paste(LETTERS[i],"is at position",i))
}
#> [1] "A is at position 1"
#> [1] "B is at position 2"
#> [1] "C is at position 3"

This syntax is greatly simplified using the glue package. Here the function glue() accepts a character string as input. Within the string, in-line R code is specified using curly brackets {}, e.g. {LETTERS[i]} and {i}.

library(glue)

for (i in 1:3) {
  print(glue("{LETTERS[i]} is at position {i}"))
}
#> A is at position 1
#> B is at position 2
#> C is at position 3
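The loops so far hard-code the range 1:3. As a hedged aside, the sketch below shows two common alternatives in base R: building the index from the vector itself with seq_along(), and looping over the elements directly. The object names first_letters and empty are placeholders for this illustration.

```r
first_letters <- LETTERS[1:3]

# seq_along() builds the index 1, 2, 3 from the vector itself
for (i in seq_along(first_letters)) {
  print(paste(first_letters[i], "is at position", i))
}

# Loop over the elements directly when the position is not needed
for (letter in first_letters) {
  print(letter)
}

# 1:length(x) misbehaves on an empty vector; seq_along(x) does not
empty <- character(0)
1:length(empty)    # counts down from 1 to 0, giving two iterations
seq_along(empty)   # an empty index, so a loop would not run at all
```

Using seq_along() means the loop adapts automatically if the vector changes length, which becomes useful once we loop over a list of files.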

This may seem like a trivial example, but looping is quite powerful when you need to repeat a given task several times. Consider the five files beginning with p53 in the ./data/peptides/ directory of the data provided with this course. Each file contains two columns of data. If we wanted to combine all five files into one data frame, we may naively do something like the following:

# using readr functions:
peptides1 <- read_tsv("data/peptides/p53_noacid_100_E13.txt")
peptides2 <- read_tsv("data/peptides/p53_noacid_100_E15.txt")
# and so on, creating a data frame for each file....

After creating 5 separate data frames, we could then merge them all into a single data frame. This is not only extremely tedious, but also error-prone. Imagine if we had 1000 files! Instead, we can use a for loop to read in each file and build a cumulative data frame.
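The idea of building a cumulative data frame can be sketched on toy data held in memory, without reading any files. Everything below is made up for illustration: chunks stands in for the data frames that would come from individual files, and the values are arbitrary.

```r
# Hypothetical stand-ins for data frames read from three files
chunks <- list(
  data.frame(mz = c(900.1, 950.2),   intensity = c(10, 20)),
  data.frame(mz = 1000.3,            intensity = 30),
  data.frame(mz = c(1100.4, 1200.5), intensity = c(40, 50))
)

# Start with an empty object, then append one chunk per iteration
combined <- NULL
for (i in seq_along(chunks)) {
  combined <- rbind(combined, chunks[[i]])
}

dim(combined)
#> [1] 5 2
```

Growing an object inside a loop is easy to read but can become slow for very large numbers of pieces; for the handful of files used here it is perfectly adequate.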

Exercise 9.1 Create a for loop that reads in all files in the workshop data/peptides folder, one after another, continuously building a single data frame from the individual data frames.

Use the following functions:

read_tsv() (or base read.delim())
list.files()
bind_rows() (or base rbind()), and
A for loop.

Save your data frame as an object called peptides. It should have the following properties:

names(peptides)
#> [1] "mz"        "intensity"
dim(peptides)
#> [1] 55  2

Exercise 9.2 This is a good start, but there is no way to identify which file each value in the cumulative peptides data frame comes from. Therefore, we need to add a third column (named variables) to each data frame before merging it with the compiled peptides data frame. peptides$variables should contain the name of the source file for every observation. Use rep() and nrow().

peptides should now have the following properties:

names(peptides)
#> [1] "mz"        "intensity" "variables"
dim(peptides)
#> [1] 55  3

Exercise 9.3 The file names are composed of a combination of four variables, separated by an underscore. Use the separate() function to split the file name variable in the data frame into four separate variables.

peptides should now have the following properties:

names(peptides)
#> [1] "mz"        "intensity" "var1"      "var2"     
#> [5] "var3"      "var4"
dim(peptides)
#> [1] 55  6

Exercise 9.4 Using for loops, import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.

9.3.1 while loops

while loops are a variant of for loops, except that in this case, they repeat commands while a certain condition is true. However, be cautious! If the condition never becomes false, your script will never exit the loop. While loops have the generic structure of:

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
#> [1] 1
#> [1] 2
#> [1] 3
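One way to guard against a loop that never exits is to count the iterations and stop once a limit is reached. The sketch below is an illustration only; max_iter and its value are assumptions made for this example, not part of the workshop code.

```r
x <- 100
n_iter <- 0
max_iter <- 50   # hypothetical safety limit

# Halve x until it drops to 1 or below
while (x > 1) {
  x <- x / 2
  n_iter <- n_iter + 1

  # Safety net: leave the loop if it runs for too long
  if (n_iter >= max_iter) {
    warning("Stopped after max_iter iterations")
    break
  }
}

n_iter
#> [1] 7
```

Here the condition becomes false after seven halvings, so the safety limit is never reached; it only matters if the condition is written incorrectly or the data behave unexpectedly.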

9.4 Reiterations with the tidyverse

In the tidyverse family of packages, there are two packages which come in handy for reiteration: vroom and purrr.

The vroom package is designed specifically to make it easier to import tabular data into R using the main function vroom().

# load package
library(vroom)

# Get file names, with full path
files <- list.files("data/peptides/", full.names = TRUE)

all_peptides <- vroom(files)
names(all_peptides)
#> [1] "mz"        "intensity"
dim(all_peptides)
#> [1] 55  2

We can conveniently add an id column, here called ID, which will be populated by the names of the actual files.

all_peptides <- vroom(files, id = "ID")
names(all_peptides)
#> [1] "ID"        "mz"        "intensity"
dim(all_peptides)
#> [1] 55  3

We can then use some convenient functions to clean up the names if they contain information, like basename(), which isolates the name of the file, and separate(), as we saw earlier.

files %>% 
  vroom(id = "ID") %>% 
  mutate(ID = basename(ID)) %>% 
  separate(ID, c("var1", "var2", "var3", "var4", NA)) -> all_peptides

names(all_peptides)
#> [1] "var1"      "var2"      "var3"      "var4"     
#> [5] "mz"        "intensity"
dim(all_peptides)
#> [1] 55  6
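The NA at the end of the column names deserves a short explanation. By default, separate() splits at every run of non-alphanumeric characters, so both the underscores and the dot before the file extension act as separators; an NA in the list of new names tells separate() to discard that piece. The toy data frame below is made up for illustration and uses shorter names than the workshop files.

```r
library(tidyr)

# Hypothetical file names with two variables and an extension
toy <- data.frame(ID = c("a_b.txt", "c_d.txt"), value = 1:2)

# "a_b.txt" is split into "a", "b" and "txt";
# the NA discards the third piece (the extension)
toy_split <- separate(toy, ID, c("first", "second", NA))

names(toy_split)
#> [1] "first"  "second" "value"
```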

Exercise 9.5 Using vroom(), import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.

The purrr package is one of the core packages of the tidyverse and has two main groups of functions:

map() - For reiterating using functions that return an output
walk() - For reiterating using functions for their side effects.
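The difference between the two groups can be sketched with a pair of toy calls. The inputs and object names below are made up for illustration.

```r
library(purrr)

# map() collects the return value of each call in a list
squares <- map(1:3, function(x) x^2)
str(squares)

# walk() calls the function for its side effect (here, printing)
# and invisibly returns its input unchanged
returned <- walk(c("a", "b"), print)
identical(returned, c("a", "b"))
#> [1] TRUE
```

Typical side effects are printing, plotting, or writing files, where we care about what the function does and not about what it returns.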

Let’s begin with map and try to read in our five files, whose names are stored in files. To send each piece of input to a function where the first argument is our target, we can just state the function name, as such:

files %>% 
  map(read_tsv)

But often the item needs to go somewhere else. In such a case, we can specify exactly where using the ~ operator, which can be read as "described by", as in statistical formulas, and then using the . placeholder where the input value should go, as such:

files %>% 
  map(~ read_tsv(.)) -> purrr_peptides

# glimpse(purrr_peptides)
typeof(purrr_peptides)
#> [1] "list"
length(purrr_peptides)
#> [1] 5

Each element of the input (in this case, a character vector) is used one at a time as the argument for the function read_tsv(). This is really convenient, but the output is a list. There are many variations on both map() and walk(); here we need one of the map_*() variants, which specify what type of object the output should be. In our case it's a data frame, thus map_df().

files %>% 
  map_df(read_tsv) -> purrr_peptides

names(purrr_peptides)
#> [1] "mz"        "intensity"
dim(purrr_peptides)
#> [1] 55  2
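To make the idea of typed variants concrete, here is a small sketch using a made-up character vector. Each variant returns a different kind of object from the same style of call.

```r
library(purrr)

words <- c("liver", "testes", "heart")

as_list <- map(words, nchar)           # a list of three integers
as_int  <- map_int(words, nchar)       # an integer vector
as_chr  <- map_chr(words, toupper)     # a character vector
as_lgl  <- map_lgl(words, function(w) w == "heart")  # a logical vector

as_int
#> [1] 5 6 5
```

If a function returns something that does not fit the requested type, the typed variants stop with an error, which is a useful check in itself. Note that from purrr 1.0.0 onwards map_df() is marked as superseded in favour of map() followed by list_rbind(); it continues to work, so the code in this chapter is unaffected.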

We can also include information about the iteration by using the .id argument:

files %>% 
  map_df(read_tsv, .id = "ID") -> purrr_peptides

names(purrr_peptides)
#> [1] "ID"        "mz"        "intensity"
dim(purrr_peptides)
#> [1] 55  3

We don’t get information about the file, just the iteration cycle as a character.

glimpse(purrr_peptides)
#> Rows: 55
#> Columns: 3
#> $ ID        <chr> "1", "1", "1", "1", "1", "1", "1", "1", …
#> $ mz        <dbl> 907.2918, 982.6067, 1051.7055, 1439.8366…
#> $ intensity <dbl> 99.30016, 69.13245, 184.84150, 111.52148…

Thus, we have to do the work to fill in the values. The easiest way to do this is to convert the ID variable to a factor and then set its labels to the names of the files.

files %>% 
  map_df(read_tsv, .id = "ID") %>% 
  mutate(ID = factor(ID, labels = basename(files))) -> purrr_peptides

names(purrr_peptides)
#> [1] "ID"        "mz"        "intensity"
dim(purrr_peptides)
#> [1] 55  3
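An alternative worth knowing, offered here as a sketch and not as the workshop's method: when the input vector has names, .id stores those names in place of the iteration number. The example below writes two made-up files to a temporary folder so that it runs on its own; the file names, the values, and the use of base read.delim() are assumptions for this illustration. map_df() needs the dplyr package to be installed.

```r
library(purrr)

# Write two hypothetical tab-separated files to a fresh temporary folder
tmp <- tempfile("toy_peptides_")
dir.create(tmp)
write.table(data.frame(mz = c(1, 2), intensity = c(10, 20)),
            file.path(tmp, "a_x.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(mz = 3, intensity = 30),
            file.path(tmp, "b_y.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

toy_files <- list.files(tmp, full.names = TRUE)

# Name each path by its file name; .id then records the names
toy <- toy_files %>%
  set_names(basename(toy_files)) %>%
  map_df(read.delim, .id = "ID")

toy$ID
#> [1] "a_x.txt" "a_x.txt" "b_y.txt"
```

This avoids relying on the order of the factor levels matching the order of the files.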

As above, we can use separate to get individual columns:

files %>% 
  map_df(read_tsv, .id = "ID") %>% 
  mutate(ID = factor(ID, labels = basename(files))) %>% 
  separate(ID, c("var1", "var2", "var3", "var4", NA)) -> purrr_peptides

names(purrr_peptides)
#> [1] "var1"      "var2"      "var3"      "var4"     
#> [5] "mz"        "intensity"
dim(purrr_peptides)
#> [1] 55  6

So that’s a lot more work than what was needed with vroom, but you can imagine that it’s much more flexible! If we want to do something to specific parts of a data frame, then we first need to split it into a list of parts. There are a few ways to do this. In base R we can use:

# Base R:
mt_split <- split(mtcars, mtcars$cyl)

# Alternatively:
# mtcars %>%
#   split(.$cyl) -> mt_split

# output
# glimpse(mt_split)
typeof(mt_split)
#> [1] "list"
length(mt_split)
#> [1] 3

In the tidyverse syntax, we can do something similar:

# Using tidyverse functions
mtcars %>%
  group_split(cyl) -> mt_split

# output
glimpse(mt_split)
#> list<tibble[,17]> [1:3] 
#> $ : tibble [11 × 17] (S3: tbl_df/tbl/data.frame)
#> $ : tibble [7 × 17] (S3: tbl_df/tbl/data.frame)
#> $ : tibble [14 × 17] (S3: tbl_df/tbl/data.frame)
#> @ ptype: tibble [0 × 17] (S3: tbl_df/tbl/data.frame)

Now when we use map(), each subset will be treated independently:

# Note the ~ when using lm() to position the input
mtcars %>%
  group_split(cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
#> [1] 0.5086326 0.4645102 0.4229655
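The last step, map_dbl("r.squared"), uses a shorthand: when a string is given in place of a function, purrr extracts the element with that name from each item. A minimal sketch on a made-up list (the values are arbitrary and not model output):

```r
library(purrr)

# Hypothetical list of results, each with the same named components
fits <- list(
  list(r.squared = 0.9, n = 10),
  list(r.squared = 0.5, n = 20)
)

# A string is shorthand for "extract the element with this name"
by_name <- map_dbl(fits, "r.squared")

# The same extraction written out in full
by_function <- map_dbl(fits, function(fit) fit[["r.squared"]])

identical(by_name, by_function)
#> [1] TRUE
```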

Exercise 9.6 Using map() (and variations), import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.

9.5 if statements

if statements, including else statements, allow your script to proceed based on the result of a logical expression. For example:

x <- 1
if (x <= 1) {
  print("Ready to proceed.")
} else {
  print("Error.")
}
#> [1] "Ready to proceed."
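When there are more than two possible outcomes, further conditions can be chained with else if. The values below are made up for illustration; only the first condition that evaluates to TRUE is acted on.

```r
x <- 0

if (x < 0) {
  status <- "negative"
} else if (x == 0) {
  status <- "zero"
} else {
  status <- "positive"
}

status
#> [1] "zero"
```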

Exercise 9.7 Returning to the for loop from the previous section, suppose that we need a quality-control step that rejects data sets containing more than 10% incomplete observations. Insert an if statement into the for loop which checks for this. If the data frame passes our quality check, add it to the final peptides data frame; if it doesn’t, print an error message to the screen.

Use an if statement, nrow(), sum(), print() and paste().

You should have the following message printed to the screen:

#> [1] "Error in file data/peptides/p53_noacid_100_E15.txt"

If you completed all the exercises on the preceding page, your data frame should have the following dimensions:

dim(peptides)
#> [1] 44  6

There is one further extension of if statements, the ifelse() function:

foo3 <- c("Liver", "Brain", "Testes",
          "Muscle", "Intestine", "Heart")

ifelse(foo3 %in% c("Heart", "Liver"), "yes", "no")
#> [1] "yes" "no"  "no"  "no"  "no"  "yes"
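The key difference from an if statement is that ifelse() tests every element of a vector at once, whereas if expects a single TRUE or FALSE. The sketch below, on made-up numbers, shows that ifelse() gives the same result as a for loop containing an if statement.

```r
scores <- c(2, 7, 4, 9)

# ifelse() tests every element at once
vectorised <- ifelse(scores > 5, "high", "low")

# The same result, built one element at a time with for and if
looped <- character(length(scores))
for (i in seq_along(scores)) {
  if (scores[i] > 5) {
    looped[i] <- "high"
  } else {
    looped[i] <- "low"
  }
}

identical(vectorised, looped)
#> [1] TRUE
```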

This can, of course, be used to transform a data frame with mutate():

tibble(tissue = c("Liver", "Brain", "Testes",
                  "Muscle", "Intestine", "Heart")) %>% 
  mutate(germ_layer = ifelse(tissue %in% c("Heart", "Muscle", "Testes"), "processed", "unavailable"))
#> # A tibble: 6 × 2
#>   tissue    germ_layer 
#>   <chr>     <chr>      
#> 1 Liver     unavailable
#> 2 Brain     unavailable
#> 3 Testes    processed  
#> 4 Muscle    processed  
#> 5 Intestine unavailable
#> 6 Heart     processed

9.6 Simulation Challenge

As a challenge, test your abilities with control structures by trying to solve the following puzzle.

Given that a prime number is a whole number, greater than 1, which can only be evenly divided by itself or 1, find the group of four prime numbers such that:

The sum of any combination of three numbers is also a prime number, and,
The sum of all four numbers is as small as possible.

One solution is to begin with the smallest four prime numbers and test if requirement 1 is fulfilled. If this is not the case, then we can begin taking larger prime numbers. This would work, but is a bit cumbersome because you would then have to test the other 4-number combinations which may have smaller group sums. An easier solution is to use for loops and if statements to simulate a large data-set containing many combinations of four prime numbers and test each for its ability to fulfill the two requirements of the puzzle.

To help you solve this, I have provided an outline for you to follow. We will take the following strategy:

Make an object containing prime numbers. We can limit ourselves to the first 16 prime numbers (up to 53). The schoolmath::primes() function provides a starting point⁷:

library(schoolmath)
pNumbs <- primes(1, 53)[-1]
pNumbs
#>  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53
# Score to beat
result <- 4*53

Here, I have also created a result object, which contains the highest possible sum. This is the “score to beat,” and will be replaced by the sum of the four sampled numbers, but only if that sum is less than the stored value.

Randomly sample four numbers from our pool of 16 numbers⁸.
Determine⁹ if the sum of each combination of three numbers is also a prime number.
If¹⁰ all four 3-number sums are prime and the four numbers have the lowest sum found so far, then they are saved as a new object called solution. The sum of these four numbers is the new “score to beat” (i.e. overwrite the result object).
Continue up to 10000 iterations, overwriting solution if a better four-number combination is found.

Your solution should be: 5, 7, 17, 19.
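Once you have a candidate, you can check it against requirement 1. The sketch below does this for the stated solution using base R only; is_prime() is a hypothetical helper written for this illustration (simple trial division) and is not a function from schoolmath.

```r
# Primality test by trial division (hypothetical helper, base R only)
is_prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)            # 2 and 3 are prime
  all(n %% 2:floor(sqrt(n)) != 0)    # no divisor between 2 and sqrt(n)
}

candidate <- c(5, 7, 17, 19)

# combn() lists every 3-number combination and applies sum() to each
triple_sums <- combn(candidate, 3, FUN = sum)
triple_sums
#> [1] 29 31 41 43

# Requirement 1: every 3-number sum is prime
all(sapply(triple_sums, is_prime))
#> [1] TRUE

# The four-number sum that a better solution would have to beat
sum(candidate)
#> [1] 48
```

This only confirms that the candidate satisfies requirement 1; showing that no group with a smaller sum exists is what the simulation is for.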
