9 Control Structures & Reiterations
9.1 Learning Objectives
By the end of this chapter you should:
- Understand the two main types of control structures: conditional statements and reiterations
- Be able to manage multiple files in an efficient way
9.2 Control Structures
Control structures are a core feature of every programming and scripting language. They allow you to repeat or execute a set of commands given the outcome of a specific condition (e.g. a logical expression). Control structures are used when you assemble a series of commands into a script. We will take a look at two of the most common structures and give a practical example with a data-set provided for the workshop.
- for loops will repeat commands a fixed number of times.
- if statements will execute commands based on the outcome of a logical test.
9.3 for loops
for loops allow you to repeat a set of commands a given number of times. In the generic example below, i takes on each value in 1:3 in turn, starting at 1 and increasing by one with every pass through the loop:
for (i in 1:3) {
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3
NB. We need to use print() inside a for loop to explicitly print a value to the screen.
Take the following built-in character vector:
LETTERS
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
#> [15] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
In the generic example below, i is used as the index position for LETTERS:
for (i in 1:3) {
  print(LETTERS[i])
}
#> [1] "A"
#> [1] "B"
#> [1] "C"
Contrast the two examples above: in the first, i was used as the value itself; in the second, it was used as an index position. Both uses can be combined:
for (i in 1:3) {
  print(paste(LETTERS[i], "is at position", i))
}
#> [1] "A is at position 1"
#> [1] "B is at position 2"
#> [1] "C is at position 3"
This syntax is greatly simplified using the glue package. Here the function glue() accepts a character string as input. Within the string, in-line R code is specified using curly brackets {}, e.g. {LETTERS[i]} and {i}.
library(glue)
for (i in 1:3) {
  print(glue("{LETTERS[i]} is at position {i}"))
}
#> A is at position 1
#> B is at position 2
#> C is at position 3
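The loops so far hard-code the range 1:3. When a loop should cover a whole vector, base R's seq_along() generates the index positions for you. The following is a minimal sketch; the tissues vector is made up for illustration and is not part of the workshop data.

```r
# Hypothetical example vector; any vector works the same way
tissues <- c("Liver", "Brain", "Heart")

# seq_along() returns 1, 2, ..., length(tissues)
for (i in seq_along(tissues)) {
  print(paste(tissues[i], "is at position", i))
}

# With an empty vector, seq_along() yields no index positions at all,
# whereas 1:length() counts down from 1 to 0
empty <- character(0)
seq_along(empty)
#> integer(0)
1:length(empty)
#> [1] 1 0
```

Because seq_along() returns nothing for an empty vector, the body of such a loop is simply skipped instead of running with invalid index positions.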
This may seem like a trivial example, but looping is quite powerful when you need to repeat a given task several times. Consider the five files beginning with p53 in the ./data/peptides/ directory of the data provided with this course. Each file contains two columns of data. If we wanted to combine all five files into one data frame, we might naively do something like the following:
# using readr functions:
library(readr)
peptides1 <- read_tsv("data/peptides/p53_noacid_100_E13.txt")
peptides2 <- read_tsv("data/peptides/p53_noacid_100_E15.txt")
# and so on, creating a data frame for each file....
After creating 5 separate data frames, we could then merge them all into a single data frame. This is not only extremely tedious, but also error-prone. Imagine what would happen if we had 1000 files! Instead we can use a for loop to read in each file and build a cumulative data frame.
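The general idea of building a cumulative data frame can be sketched without touching any files. The example below is only an illustration of the pattern: the chunks list holds made-up data frames standing in for what would normally be read from disk, and it uses base rbind() rather than any particular import function.

```r
# Made-up stand-ins for data frames that would normally be read from files
chunks <- list(
  data.frame(mz = c(907.3, 982.6),   intensity = c(99.3, 69.1)),
  data.frame(mz = 1051.7,            intensity = 184.8),
  data.frame(mz = c(1439.8, 1500.2), intensity = c(111.5, 42.0))
)

# Start with an empty object and append one data frame per iteration
combined <- NULL
for (i in seq_along(chunks)) {
  combined <- rbind(combined, chunks[[i]])
}

dim(combined)
#> [1] 5 2
```

Growing an object inside a loop is easy to read, although it can become slow when there are very many pieces.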
Exercise: Write a for loop that reads in all files in the workshop data/peptides folder, one after another, continuously building a single data frame from the individual data frames.
Use the following functions:
- read_tsv() (or base read.delim())
- list.files()
- bind_rows() (or base rbind()), and
- A for loop.
Save your data frame as an object called peptides. It should have the following properties:
Exercise: As it stands, we cannot tell which file each observation in the peptides data frame comes from. Therefore, we need to add a third column (named variables) to each data frame before merging it with the compiled peptides data frame. peptides$variables should contain the name of the source file for every observation. Use rep() and nrow().
peptides should now have the following properties:
Exercise: Use the separate() function to split the file name variable in the data frame into four separate variables.
peptides should now have the following properties:
names(peptides)
#> [1] "mz" "intensity" "var1" "var2"
#> [5] "var3" "var4"
dim(peptides)
#> [1] 55 6
Exercise: Using for loops, import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.
9.3.1 while loops
while loops are a variant of for loops, except that they repeat commands while a certain condition is true. However, be cautious! If the condition never becomes false, your script will never exit the loop. while loops have the generic structure of:
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
#> [1] 1
#> [1] 2
#> [1] 3
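One common way to guard against a loop that never exits is to add an iteration cap to the condition. The sketch below is illustrative only: the halving task, the threshold and the max_iterations limit are all assumptions chosen for the example.

```r
# Hypothetical task: halve a value until it drops below a threshold
value <- 100
threshold <- 1

iterations <- 0
max_iterations <- 1000  # safety cap (an assumed, arbitrary limit)

# The loop stops when the value is small enough OR the cap is reached
while (value >= threshold && iterations < max_iterations) {
  value <- value / 2
  iterations <- iterations + 1
}

iterations
#> [1] 7
value
#> [1] 0.78125
```

If the cap is ever reached, that is usually a sign that the stopping condition needs another look.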
9.4 Reiterations with the tidyverse
In the tidyverse family of packages, there are two packages which come in handy for reiteration: vroom and purrr.
The vroom package is designed specifically to make it easier to import tabular data into R using the main function vroom().
# load package
library(vroom)
# Get file names, with full path
files <- list.files("data/peptides/", full.names = TRUE)
all_peptides <- vroom(files)
names(all_peptides)
#> [1] "mz" "intensity"
dim(all_peptides)
#> [1] 55 2
We can conveniently add an id column, here called ID, which will be populated by the names of the actual files.
all_peptides <- vroom(files, id = "ID")
names(all_peptides)
#> [1] "ID" "mz" "intensity"
dim(all_peptides)
#> [1] 55 3
We can then use some convenient functions to clean up the names if they contain information, like basename(), which isolates the name of the file, and separate(), as we saw earlier.
files %>%
vroom(id = "ID") %>%
mutate(ID = basename(ID)) %>%
separate(ID, c("var1", "var2", "var3", "var4", NA)) -> all_peptides
names(all_peptides)
#> [1] "var1" "var2" "var3" "var4"
#> [5] "mz" "intensity"
dim(all_peptides)
#> [1] 55 6
Exercise: Using vroom(), import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.
The purrr package is one of the core packages of the tidyverse and has two main groups of functions:
- map() - For reiterating using functions that return an output
- walk() - For reiterating using functions for their side effects.
Let’s begin with map() and try to read in our five files, whose names are stored in files. To send each piece of input to a function where the first argument is our target, we can just state the function name, as such:
files %>%
map(read_tsv)
But often the item needs to go somewhere other than the first argument. In such a case, we can specify exactly where using the ~ operator, which can be read as “described by”, as in statistics, and then using the . placeholder where the input value should go, as such:
files %>%
map(~ read_tsv(.)) -> purrr_peptides
# glimpse(purrr_peptides)
typeof(purrr_peptides)
#> [1] "list"
length(purrr_peptides)
#> [1] 5
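In the example above the placeholder still sits in the first argument position. A small sketch where the input really does go somewhere else may help; here each element is passed to the times argument of base R's rep(), which is its second argument. The object name reps is made up for illustration.

```r
library(purrr)

# Each element of 1:3 is passed to `times`, not to the first argument of rep()
1:3 %>%
  map(~ rep("x", times = .)) -> reps

lengths(reps)
#> [1] 1 2 3
reps[[3]]
#> [1] "x" "x" "x"
```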
Each element of the input, in this case a character vector of file names, is used one at a time as the argument for the function read_tsv(). This is really convenient, but the output is a list. There are many variations on both map() and walk(); we need one of the map_*() variants, which specify what type of object the output should be. In our case it’s a data frame, thus map_df().
files %>%
map_df(read_tsv) -> purrr_peptides
names(purrr_peptides)
#> [1] "mz" "intensity"
dim(purrr_peptides)
#> [1] 55 2
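To see how the map_*() variants differ in what they return, here is a minimal sketch on a small made-up list (measurements is not part of the workshop data). Each variant applies the function to every element but returns a different type of object.

```r
library(purrr)

# A small made-up list of numeric vectors
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

as_list <- map(measurements, mean)          # a list of length 3
means   <- map_dbl(measurements, mean)      # a named double vector
sizes   <- map_int(measurements, length)    # a named integer vector
labels  <- map_chr(measurements, ~ paste(., collapse = "-"))  # a named character vector

means
#>  a  b  c
#>  2 15  5
sizes
#> a b c
#> 3 2 1
```

Note that in purrr 1.0.0 and later, map_df() is marked as superseded in favour of map() followed by list_rbind(), although it continues to work.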
We can also include information about the iteration by using the .id
argument:
files %>%
map_df(read_tsv, .id = "ID") -> purrr_peptides
names(purrr_peptides)
#> [1] "ID" "mz" "intensity"
dim(purrr_peptides)
#> [1] 55 3
We don’t get information about the file, just the iteration cycle as a character.
glimpse(purrr_peptides)
#> Rows: 55
#> Columns: 3
#> $ ID <chr> "1", "1", "1", "1", "1", "1", "1", "1", …
#> $ mz <dbl> 907.2918, 982.6067, 1051.7055, 1439.8366…
#> $ intensity <dbl> 99.30016, 69.13245, 184.84150, 111.52148…
Thus, we have to do the work to fill in the values. The easiest way to do this is to convert the ID variable to a factor and then set its labels to the names of the files.
files %>%
  map_df(read_tsv, .id = "ID") %>%
  mutate(ID = factor(ID, labels = basename(files))) -> purrr_peptides
names(purrr_peptides)
#> [1] "ID" "mz" "intensity"
dim(purrr_peptides)
#> [1] 55 3
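Relabelling a factor works for our five files, but it relies on the sort order of the character values "1", "2", and so on; with ten or more files, "10" would sort before "2" and the labels would no longer line up. An alternative, sketched below under stated assumptions, is to name the input vector so that the names themselves end up in the ID column. The temporary files, the demo_ object names and the use of base read.delim() with dplyr's bind_rows() are all choices made for this illustration, not the course's own solution.

```r
library(purrr)
library(dplyr)

# Write two small made-up tab-separated files to a fresh temporary folder
tmp <- tempfile("id_demo_")
dir.create(tmp)
write.table(data.frame(mz = c(1, 2), intensity = c(10, 20)),
            file.path(tmp, "sampleA.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(mz = 3, intensity = 30),
            file.path(tmp, "sampleB.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

demo_files <- list.files(tmp, full.names = TRUE)

# Name each path by its file name; bind_rows() copies the names into ID
demo_files %>%
  set_names(basename(demo_files)) %>%
  map(read.delim) %>%
  bind_rows(.id = "ID") -> demo_peptides

demo_peptides$ID
#> [1] "sampleA.txt" "sampleA.txt" "sampleB.txt"
```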
As above, we can use separate to get individual columns:
files %>%
map_df(read_tsv, .id = "ID") %>%
mutate(ID = factor(ID, labels = basename(files))) %>%
separate(ID, c("var1", "var2", "var3", "var4", NA)) -> purrr_peptides
names(purrr_peptides)
#> [1] "var1" "var2" "var3" "var4"
#> [5] "mz" "intensity"
dim(purrr_peptides)
#> [1] 55 6
So that’s a lot more work than what was needed with vroom, but you can imagine that it’s much more flexible! If we want to do something to specific parts of a data frame, then we’ll need to first split it into a list of parts. There are a few ways to do this. In base R we can use:
# Base R:
mt_split <- split(mtcars, mtcars$cyl)
# Alternatively:
# mtcars %>%
# split(.$cyl) -> mt_split
# output
# glimpse(mt_split)
typeof(mt_split)
#> [1] "list"
length(mt_split)
#> [1] 3
In the tidyverse syntax, we can do something similar:
# Using tidyverse functions
mtcars %>%
group_split(cyl) -> mt_split
# output
glimpse(mt_split)
#> list<tibble[,17]> [1:3]
#> $ : tibble [11 × 17] (S3: tbl_df/tbl/data.frame)
#> $ : tibble [7 × 17] (S3: tbl_df/tbl/data.frame)
#> $ : tibble [14 × 17] (S3: tbl_df/tbl/data.frame)
#> @ ptype: tibble [0 × 17] (S3: tbl_df/tbl/data.frame)
Now when we use map(), each subset will be treated independently:
# Note the ~ when using lm() to position the input
mtcars %>%
  group_split(cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
#> [1] 0.5086326 0.4645102 0.4229655
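The walk() family, mentioned at the start of this section, is meant for functions that are called for their side effects, such as writing files. The sketch below is only an illustration: it uses iwalk(), a variant that also passes the name of each list element, and writes to a temporary folder (out_dir is a made-up name) so that nothing lands in your project.

```r
library(purrr)

# Temporary output folder, so nothing is written into the project
out_dir <- tempfile("cyl_")
dir.create(out_dir)

# split() returns a list named by the number of cylinders ("4", "6", "8");
# iwalk() passes each data frame as .x and its name as .y
mtcars %>%
  split(.$cyl) %>%
  iwalk(~ write.csv(.x, file.path(out_dir, paste0("cyl_", .y, ".csv"))))

list.files(out_dir)
#> [1] "cyl_4.csv" "cyl_6.csv" "cyl_8.csv"
```

Unlike map(), the walk functions return their input invisibly, so nothing is printed to the screen.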
Exercise: Using map() (and variations), import the contents of the data files stored in the folder corresponding to the challenge project you chose at the beginning of the course. The names of the files are not necessary to identify where the data comes from, but you can challenge yourself by incorporating them into the data set and confirming that they match the existing columns.
9.5 if statements
if statements, including else statements, allow your script to proceed based on the result of a logical expression. For example:
x <- 1
if (x <= 1) {
  print("Ready to proceed.")
} else {
  print("Error.")
}
#> [1] "Ready to proceed."
Exercise 9.7 Returning to the for loop from the previous section, suppose that we needed to do a quality control for data-sets containing more than 10% incomplete observations. Insert an if statement into the for loop which checks for this. If the data frame passes our quality check, then add it to the final peptides data frame; if it doesn’t, then print an error message to the screen.
#> [1] "Error in file data/peptides/p53_noacid_100_E15.txt"
If you completed all the exercises on the preceding page, your data frame should have the following dimensions:
dim(peptides)
#> [1] 44 6
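As a hint for the quality check, one possible way to measure the fraction of incomplete observations is base R's complete.cases(). This sketch assumes that "incomplete" means a row containing at least one NA, and it uses a made-up data frame (qc_demo) rather than the workshop files.

```r
# Made-up data frame with missing values (NA) in two of its five rows
qc_demo <- data.frame(
  mz        = c(907.3, NA, 1051.7, 1439.8, 1500.2),
  intensity = c(99.3, 69.1, NA, 111.5, 42.0)
)

# complete.cases() is TRUE for rows without any NA;
# the mean of the negated result is the fraction of incomplete rows
incomplete_fraction <- mean(!complete.cases(qc_demo))
incomplete_fraction
#> [1] 0.4
```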
There is one further extension of if statements, the ifelse() function:
foo3 <- c("Liver", "Brain", "Testes",
"Muscle", "Intestine", "Heart")
ifelse(foo3 %in% c("Heart", "Liver"), "yes", "no")
#> [1] "yes" "no" "no" "no" "no" "yes"
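It may help to see what ifelse() saves us from writing. The sketch below, using a shortened made-up tissues vector, produces the same result twice: once element by element with a for loop and an if statement, and once with a single vectorised ifelse() call.

```r
tissues <- c("Liver", "Brain", "Heart")

# Element-by-element version using a for loop and if/else
in_set_loop <- character(length(tissues))
for (i in seq_along(tissues)) {
  if (tissues[i] %in% c("Heart", "Liver")) {
    in_set_loop[i] <- "yes"
  } else {
    in_set_loop[i] <- "no"
  }
}

# Vectorised version: one call handles every element
in_set_vec <- ifelse(tissues %in% c("Heart", "Liver"), "yes", "no")

identical(in_set_loop, in_set_vec)
#> [1] TRUE
```

An if statement tests a single logical value, whereas ifelse() tests every element of a vector at once.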
This can, of course, be used to transform a data frame with mutate():
tibble(tissue = c("Liver", "Brain", "Testes",
"Muscle", "Intestine", "Heart")) %>%
mutate(germ_layer = ifelse(tissue %in% c("Heart", "Muscle", "Testes"), "processed", "unavailable"))
#> # A tibble: 6 × 2
#> tissue germ_layer
#> <chr> <chr>
#> 1 Liver unavailable
#> 2 Brain unavailable
#> 3 Testes processed
#> 4 Muscle processed
#> 5 Intestine unavailable
#> 6 Heart processed
9.6 Simulation Challenge
As a challenge, test your abilities with control structures by trying to solve the following puzzle.
Given that a prime number is a whole number, greater than 1, which can only be evenly divided by itself or 1, find the group of four prime numbers such that:
- The sum of any combination of three numbers is also a prime number, and,
- The sum of all four numbers is as small as possible.
One solution is to begin with the smallest four prime numbers and test if requirement 1 is fulfilled. If this is not the case, then we can begin taking larger prime numbers. This would work, but is a bit cumbersome because you would then have to test the other 4-number combinations which may have smaller group sums. An easier solution is to use for loops and if statements to simulate a large data-set containing all permutations of four prime numbers and test each for its ability to fulfill the two requirements of the puzzle.
To help you solve this, I have provided an outline for you to follow. We will take the following strategy:
- Make an object containing prime numbers. We can limit ourselves to the first 16 prime numbers (up to 53). The schoolmath::primes() function provides a starting point:
library(schoolmath)
pNumbs <- primes(1, 53)[-1]
pNumbs
#> [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53
# Score to beat
result <- 4*53
Here, I have also created a result object, which contains the highest possible sum. This is the “score to beat” and will be replaced by the sum of the four sampled numbers, but only if that sum is less than the stored value.
- Randomly sample four numbers from our pool of 16 numbers.
- Determine if the sum of each combination of three numbers is also a prime number.
- If all four 3-number combinations are prime and if they have the lowest four-number sum, then they are saved as a new object called solution. The sum of these four numbers is the new “score to beat” (i.e. overwrite the result object).
- Continue up to 10000 iterations, overwriting solution if a better four-number combination is found.
Your solution should be: 5, 7, 17, 19.
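The stated solution can be checked against requirement 1 with a few lines of base R. This is only a sketch of the checking step, not of the full simulation: is_prime() is a hypothetical helper written for this example (it is not part of schoolmath), and combn() is used to form the four 3-number sums.

```r
# Simple trial-division primality test (a hypothetical helper, not from schoolmath)
is_prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)  # 2 and 3 are prime
  all(n %% 2:floor(sqrt(n)) != 0)
}

candidate <- c(5, 7, 17, 19)

# combn() applies sum() to each of the four 3-number combinations
triple_sums <- combn(candidate, 3, FUN = sum)
triple_sums
#> [1] 29 31 41 43

all(sapply(triple_sums, is_prime))
#> [1] TRUE
sum(candidate)
#> [1] 48
```

The same check could be placed inside the loop of the simulation, applied to each randomly sampled set of four numbers.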