Chapter 21 Element 10: Control Structures

21.1 Learning Objectives

By the end of this chapter you should:

Know how to create basic custom functions
Understand the two main types of control structures: conditional statements and loops
Be able to manage multiple files in an efficient way

21.2 Defining New Functions

We can also write our own functions. You can get a lot done without them; however, they are useful for integrating repetitive actions into a longer script, or for executing complicated actions from within other functions. Returning to the linear equation from the first exercise, we can make a function called equation(), using the following notation:

equation <- function(x) {
  1.12 * x - 0.4
}

equation(xx)

# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56
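A function returns the value of the last expression it evaluates, which is why equation() needs no explicit return(). The following is a minimal sketch of that behaviour, assuming xx holds the integers 1 to 8 (as the output above suggests); equation_explicit() is a hypothetical name used only for comparison.

```r
# Assumption: xx is the vector 1:8 used in the earlier exercises.
xx <- 1:8

# Implicit return: the last evaluated expression is the result.
equation <- function(x) {
  1.12 * x - 0.4
}

# Explicit return: same behaviour, written out in full.
equation_explicit <- function(x) {
  y <- 1.12 * x - 0.4
  return(y)
}

# Both versions give the same values.
all.equal(equation(xx), equation_explicit(xx))
# [1] TRUE
```

Which style you use is largely a matter of taste; the implicit form is common for short functions.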

Exercise 21.1 (Writing a function) Using the example above as a model, write a function called lin that accepts three arguments, x, m and b, to calculate predicted values based on a $y=mx+b$ equation. Just stick with the values of m and b we’ve been using so far.

21.3 Scoping

Scoping refers to the set of rules that a programming language uses to find the value of an object. It’s exactly what happens when we call an object by name:

m

#  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# [11] FALSE FALSE FALSE FALSE FALSE

R looks in the global environment for an object called m and, when it finds it, prints it to the screen. Scoping also happens inside functions: if a name isn’t defined inside a function, R will progressively look one level up, until it reaches the global environment. This happens when the function is run, not when it’s created. That means that the output of a function can differ depending on objects outside its environment. That can be exactly what you want, but it can also be very dangerous!

plus <- function(x) {x + y}
plus(1) # error, y not found

# Error in plus(1): object 'y' not found

y <- 1
plus(10) # y taken from global environment!

# [1] 11

plus <- function(x, y = 10) {x + y}
plus(10) # y in global environment is masked

# [1] 20
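To make the run-time lookup concrete, here is a small sketch in which the same call gives two different results because a global object changed in between. The names shift_by and add_shift() are hypothetical and not part of the workshop material.

```r
# 'shift_by' is not an argument, so R looks it up outside the function
# each time the function is called.
shift_by <- 1
add_shift <- function(x) {
  x + shift_by
}

first <- add_shift(10)   # uses shift_by = 1

shift_by <- 100          # change the global object
second <- add_shift(10)  # same call, different result

c(first, second)
# [1]  11 110
```

Passing such values as arguments, ideally with defaults, keeps a function's result independent of whatever happens to be in the global environment.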

Exercise 21.2 (Creating functions) Specifying default argument values in the function definition allows a function to be used without having to define all arguments each time it is called. Re-write the lin() function from Exercise 21.1, giving m and b the values used previously as defaults. This way, the user can set these arguments, or not. You should be able to reproduce the following commands:

xx

# [1] 1 2 3 4 5 6 7 8

lin(xx)

# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56

lin(xx, 5, 60)

# [1]  65  70  75  80  85  90  95 100

lin(xx, b = 10)

# [1] 11 12 13 14 16 17 18 19

That’s all good, but this:

lin(xx, m = c(0,1.12))

# [1] -0.4  1.8 -0.4  4.1 -0.4  6.3 -0.4  8.6

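still doesn’t work as expected! R recycles the two values of m element-wise along xx, instead of computing one full set of predictions per slope. There are many solutions for this. We’ll take a look at a typical one using the apply family of functions. lapply() takes a list or a vector (in this case either xx or m2, the vector c(0, 1.12)) and a function (lin(), plus any additional arguments) as input.⁵⁶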

lapply(m2, lin, x = xx)

# [[1]]
# [1] -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
# 
# [[2]]
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56

# or, the other way around
lapply(xx, lin, m = m2)

# [[1]]
# [1] -0.40  0.72
# 
# [[2]]
# [1] -0.4  1.8
# 
# [[3]]
# [1] -0.4  3.0
# 
# [[4]]
# [1] -0.4  4.1
# 
# [[5]]
# [1] -0.4  5.2
# 
# [[6]]
# [1] -0.4  6.3
# 
# [[7]]
# [1] -0.4  7.4
# 
# [[8]]
# [1] -0.4  8.6

sapply() tries to simplify the output:

sapply(m2, lin, x = xx)

#      [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56
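Because sapply() decides on the output shape by itself, its result can change form when the input changes. Base R's vapply() is a stricter relative: you declare the expected shape of each result and R stops with an error if a result doesn't match. The sketch below assumes xx is 1:8 and m2 is c(0, 1.12), and uses one possible definition of lin() that is consistent with the outputs shown above.

```r
# Assumptions: xx, m2 and lin() as used in this section.
xx <- 1:8
m2 <- c(0, 1.12)
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}

# FUN.VALUE declares the shape of each result:
# a numeric vector as long as xx.
res <- vapply(m2, lin, FUN.VALUE = numeric(length(xx)), x = xx)

dim(res)
# [1] 8 2
```

The values are the same as those returned by sapply(m2, lin, x = xx); only the checking differs.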

Now, if we didn’t want to define a function outside of lapply() or sapply(), we could have created an anonymous function, i.e. a function without an explicit name.⁵⁷ Again, note that the m in our anonymous function masks the m in our environment, which remains unchanged; xx and b, by contrast, are looked up in the global environment.

sapply(m2, function(m) {m*xx+b})

#      [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56

This logic is the reason this also works:

sapply(m2, function(x) lin(xx, x))

#      [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56

We could also have taken a more complicated route and used a for loop, which we’ll return to in section 21.6.

for (i in m2) {
  print(lin(xx, i))
}

# [1] -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56

What happens when we have more than one b, the third argument? For example, what if b is both -0.4 and 10?

m * xx - 0.4

#  [1]  0.6  1.6  2.6  3.6  4.6  5.6  6.6 -0.4 -0.4 -0.4 -0.4 -0.4
# [13] -0.4 -0.4 -0.4

m * xx + 10

#  [1] 11 12 13 14 15 16 17 10 10 10 10 10 10 10 10

b2 <- c(-0.4, 10)

sapply(xx, lin, b = b2)

#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]  0.72  1.8    3  4.1  5.2  6.3  7.4  8.6
# [2,] 11.12 12.2   13 14.5 15.6 16.7 17.8 19.0

What about when we have many ms and many bs? We can nest our solutions, either as for loops or as lapply() functions.

# Many Ms and Bs:
for (i in m2) {
  for (j in b2) {
    print(lin(xx, i, j))
  }
}

lapply(m2, function(yy) sapply(xx, lin, m = yy, b = b2))
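As an alternative to nesting, the combinations of slopes and intercepts can be listed up front and processed in a single pass. The sketch below does this with base R's expand.grid() and mapply(). It is one possible approach, not the workshop's own solution, and it assumes the same xx, m2, b2 and a lin() definition consistent with the outputs above.

```r
# Assumptions: xx, m2, b2 and lin() as used in this section.
xx <- 1:8
m2 <- c(0, 1.12)
b2 <- c(-0.4, 10)
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}

# All combinations of slope and intercept, one row per pair.
grid <- expand.grid(m = m2, b = b2)

# mapply() walks along grid$m and grid$b in parallel;
# MoreArgs supplies the x that is shared by every call.
res <- mapply(lin, m = grid$m, b = grid$b, MoreArgs = list(x = xx))

dim(res)  # one column per (m, b) pair
# [1] 8 4
```

Each column of res corresponds to the matching row of grid, which makes it easy to see which slope and intercept produced which predictions.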

21.4 Control Structures

Control structures are a core feature of every programming and scripting language. They allow you to repeat or execute a set of commands given the outcome of a specific condition (e.g. a logical expression). Control structures are used when you assemble a series of commands into a script. We will take a look at two of the most common structures and give a practical example with a data-set provided for the workshop.

if statements will execute commands based on the outcome of a logical test.

for loops will repeat commands for a fixed number of times.⁵⁸

21.5 if statements

if statements, including else statements, allow your script to proceed based on the result of a logical expression. For example:

x <- 1
if (x <= 1) {
  print("Ready to proceed.")
} else {
  print("Error.")
}

# [1] "Ready to proceed."
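When there are more than two possible outcomes, conditions can be chained with else if. The following sketch uses a hypothetical helper, describe_sign(), to show the pattern. Note that if expects a single TRUE or FALSE, so the function is meant for one number at a time.

```r
# Hypothetical helper that classifies a single number.
describe_sign <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_sign(-3)
# [1] "negative"
describe_sign(0)
# [1] "zero"
describe_sign(2)
# [1] "positive"
```

The conditions are tested from top to bottom, and only the first branch whose condition is TRUE is executed.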

Exercise 21.3 Returning to the for loop that you build in the exercises of section 21.6, suppose that we need a quality control step that rejects data-sets containing more than 10% incomplete observations. Insert an if statement into the for loop which checks for this. If a data frame passes our quality check, add it to the final peptides data frame; if it doesn’t, print an error message to the screen.

Use an if statement, nrow(), sum(), print() and paste().

You should have the following message printed to the screen:

# [1] "Error in file data/peptides/p53_noacid_100_E15.txt"

If you completed all the exercises in section 21.6, your data frame should have the following dimensions:

dim(peptides)

# [1] 44  6

There is one further extension of if statements, the ifelse() function, which applies a test to every element of a vector and returns one value per element:

foo3

ifelse(foo3 %in% c("Heart", "Liver"), "yes", "no")
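Since the contents of foo3 are not shown here, the sketch below uses a hypothetical character vector, tissue, as a stand-in to illustrate how ifelse() works element by element.

```r
# Hypothetical stand-in for foo3, whose contents are not shown here.
tissue <- c("Heart", "Kidney", "Liver", "Lung")

# ifelse() tests every element and returns a vector of the same length:
# "yes" where the test is TRUE, "no" where it is FALSE.
in_set <- ifelse(tissue %in% c("Heart", "Liver"), "yes", "no")

# One answer per element of tissue.
length(in_set) == length(tissue)
# [1] TRUE
```

In contrast, a plain if statement evaluates a single condition and runs a block of commands once.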

21.6 for loops

for loops allow you to repeat a set of commands a given number of times. In the generic example below, i takes each value of 1:3 in turn, so it increases by one on every pass through the loop:

for (i in 1:3) {
  print(paste(LETTERS[i],"is at position",i))
}

# [1] "A is at position 1"
# [1] "B is at position 2"
# [1] "C is at position 3"

Contrast the above example to the for loop below, where i is the actual value in the vector, not a number:

for (i in LETTERS[1:3]) {
  print(paste("The letter is",i))
}

# [1] "The letter is A"
# [1] "The letter is B"
# [1] "The letter is C"
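The two styles can be combined: loop over positions, and use the position to look up the value. A minimal sketch using base R's seq_along() follows; letters_to_print and empty are hypothetical object names.

```r
letters_to_print <- LETTERS[1:3]

# seq_along() yields 1, 2, 3: use i for the position and
# letters_to_print[i] for the value.
for (i in seq_along(letters_to_print)) {
  print(paste(letters_to_print[i], "is at position", i))
}

# With an empty vector, 1:length(x) would count 1, 0;
# seq_along(x) yields nothing, so the loop body never runs.
empty <- character(0)
for (i in seq_along(empty)) {
  print("never printed")
}
```

Using seq_along() instead of 1:length(x) is a common habit because it behaves sensibly when the vector happens to be empty.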

This may seem like a trivial example, but looping is quite powerful when you need to repeat a given task several times. Consider the five files beginning with p53 in your workshop folder. See the repository here. Each file contains two columns of data. If we wanted to combine all five files into one data frame, we might naively do something like the following:

peptides1 <- read.delim("p53_noacid_100_E13.txt")
peptides2 <- read.delim("p53_noacid_100_E15.txt")
# and so on, creating a data frame for each file....

After creating five separate data frames, we could then merge them all into a single data frame. This is not only extremely tedious, but also error-prone. Imagine what would happen if we had 1000 files! Instead, we can use a for loop to read in each file and build a cumulative data frame.

Exercise 21.4 Use read.delim(), list.files(), bind_rows() and a for loop. Create a for loop that reads in all files beginning with p53 in the workshop folder, one after another, continuously building a single data frame from the individual data frames.

Save your data frame as an object called peptides. It should have the following properties:

names(peptides)

# [1] "mz"        "intensity"

dim(peptides)

# [1] 55  2

Exercise 21.5 This is a good start, but there is no way to identify which file each value in the cumulative peptides data frame comes from. Therefore, we need to add a third column (named variables) to each data frame before merging it with the compiled peptides data frame. peptides$variables should contain the name of the source file for every observation. Use rep() and nrow().

peptides should now have the following properties:

Exercise 21.6 The file names are composed of a combination of four variables, separated by underscores. Split the file name variable in the data frame into four separate variables and merge them with the values from the original data frame, redefining peptides so that it has the following names. Use separate() and bind_cols().

# [1] "mz"        "intensity" "var1"      "var2"      "var3"     
# [6] "var4"

21.6.1 while loops

while loops are a variant of for loops: instead of running a fixed number of times, they repeat commands for as long as a certain condition is true. However, be cautious! If the condition never becomes false, your script will never exit the loop. while loops have the following generic structure:

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

# [1] 1
# [1] 2
# [1] 3
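One way to guard against a loop that never ends is to count the iterations and leave the loop with break once a limit is reached. The following is a hypothetical sketch of that safeguard; value, n_iter and max_iter are illustrative names.

```r
# Hypothetical example: halve a number until it drops below 1,
# but never loop more than max_iter times.
value <- 100
n_iter <- 0
max_iter <- 50

while (value >= 1) {
  value <- value / 2
  n_iter <- n_iter + 1
  if (n_iter >= max_iter) {
    print("Stopping: iteration limit reached.")
    break
  }
}

n_iter
# [1] 7
```

Here the condition becomes false after seven halvings, so the limit is never reached; it only matters if the condition is written incorrectly or the data behave unexpectedly.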

21.7 Simulation Challenge

As a challenge, test your abilities with control structures by trying to solve the following puzzle.

Given that a prime number is a whole number, greater than 1, which can only be evenly divided by itself or 1, find the group of four prime numbers such that:

The sum of any combination of three numbers is also a prime number, and,
The sum of all four numbers is as small as possible.

One solution is to begin with the smallest four prime numbers and test if requirement 1 is fulfilled. If this is not the case, then we can begin taking larger prime numbers. This would work, but is a bit cumbersome because you would then have to test the other 4-number combinations which may have smaller group sums. An easier solution is to use for loops and if statements to simulate a large data-set containing many randomly sampled groups of four prime numbers and test each for its ability to fulfill the two requirements of the puzzle.

To help you solve this, I have provided an outline for you to follow. We will take the following strategy:

Make an object containing prime numbers. We can limit ourselves to the first 16 prime numbers (up to 53).

The schoolmath::primes() function provides a starting point:⁵⁹

library(schoolmath)
pNumbs <- primes(1, 53)[-1]
pNumbs

#  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53

# Score to beat⁶⁰
result <- 4*53

Randomly sample four numbers from our pool of 16 numbers.⁶¹
Determine⁶² if the sum of each combination of three numbers is also a prime number.
If⁶³ all four 3-number sums are prime and the four numbers have the lowest sum found so far, then they are saved as a new object called solution. The sum of these four numbers is the new “score to beat” (i.e. overwrite the result object).
Continue up to 10000 iterations, overwriting solution if a better four-number combination is found.

Your solution should be: 5, 7, 17, 19.
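The stated solution can be checked against requirement 1 without running the whole simulation. The sketch below uses base R's combn() to sum every combination of three numbers, and a hypothetical helper, is_prime_basic(), in place of schoolmath::is.prime() so that it runs without any packages. It confirms that the four numbers qualify; it does not show that their sum is the smallest possible.

```r
# Hypothetical base-R primality test, used here instead of
# schoolmath::is.prime() so the check has no package dependency.
is_prime_basic <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)          # 2 and 3
  all(n %% 2:floor(sqrt(n)) != 0)  # no divisor between 2 and sqrt(n)
}

solution <- c(5, 7, 17, 19)

# combn() builds every 3-number combination and sums each one.
triple_sums <- combn(solution, 3, FUN = sum)
triple_sums
# [1] 29 31 41 43

all(sapply(triple_sums, is_prime_basic))
# [1] TRUE

sum(solution)
# [1] 48
```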

⁵⁶ For more on lists see section 7.3.1.
⁵⁷ This is sometimes frowned upon in the R community, but sometimes you really do only want to use the function once.
⁵⁸ while loops, discussed in section 21.6.1, are a variant of for loops.
⁵⁹ Make a vector of prime numbers to sample from, pNumbs, using schoolmath::primes(). Exclude the first entry, since primes() has a bug which includes 1 as the first prime number.
⁶⁰ Here, I have also created a result object, which contains the highest possible sum. This is the “score to beat”, and will be replaced by the sum of the four sampled numbers, but only if it is less than the stored value.
⁶¹ sample() within a for loop.
⁶² schoolmath::is.prime(), sum() and a for loop.
⁶³ if statement.