Chapter 21 Element 10: Control Structures
21.1 Learning Objectives
By the end of this chapter you should:
- Know how to create basic custom functions
- Understand the two main types of control structures: conditional statements and loops
- Be able to manage multiple files in an efficient way
21.2 Defining New Functions
We can also write our own functions. You can get lots done without creating your own functions. However, they are useful for integrating repetitive actions into a longer script, or executing complicated actions from within other functions. Returning to the linear equation from the first exercise, we can make a function called equation(), using the following notation:
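The listing itself did not survive in this copy, so here is a minimal sketch of what such a function might look like. The slope 1.12 and intercept -0.4 are assumptions, chosen because they are consistent with the printed result, and so is the input vector xx.

```r
# Assumed values: slope 1.12 and intercept -0.4 (inferred from the printed
# result, not taken from the original listing).
equation <- function(x) {
  1.12 * x - 0.4
}

xx <- 1:8      # assumed input vector
equation(xx)
```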
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56
Exercise: Create a function called lin() that accepts three arguments, x, m and b, to calculate predicted values based on a \(y=mx+b\) equation. Just stick with the values of m and b we’ve been using so far.
21.3 Scoping
Scoping refers to the set of rules that a programming language uses in finding the value of an object. It’s exactly what happens when we call:
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# [11] FALSE FALSE FALSE FALSE FALSE
R looks in the global environment to see whether there is an object called m, and then prints it to the screen. Scoping also happens inside functions: if a name isn’t defined inside a function, R will progressively look one level up, until it reaches the global environment. This happens when the function is run, not when it’s created. That means that the output of a function can differ depending on objects outside its environment. That can be exactly what you want, but it can also be very dangerous!
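A minimal sketch of this behaviour, assuming a hypothetical function plus() whose body refers to an object y that it never defines. The body of plus() and the values assigned to y are assumptions, picked to be consistent with the messages printed next.

```r
# Hypothetical function: y is not an argument and is not defined inside plus(),
# so R has to look for it outside the function each time plus() runs.
plus <- function(x) {
  x + y
}

try(plus(1))   # fails as long as no object called y exists

y <- 10
plus(1)        # now uses the y found outside the function

y <- 19
plus(1)        # same call, different result, because y changed
```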
# Error in plus(1): object 'y' not found
# [1] 11
# [1] 20
Exercise: Give the m and b arguments of lin() default values, using m and b as previously used. This way, the user can supply these arguments, or not. You should be able to reproduce the following results:
# [1] 1 2 3 4 5 6 7 8
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56
# [1] 65 70 75 80 85 90 95 100
# [1] 11 12 13 14 16 17 18 19
That’s all good, but this:
# [1] -0.4 1.8 -0.4 4.1 -0.4 6.3 -0.4 8.6
still doesn’t work as expected! There are many solutions for this. We’ll take a look at a typical way using one of the apply functions. lapply() takes a list or a vector (in this case either xx or m2) and a function (lin(), plus any additional arguments) as input.
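The two calls might look like the sketch below. The definitions of lin(), xx and m2 are assumptions inferred from the results printed in this section (defaults of 1.12 and -0.4, and slopes of 0 and 1.12).

```r
# Assumed definitions, inferred from the printed results.
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}
xx <- 1:8
m2 <- c(0, 1.12)

# Loop over the slopes. Because x is supplied by name as an additional
# argument, each element of m2 fills the next free argument of lin(), m.
by_slope <- lapply(m2, lin, x = xx)

# Loop over the x values. Each element of xx becomes the first argument (x),
# and both slopes are passed along together as m.
by_x <- lapply(xx, lin, m = m2)

by_slope
by_x
```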
# [[1]]
# [1] -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
#
# [[2]]
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56
# [[1]]
# [1] -0.40 0.72
#
# [[2]]
# [1] -0.4 1.8
#
# [[3]]
# [1] -0.4 3.0
#
# [[4]]
# [1] -0.4 4.1
#
# [[5]]
# [1] -0.4 5.2
#
# [[6]]
# [1] -0.4 6.3
#
# [[7]]
# [1] -0.4 7.4
#
# [[8]]
# [1] -0.4 8.6
sapply() tries to simplify the output:
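As a sketch, under the same assumed definitions of lin(), xx and m2 (inferred from the printed results, not taken from the original listing), the call could be:

```r
# Assumed definitions, inferred from the printed results.
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}
xx <- 1:8
m2 <- c(0, 1.12)

# Each slope yields a vector of 8 values; sapply() binds these equal-length
# results together as the columns of a matrix instead of returning a list.
simplified <- sapply(m2, lin, x = xx)
simplified
```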
# [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56
Now, if we didn’t want to define a function outside of lapply() or sapply(), we could have created an anonymous function, i.e. a function without an explicit name.[57] Again, note that the m in our anonymous function masks the m in our environment, which remains unchanged.
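A minimal sketch of the anonymous-function version. The intercept -0.4, the vectors xx and m2, and the placeholder value given to the global m are all assumptions made for illustration.

```r
xx <- 1:8
m2 <- c(0, 1.12)
m  <- "global m"   # hypothetical object in the global environment

# The function is defined inline and never given a name. Inside it, m refers
# to the argument (one element of m2), not to the global object above.
res <- sapply(m2, function(m) m * xx - 0.4)
res

m   # the global m is untouched
```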
# [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56
This logic is the reason this also works:
# [,1] [,2]
# [1,] -0.4 0.72
# [2,] -0.4 1.84
# [3,] -0.4 2.96
# [4,] -0.4 4.08
# [5,] -0.4 5.20
# [6,] -0.4 6.32
# [7,] -0.4 7.44
# [8,] -0.4 8.56
We could also have taken a more complicated route and used a for loop, but we’ll return to this in section 21.6.
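A sketch of the for loop version, again assuming the definitions of lin(), xx and m2 inferred from the printed results:

```r
# Assumed definitions, inferred from the printed results.
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}
xx <- 1:8
m2 <- c(0, 1.12)

# One pass per slope. Inside a loop nothing is printed automatically,
# so print() is called explicitly.
for (m_i in m2) {
  print(lin(xx, m = m_i))
}
```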
# [1] -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
# [1] 0.72 1.84 2.96 4.08 5.20 6.32 7.44 8.56
What happens when we have more than one value for b, the third argument? For example, what if b is both -0.4 and 10?
# [1] 0.6 1.6 2.6 3.6 4.6 5.6 6.6 -0.4 -0.4 -0.4 -0.4 -0.4
# [13] -0.4 -0.4 -0.4
# [1] 11 12 13 14 15 16 17 10 10 10 10 10 10 10 10
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0.72 1.8 3 4.1 5.2 6.3 7.4 8.6
# [2,] 11.12 12.2 13 14.5 15.6 16.7 17.8 19.0
What about when we have many ms and many bs? We can nest our solutions, either as for loops or as lapply() functions.
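Both kinds of nesting can be sketched as follows. The definitions of lin(), xx and m2 are the assumed ones used throughout this section, and b2 is a hypothetical vector holding the two intercepts mentioned above.

```r
# Assumed definitions, inferred from the printed results.
lin <- function(x, m = 1.12, b = -0.4) {
  m * x + b
}
xx <- 1:8
m2 <- c(0, 1.12)
b2 <- c(-0.4, 10)   # hypothetical name for the two intercepts

# Nested for loops: one pass per (slope, intercept) pair.
for (m_i in m2) {
  for (b_j in b2) {
    print(lin(xx, m = m_i, b = b_j))
  }
}

# Nested lapply() calls: a list with one entry per slope, each of which is
# a list with one entry per intercept.
nested <- lapply(m2, function(m_i) {
  lapply(b2, function(b_j) lin(xx, m = m_i, b = b_j))
})
nested[[2]][[1]]   # slope 1.12, intercept -0.4
```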
21.4 Control Structures
Control structures are a core feature of every programming and scripting language. They allow you to repeat or execute a set of commands given the outcome of a specific condition (e.g. a logical expression). Control structures are used when you assemble a series of commands into a script. We will take a look at two of the most common structures and give a practical example with a data-set provided for the workshop.
- if statements will execute commands based on the outcome of a logical test.
- for loops will repeat commands for a fixed number of times.[58]
21.5 if statements
if statements, including else statements, allow your script to proceed based on the result of a logical expression. For example:
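The original condition is not shown in this copy. The sketch below uses a made-up object, files_found, and a made-up else message purely to illustrate the if/else shape.

```r
# Hypothetical check: the object, the threshold and the else message are
# invented for illustration.
files_found <- 5

if (files_found > 0) {
  msg <- "Ready to proceed."
} else {
  msg <- "No files found."
}
print(msg)
```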
# [1] "Ready to proceed."
Exercise 21.3 Returning to the for loop from section 21.6, suppose that we need a quality-control step that rejects data-sets containing more than 10% incomplete observations. Insert an if statement into the for loop which checks for this. If the data frame passes our quality check, add it to the final peptides data frame; if it doesn’t, print an error message to the screen. Use an if statement, nrow(), sum(), print() and paste().
# [1] "Error in file data/peptides/p53_noacid_100_E15.txt"
If you completed all the exercises on the preceding page, your data frame should have the following dimensions:
# [1] 44 6
There is one further extension of if statements, the ifelse() function:
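As a brief sketch with made-up values: ifelse() applies the test to every element of a vector and returns one result per element, whereas a plain if statement evaluates a single condition.

```r
# Made-up intensities, for illustration only.
intensity <- c(12, 0, 7, 30)

# For each element: "high" where the test is TRUE, "low" where it is FALSE.
labels <- ifelse(intensity > 10, "high", "low")
labels
```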
21.6 for loops
for loops allow you to repeat a set of commands a given number of times. In the generic example below, i starts as a number and automatically increases by one at the end of each pass through the loop:
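A sketch of such a loop, assuming a three-letter character vector x (the vector and its name are assumptions that fit the printed messages):

```r
x <- c("A", "B", "C")   # assumed input

# i takes the values 1, 2, 3 in turn and is used as an index into x.
for (i in seq_along(x)) {
  print(paste(x[i], "is at position", i))
}
```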
# [1] "A is at position 1"
# [1] "B is at position 2"
# [1] "C is at position 3"
Contrast the above example to the for loop below, where i is the actual value in the vector, not a number:
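Sketched with the same assumed vector x:

```r
x <- c("A", "B", "C")   # assumed input

# Here i is each element of x in turn, so no indexing is needed.
for (i in x) {
  print(paste("The letter is", i))
}
```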
# [1] "The letter is A"
# [1] "The letter is B"
# [1] "The letter is C"
This may seem like a trivial example, but looping is quite powerful when you need to repeat a given task several times. Consider the five files beginning with p53 in your workshop folder. See the repository here. Each file contains two columns of data. If we wanted to combine all five files into one data frame, we might naively do something like the following:
peptides1 <- read.delim("p53_noacid_100_E13.txt")
peptides2 <- read.delim("p53_noacid_100_E15.txt")
# and so on, creating a data frame for each file....
After creating 5 separate data frames, we could then merge them all into a single data frame. This is not only extremely tedious, but also error-prone. Imagine if we had 1000 files! Instead, we can use a for loop to read in each file and build a cumulative data frame.
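The read-and-accumulate pattern can be sketched on throwaway files, so that it runs anywhere. The folder, file names and values below are made up, and base R's rbind() is used in place of dplyr's bind_rows() to avoid a package dependency.

```r
# Write three small tab-delimited files into a temporary folder
# (stand-ins for the workshop files; names and values are made up).
demo_dir <- file.path(tempdir(), "peptides_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:3) {
  toy <- data.frame(mz = k * 100 + 1:4, intensity = k * 10 + 1:4)
  write.table(toy, file.path(demo_dir, paste0("p53_demo_", k, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Find the files, then read them one at a time and append each one to a
# running result.
files <- list.files(demo_dir, pattern = "^p53", full.names = TRUE)
combined <- data.frame()
for (f in files) {
  one_file <- read.delim(f)
  combined <- rbind(combined, one_file)
}
dim(combined)
```

Growing a data frame inside a loop is fine for a handful of files. For thousands of files it is usually faster to collect the pieces in a list and bind them once at the end.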
Exercise: Create a for loop that reads in all files beginning with p53 in the workshop folder, one after another, continuously building a single data frame from the individual data frames. Use read.delim(), list.files(), bind_rows() and a for loop. Save your data frame as an object called peptides. It should have the following properties:
# [1] "mz" "intensity"
# [1] 55 2
Exercise: As it stands, there is no record of which file each observation in the peptides data frame comes from. Therefore, we need to add a third column (named variables) to each data frame before merging it with the compiled peptides data frame. peptides$variables should contain the name of the source file for every observation. Use rep() and nrow(). peptides should now have the following properties:
Exercise: Split the variables column of peptides so that it has the following names. Use separate() and bind_cols().
# [1] "mz" "intensity" "var1" "var2" "var3"
# [6] "var4"
21.6.1 while loops
while loops are a variant of for loops, except that in this case, they repeat commands while a certain condition is true. However, be cautious! If the condition never becomes false, your script will never exit the loop. While loops have the generic structure of:
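A minimal sketch, assuming a counter i that starts at 1 and a stopping value of 3:

```r
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1   # without this line the condition would never become FALSE
}
```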
# [1] 1
# [1] 2
# [1] 3
21.7 Simulation Challenge
As a challenge, test your abilities with control structures by trying to solve the following puzzle.
Given that a prime number is a whole number, greater than 1, which can only be evenly divided by itself or 1, find the group of four prime numbers such that:
- The sum of any combination of three numbers is also a prime number, and,
- The sum of all four numbers is as small as possible.
One solution is to begin with the smallest four prime numbers and test if requirement 1 is fulfilled. If this is not the case, then we can begin taking larger prime numbers. This would work, but is a bit cumbersome because you would then have to test the other four-number combinations which may have smaller group sums. An easier solution is to use for loops and if statements to simulate a large data-set containing many combinations of four prime numbers and test each for its ability to fulfill the two requirements of the puzzle.
To help you solve this, I have provided an outline for you to follow. We will take the following strategy:
- Make an object containing prime numbers. We can limit ourselves to the first 16 prime numbers (up to 53). The schoolmath::primes() function provides a starting point:
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53
- Randomly sample four numbers from our pool of 16 numbers.[61]
- Determine[62] if the sum of each combination of three numbers is also a prime number.
- If[63] all four three-number combinations are prime and if they have the lowest four-number sum, then they are saved as a new object called solution. The sum of these four numbers is the new “score to beat” (i.e. overwrite the result object).
- Continue up to 10000 iterations, overwriting solution if a better four-number combination is found.
Your solution should be: 5, 7, 17, 19.
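To check this answer without relying on random sampling, here is a sketch that swaps in a different technique: an exhaustive search over every group of four primes using combn(). The helper is_prime() is hypothetical and written in base R, so the schoolmath package is not needed.

```r
# Hypothetical helper: trial division is enough for numbers this small.
is_prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)                 # 2 and 3
  all(n %% 2:floor(sqrt(n)) != 0)
}

pNumb <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53)

# Every group of four primes, one group per column.
groups <- combn(pNumb, 4)

# Requirement 1: all four three-number sums must be prime.
passes <- apply(groups, 2, function(g) {
  all(sapply(combn(g, 3, sum), is_prime))
})

# Requirement 2: among the groups that pass, take the smallest total.
valid <- groups[, passes, drop = FALSE]
solution <- valid[, which.min(colSums(valid))]
solution
```

With only 1820 groups to test, the exhaustive search is cheap. The sampling approach in the outline above remains the better exercise for practising for loops and if statements.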
[57] This is sometimes frowned upon in the R community, but sometimes you really do only want to use the function once.
[58] while loops, discussed in section 21.6.1, are a variant of for loops.
[59] Make a vector of prime numbers to sample from, pNumb, using schoolmath::primes(). Exclude the first entry, since primes() has a bug which includes 1 as the first prime number.
[60] Here, I have also created a result vector, which contains the highest possible sum. This is the “score to beat”, and will be replaced by the sum of the four sampled numbers, but only if it is less than the stored value.
[61] sample() within a for loop.
[62] schoolmath::is.prime(), sum() and a for loop.
[63] if statement.