7 Writing Functions

7.1 Learning Objectives

Introduce the fundamentals of writing custom functions:

  • Their use cases,
  • Basic structure, and
  • Limitations

7.2 Preparation

Until now, you have been reviewing material that should already be familiar to you from earlier courses. In this section we’ll begin introducing new topics.

Before we get going, you should have chosen one of the three challenges in the previous chapter to focus on for the rest of the course. Each challenge consists of three tasks and two output examples were provided so you can see where we’re headed. Your goal so far was to write the R commands to produce the output shown in those two examples given for each challenge. You don’t need to reproduce the plot output exactly, but for the numerical output, yes! As we continue with the coure, we’ll keep returning to these examples for you to apply your skills. You’ll also trade notes with your classmates working on the same challenge, so make sure you give it a try.

  1. Make sure you’ve downloaded the data files from the course repository. You can find the link in the couse pages.
  2. Make sure you are working in an RStudio project. You should see the name of your project in the upper-right corner of RStudio. Don’t navigate around your hard drive using the files panel inside RStudio. All the files you need should be in your project folder! If you see (none) then you don’t have a project. Plese open the DAwR_I_data.Rproj after you’ve unzipped (extracted) the repository contents.
  3. Begin a new script with a suitable header. This should include a title, your name, date and a description. Load all the necessary packages at the top of your script and comment throughout.

7.3 Introduction

As you develop your R skills and begin writing complete analytical workflows, you’ll soon face one or more of these typical issues:

  1. You’ve created a single R script, that’s as unmanageable as it is long. No matter how well you’ve commented and formatted your code, you’ll eventually lose oversight of long scripts. Thus, you’d like to isolate distinct parts, so you can better manageable the entire workflow.
  2. You’ve solved a specific & isolated issue in your analytical workflow. For example, importing a file with a difficult format or idiosyncratic structure. Thus, you’d like to archive that code chunk, so you can focus on the next step.
  3. You’ve solved a general & integrated issue in you analytical workflow. For example, a generic plot that can take some subset of your data, like only one Uniprot ID, and export a custom plot, with the Uniprot ID integrated into the file name and plot title, to the hard drive. Thus, you want to repeat the same code with slight adjustments, thus avoiding multiple copy/paste/edit cycles, each risking the possibility of error.

Solutions to the above issues set out the goals of isolating, archiving, and repeating chunks of functional code. By functional code, I mean a short chunk of code that does, ideally, _one thing, and one thing only, and it does it very well. Writing custom functions is the solution to all the issues stated above. In addition, we’ll eventually discover that custom functions also allow for elegant automation of our workflows.

Our custom functions will allow us to:

  1. Give a descriptive and meaningful name to a code chunk, making out code easier to understand.

  2. Use placeholders for the parts that need to change, while having only a single copy of the original code chunk.

  3. Reduce repetition in our code, and thereby the chance of making hard-to-spot errors.

7.4 When should you write a function?

The general rule-of-thumb is that if you copy/paste a code chunk more than 3 times, you should write a function. That deals with the third reason given above, repeating code. But you can just as easily make a function for isolating and archiving code. Even if it only appears once in your script, you may want to put it aside so you can move on to the next hurdle.

Remember, a function should, ideally, do one thing, and one thing only, and it should do it well.

Let’s look at an example. We want to make a plot using the mtcars dataframe. First, let’s clean up the dataframe. We’ll make those variables that are categorical proper factor variables, and we’ll also take the names of the cars from the row names metadata and place it as an actual variable in the dataframe.

# load packages
library(tidyverse)

# Using original syntax and functions
# mtcars %>% 
#   mutate_at(vars(vs, am, gear, carb), list(f = as.factor))

# Using updated syntax and functions
mtcars %>% 
  mutate(across(c(cyl, vs, am, gear, carb), 
                list(f = ~as.factor(.x))),
         car = row.names(mtcars)) -> mtcars

glimpse(mtcars)
#> Rows: 32
#> Columns: 17
#> $ mpg    <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 2…
#> $ cyl    <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8…
#> $ disp   <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 3…
#> $ hp     <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 1…
#> $ drat   <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3…
#> $ wt     <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3…
#> $ qsec   <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 1…
#> $ vs     <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0…
#> $ am     <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ gear   <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3…
#> $ carb   <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4…
#> $ cyl_f  <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8…
#> $ vs_f   <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0…
#> $ am_f   <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ gear_f <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3…
#> $ carb_f <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4…
#> $ car    <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710",…
Exercise 7.1 Run the commands above to update the mtcars dataframe. The dimensions should be 32 x 17 and the object mtcars should be visible at the top of the global environment panel in RStudio.

Alright, let’s make a generic plot. The idea is that we want to switch out the mapping for the x, y and color aesthetics at will. And, perhaps we’ll want to include some additional material in the plot later on, like a specific title, or a sub-title that summarizes what we see.

ggplot(mtcars, aes(x = wt, y = mpg, color = cyl_f)) +
  geom_point(shape = 16, alpha = 0.6) +
  NULL

To write our function, we first need to identify all the parts that could change each time we run the above plotting command. In this case there are three arguments. Note that we don’t need to limit ourselves to using only arguments inside functions as the place holders, although it is common and logical place to start. We can get creative and use placeholders in conditional statements that will control the flow of our code. We’ll get to that soon enough.

The next step is to write a generic command using palceholders, i.e. generic names instead of the actual names. We can name our placeholders as we like, e.g. as such:


ggplot(mtcars, aes(x = independent, y = dependent, color = grouping)) +
  geom_point(shape = 16, alpha = 0.6) +
  NULL

This isn’t too bad, but just like objects, we want these placeholders to be short and descriptive. It may seem confusing but we’re going to call our placeholders after the arguments they are assigned to, as such::


ggplot(mtcars, aes(x = x, y = y, color = color)) +
  geom_point(shape = 16, alpha = 0.6) +
  NULL

Remember, the LHS of the = operator is the name of the arguent, the RHS is the input, here objects x, y and color. Since it doesn’t matter what they’re called, let’s call them by their names. Notice that we haven’t changed the data set. We’ll get to that in a second.

Now we can make our function using the function() function – yes, R is very literal sometimes! We assign the output of the function() function to an object, here called plot_cars. Remember what we discussed when we introduced objects in the introductory course: “Everything that exists is an object,” that means that functions are also objects. It seems obvious, but it’s worth pointing out.

The place holders are given as arguments to function(). As the moment we don’t have any values assigned to them. The name of the arguments to function() must match the names of the placeholders!

Finally, we cap the whole thing off with curly brackets: {}. We already know that round brackets, () define the arguments for a function, and that the square brackets, [] define the indexing of an object, typically a data-storage object. The curly brackets, {} are used throughout R to define a code chunk that will be executed together, as a single unit. In RStudio, you’ll notice a little arrowhead icon your scripts left margin where the script’s row numbers are. Click on this little arrowhead and it will collapse your function into a sigle line.

plot_cars <- function(x, y, color) {
  ggplot(mtcars, aes(x = x, y = y, color = color)) +
    geom_point(shape = 16, alpha = 0.6) +
    NULL
}
Exercise 7.2 Before proceeding with the following exercises, make sure you’ve run the commands above to define the plot_cars() function. It should be visible in the functions section at the bottom of your global environment panel in RStudio.

OK, so how do we get our plots? You may assume that we can just go ahead and call our function with soe nice arguments! We’ll you’d be right and well within reason to assume that!

But, things operate a bit differently when using tidyverse functions. We’ll explore this in depth in the next section! For now try to answer the following questions.

Exercise 7.3 Confirm that you’ve updates mtcars and defined the custom function plot_cars() as above.

Consider the three ways of inputting arguments into our custom function, shown below.

For each attempt, answer the following questions.
  1. Yes of No: Should we expect this command to work given what we know of R and executing functions? Explain, why or why not.

  2. Execute the code. Does it work or not? Can you explain why or why not now? Are you surprised by the outcome?

  3. Is is a good idea that this does or doesn’t work? Why?

# Attempt A:
plot_cars(x = wt, y = mpg, color = cyl_f)
# Attempt B:
plot_cars(x = "wt", y = "mpg", color = "cyl_f")
# Attempt C:
plot_cars(x = mtcars$wt, y = mtcars$mpg, color = mtcars$cyl_f)

To get our function to work we need to understand a bit about non-standard evaluaiton, programming in the tidyverse and scoping rules for functions. We’ll get there in the next section! For now, let’s focus on mastering our basic functions.

Let’s recap and elaborate on the steps taken so far:

  1. Identify a functional code chunk

We decided that some chunk of code operates as a cohesive and complete functional unit. Maybe I just want to isolate it so I don’t make any inadvertant istakes, or I want to archive it by putting it in another script, so I don’t need to be bothered by it, or I want to repeat it as many times as I want. In the end it’s allthe same difference, we have some functional code chunk.

This is an important first step! Begin with something that you know already works in your plain R script! Starting with a blank function is a lot more tricky, until you get the hang of it, so make your life easier by following these steps.

  1. Identify placeholders

Then we identified where we would like to change parts of it when we call that code chunk. Recall two things here: (1) not every function needs arguments, e.g. ls() prints the names of the user-defined objects in the global environment to the R console (try it!), and (2) not every function is called for it’s output, some are called for their side effets. Output means we will see some text printed to R console, or an image printed to the graphics device. Side effects are when we get neither, but some action was accomplished. For example, options(digits = 2) set a global R setting for printing numbers in the console with up to 2 significant digits. Another example, write.file() and ggsave() both have the side effect of saving an object to the hard drive.

  1. Name your function

Our function does one thing, it prints a nice plot to the screen, and it does it well. The name is short and descriptive.

  1. Provide the arguments to function()

The placeholders now becomre the arguments for function(), which in turn are the arguments for our custom function.

  1. Wrap it up with {}

The body of the function in contained within {}

  1. Test your function

Ok, here’s where things got murky. We did get some failures and a success, but that success just doesn’t feel right. We’ll discuss that in more detail in class.

This is an important part of the “do not repeat yourself” (or DRY) principle. The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.

7.5 Exercises

Exercise 7.4 The first thing we did in this chaper was data munging. It’s a neutral term that refers to all the cleaning and pre-processing steps we do to our data before we can actually begin to analze it.

Write a function that encapsulates that data munging step so that I just have a single line to call and not several commands spread over several lines in my script.
Exercise 7.5 Before we move on, organize your functions by follow the steps below.
  1. Open a new R script and save it as “functions.R” in the root directory of your RStudio project folder. This project should contain the data files you downloaded from the course repository.
  2. Copy and paste the entire plot_cars() function, plus the function for data munging you made in the last exercise, from your script to “functions.R.” Save the file.
  3. Restart R using the menu options in RStudio (Session > Restart R) to be sure you’re working in a fresh Environment.
  4. Back in your main script, enter the command source("functions.R") after you load any necessary packages. If all goes well and you’re working in an RStudio project, you should see the two functions in your Global Environment. If you don’t make sure you are working inside a project and check your working directory with getwd(). It should be the root directory of your project,where both your current script and the “functions.R” script exist. You shouldn’t needt o change the comand. If it doesn’t work, it likely that you are not in the correct project.
Exercise 7.6 Use the following instuctions to complete this exercise as well as Exercises 7.7 & 7.8. These exercises become progressively more difficult, so make sure you give them a try!

In each exercise you are provided with three pieces of information:

  1. A code chunk,
  2. An example of a valid input, and
  3. Expected output from a function.

Your job is to write the function that produces the output shown from the example input. Answer these questions to organize yourself:

  1. Describe the purpose of the command in the code chunk.
  2. Choose an appropriate function name and corresponding arguments.
  3. Write a custom function following the syntax provided in this chapter.
  4. Test your function. Do you see the same output?

The following is code chunk 1:

# Hint: The output will be a proportion, i.e. [0,1], but of what?
sum(is.na(x))/length(x)

# Where x is a vector containing some missing values

Example of valid input:

xx <- 1:15
xx[seq(1, 15, 3)] <- NA
xx
#>  [1] NA  2  3 NA  5  6 NA  8  9 NA 11 12 NA 14 15

Custom function output, given xx:

#> [1] 0.3333333
Exercise 7.7 Follow the instructions given for Exercise 7.6.

The following is code chunk 2:

# Hint: On plots, this is a common, yet uninformative, error bar.
sd(x)/sqrt(length(x))

# Where x is a numeric vector

Custom function output, given xx, as per Exercise 7.6:

#> [1] 1.424001

Did you manage to deal with the missing values?

Exercise 7.8 Follow the instructions given for Exercise 7.6.

The following is code chunk 3:

# Hint: It's the value which the best linear model will minimize
sum(x$residuals^2)

# where x is an object of class lm, see below:

Example of valid input:

yy <- lm(mpg ~ wt, data = mtcars)
class(yy)
#> [1] "lm"

Custom function output, given yy:

#> [1] 278.3219

You can confirm your result by calling:

anova(yy)
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.7252 847.725250 91.37533 0
Residuals 30 278.3219 9.277398 NA NA

Can you see where the value we calculate appears?

Exercise 7.9 Write posn_match(), a function that takes two vectors of the same length and returns an integer vector containing the positions where the two vectors agree.

Example input:

aa <- seq(1, 30, 3)
aa
#>  [1]  1  4  7 10 13 16 19 22 25 28
bb <- c(1, 28, 7,  22, 13, 16, 19, 10, 25, 4)
bb
#>  [1]  1 28  7 22 13 16 19 10 25  4

Vectoraa and bb have identical values at positions 1, 3, 5, 6, 7 & 9. Thus, the function’s output is:

posn_match(aa, bb)
#> [1] 1 3 5 6 7 9
Exercise 7.10 There are three short questions to this exercise concerning the Winsor mean.

The Winsor mean is a robust version of the mean. It replaces the lower and upper \(5^th\) percentiles with the closest remaining values within the central 90th percentile.

Consider the example input mm, a numerical vector with some extreme positive values:

set.seed(30)
mm <- rnorm(n = 30, mean = 0, sd = 9)
mm <- c(mm, -346, -764)
mm
#>  [1]  -11.5966638   -3.1292046   -4.6946596   11.4612585
#>  [5]   16.4206854  -13.6017715    0.9945724   -6.8471660
#>  [9]   -6.0290732    2.4706772   -9.2094482  -16.3745812
#> [13]   -6.0101083   -0.5336819    7.9214932    2.4166162
#> [17]   -0.1762144   -4.7245228  -12.6839829  -16.5059029
#> [21]   -1.4248286    6.7898391   -8.2091666    7.1993803
#> [25]   13.4149770   -9.8676179   -4.8079862  -12.7908272
#> [29]  -11.1846447    2.0874256 -346.0000000 -764.0000000

We can confirm the extreme values in a histogram:

data.frame(x = mm) %>% 
  ggplot(aes(mm)) +
  geom_histogram(binwidth = 10, center = 5) +
  theme_classic() +
  coord_cartesian(expand = 0)

We can calculate the Winsor mean using the custom function defined here:

winsor_mean <- function(x, cap = 0.05) {
  # Remove NAs and sort vector
  x <- na.omit(x)
  x <- sort(x)
  
  # Replace lower cap with next highest value
  x[x <= quantile(x, cap)] <- NA
  x[is.na(x)] <- x[min(which(!is.na(x)))]
  
  # Replace upper cap with next lowest value
  x[x >= quantile(x, 1 - cap)] <- NA
  x[is.na(x)] <- x[max(which(!is.na(x)))]
  
  # Calculate the Winsor mean:
  mean(x)
}

Testing our results reveals:

# The standard version of the mean is not robust
mean(mm, na.rm = TRUE)
#> [1] -37.47579
# The Winsor mean is robust
winsor_mean(mm)
#> [1] -4.03594
  1. Describe the value of 0.05 assigned to the cap argument in function().
  2. Explain what the following command does and why.
winsor_mean(mm, cap = NULL)
#> [1] -37.47579
  1. Given the ways we can use our function, could we have composed it in a better way?