Chapter 20 Case Study: Depletion Curves

20.1 Learning Objectives

By the end of this chapter you should be able to:

  • Perform low-level exploratory data analysis (EDA)
  • Munge data with dplyr
  • Merge data files
  • Make quick plots with base graphics and with ggplot2

20.2 Introduction

Microscale thermophoresis (MST) experiments are used to measure the affinity of the interaction between two different molecules. Affinity is expressed as the dissociation constant (KD). The goal of the experiment in this case study is to measure the KD of the binding between neutrophil elastase and its ligand elafin directly in plasma.
Knowing the KD of a molecular interaction is useful for several purposes. For example, it helps us understand reaction mechanisms, it is important for drug screening in pharmacology, and it can serve as a parameter in clinical disease diagnostics.

In MST experiments, one of the two molecules is labelled with a fluorescent dye. Here, neutrophil elastase was labelled with the Alexa Fluor 647 dye. A spot in the solution is then heated with an infrared laser to create a temperature gradient of 3–4 K. The temperature gradient makes different molecules move differently depending on their size, charge, and hydration shell. This effect is called thermophoresis, and it can be monitored by measuring the fluorescence around the heated spot in the solution. The degree to which the fluorescently labelled molecule is bound to its ligand (from free to completely bound) is reflected in the amount of thermophoresis.

In this case study we will work with the raw data to develop depletion curves.

20.3 Case-study Details

For this analysis, 15 concentrations of the unlabelled ligand molecule were tested. The raw data we will use are contained in two files (described below) and are depicted in fig. 1.

Curves.txt contains thermophoresis time traces for 15 samples containing the same concentration of the fluorescently labelled binding molecule and 15 different concentrations of the unlabelled ligand. For each ligand concentration, the experiment is performed in triplicate. The columns are arranged as presented in the margin.

  • time
  • fluorescence for 0.031nM run1
  • …
  • time
  • fluorescence for 500nM run1
  • time
  • fluorescence for 0.031nM run2
  • …
  • time
  • fluorescence for 500nM run2
  • time
  • fluorescence for 0.031nM run3
  • …
  • time
  • fluorescence for 500nM run3

BG.txt contains the data for thermophoresis time traces of the background (only plasma in the sample, no fluorophore), measured in triplicate. The column arrangement is presented in the margin.

  • time
  • fluorescence run1
  • time
  • fluorescence run2
  • time
  • fluorescence run3

Your task is to subtract the BG.txt curves from the Curves.txt curves. Then, calculate depletion levels by dividing the fluorescence level when the temperature-gradient laser is on (e.g. from the 29th second until the 29.9th second) by the fluorescence level when the laser is off (e.g. from the 4.5th second until the 5.4th second). Next, you’ll average the triplicate depletion levels for each ligand concentration and plot the binding curve (averaged depletion levels against ligand concentrations).

20.4 Read in the background data

Begin a new R Markdown script. Fill in the details with the title of this exercise, your name, and the date.

Exercise 20.1 Import the BG.txt file and save it as an object called bg. Use the typical functions for exploring the structure of a data frame to confirm that the file contains what you expect.
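A minimal sketch of one possible approach, assuming BG.txt is a tab-delimited file in your working directory:

```r
# Import the background data (file name and delimiter are assumptions)
bg <- read.delim("BG.txt")

# Typical structure-exploration functions
str(bg)
head(bg)
dim(bg)
names(bg)
summary(bg)
```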

Pay attention to how the variables are named. Notice that they all begin with an X, because R doesn’t allow variable names to begin with a digit. You should also notice that the columns don’t exactly match what we expected given the file description above.

Exercise 20.2 Remove the last column of bg.
We can blindly remove the last column, but if we want to be thorough, we should run a check first.
We could also have specified to take only the first six columns in the read.delim() command.

Maybe this particular file has this quirk, but future files may be fine, so if we blindly remove the last column we risk losing data! Therefore, we run a check: if the last column is full of NA values we remove it, otherwise we keep it.

Exercise 20.3 (Optional) Write an if statement that checks if the last column is only NA values. If TRUE then remove it.
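One possible solution, written so it works on any data frame:

```r
# Drop the last column only if it contains nothing but NA values
if (all(is.na(bg[, ncol(bg)]))) {
  bg <- bg[, -ncol(bg)]
}
```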

The columns containing time are identical, so they are redundant.

Exercise 20.4 Use regular expressions to look for redundant time columns and remove them. This should correspond to the third and fifth columns, but we want to avoid hard coding (e.g. c(3,5)).
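A sketch of the idea; the exact pattern depends on what names(bg) shows for the time columns in your file, so treat the regular expression below as an assumption to adjust:

```r
# Identify the time columns by a name pattern (adjust after inspecting names(bg))
time_cols <- grep("_t$|time", names(bg), ignore.case = TRUE)

# Keep the first time column and drop the redundant ones (no hard-coded c(3, 5))
if (length(time_cols) > 1) {
  bg <- bg[, -time_cols[-1]]
}
```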
Exercise 20.5 Change the column names to "time", "bg_1", "bg_2", and "bg_3".
These column names will make it easy to work with them later on.
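For example:

```r
names(bg) <- c("time", "bg_1", "bg_2", "bg_3")
```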

You should have been able to complete all the exercises up to this point with the material covered by the end of day 2. The next tricks show how to work with the data further and combine the background and raw data. We could continue to use the data in this format, which is wide-format data, but we will use the material covered on day 3 to transform it to long-format data.

Exercise 20.6 Gather bg using the time variable as the id variable.

The resultant data frame should have the dimensions 1425, 3.
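A sketch using tidyr; the key and value column names ("var" and "value") are chosen here so they match the separate() step later in the chapter:

```r
library(tidyr)

# Wide to long: one row per time point per background replicate
bg <- gather(bg, key = "var", value = "value", -time)

dim(bg)  # should be 1425, 3
```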

20.5 Read in the raw data

Curves.txt contains more information, since it holds all 15 concentrations of the unlabelled ligand molecule.

Exercise 20.7 Read in Curves.txt and save it as an object called curves. Use the typical functions for exploring the structure of data frames.

When looking at the structure of curves you’ll notice something strange.

Exercise 20.8 Some variables are classified as factors while others are numeric. This is problematic, because we know that all our variables should be numeric. Where do you think this comes from?

The reason for this is that NA values are coded using different symbols, so we need to take this into account when reading in the file!

Exercise 20.9 Looking at the structure of the data frame, what are the different ways in which NA values are coded?
Exercise 20.10 Import the data again, this time making sure to specify the different ways that NA values can be coded.
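A sketch of the re-import; the exact strings listed in na.strings are whatever you identified in the previous exercise, so the values below are only examples:

```r
# Declare every symbol that encodes a missing value in this file
curves <- read.delim("Curves.txt",
                     na.strings = c("NA", "na", "n/a", "-", ""))

str(curves)  # all columns should now be numeric
```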
Exercise 20.11 We again find that we have a mystery last column containing only NA values. Use the conditional statement you wrote earlier to remove this column. Also, the last row contains some NA values, so we can remove it.
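One possible way to do both steps:

```r
# Reuse the all-NA check for the last column
if (all(is.na(curves[, ncol(curves)]))) {
  curves <- curves[, -ncol(curves)]
}

# Remove the final row, which contains NA values
curves <- curves[-nrow(curves), ]
```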

The dimensions of curves should now be 475, 90.

Exercise 20.12 We also have the same issue that the time variable is repetitive. We should write a statement that confirms these are really identical before removing them, but in this case we know they are identical. Using the regular expression you wrote above, remove all the redundant time variables except for the first.
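A sketch, assuming (as the names quoted below suggest) that the time columns end in "_t":

```r
# Find all time columns and keep only the first one
time_cols <- grep("_t$", names(curves))
if (length(time_cols) > 1) {
  curves <- curves[, -time_cols[-1]]
}
```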

The curves data frame now has the dimensions 475, 46 and names beginning X0.0_t, X0.0_f, X0.1_f, X0.1_f.1, X0.2_f, X0.5_f, and so on. We should give the columns meaningful names to make them easier to work with later on. Notice that the concentrations appear in the variable names, but they have been rounded; this causes problems, in particular for the fourth and fifth columns, where 0.061 and 0.122 have both been rounded to 0.1. We also want a consistent naming format like the one we used for the bg data frame, because we’ll combine all of this pretty soon.

Use the following command to define the concentrations used. The "_" is part of the new name. Just replace the ___ parts.
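As a hedged sketch with the blanks filled in, and assuming (from the values 0.031, 0.061, 0.122 and 500 nM quoted in this chapter) that the 15 concentrations form a two-fold dilution series starting from 500 nM, such a command could look like this:

```r
# 15 ligand concentrations: a 1:2 dilution series from 500 nM (an assumption)
conc <- round(500 / 2^(14:0), 3)
# 0.031 0.061 0.122 0.244 0.488 0.977 1.953 3.906 7.812 15.625 31.25 62.5 125 250 500
```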

Exercise 20.13 We also want to have a consistent naming convention for the column headers to make them easier to work with in the future (see the sketch after this list). We will gather this data afterwards. Change the column names to:

  • time
  • conc0.031_1, conc0.061_1, conc0.122_1, …
  • conc0.031_2, conc0.061_2, conc0.122_2, …
  • conc0.031_3, conc0.061_3, conc0.122_3, …
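One way to build these names, assuming the concentration vector conc defined above and the run-1, run-2, run-3 column ordering described for Curves.txt:

```r
# "time" first, then conc0.031_1 ... conc500_1, conc0.031_2 ... conc500_3
names(curves) <- c("time",
                   paste0("conc", rep(conc, times = 3), "_", rep(1:3, each = 15)))
```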

20.6 Combine background and raw data

Exercise 20.14 Use rbind() or bind_rows() to put the two data frames together, then use separate() to split up the var column into probe and rep variables.
Exercise 20.15 Set the probe and rep variables to factors.
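A sketch of both steps, assuming curves has also been gathered into the same long format as bg (columns time, var, value):

```r
library(dplyr)
library(tidyr)

# Stack the two long-format data frames
total <- bind_rows(curves, bg)

# Split names such as "conc0.031_1" or "bg_1" into probe and replicate
total <- separate(total, var, into = c("probe", "rep"), sep = "_")

# Exercise 20.15: treat probe and rep as categorical variables
total$probe <- factor(total$probe)
total$rep   <- factor(total$rep)

dim(total)  # should be 22800, 4
```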

You should now have a data frame called total with dimensions (22800, 4) and variable names time, probe, rep, value. If you had a hard time getting to this point, this data frame has been provided for you as total.txt, which you can import into R.

20.7 Normalise and plot the data

We now have a large data frame with 16 time series: one for each of the 15 concentrations and a reference background. We need to subtract the background at each time point from all 15 concentrations, but before we do that, let’s use some simple visualisations to see what we’re dealing with.

Exercise 20.16 Using base package plotting functions, plot value against time, colouring by probe. To make a proper plot we’d need a legend and we’d have to adjust attributes like colour and coordinates; for our purposes, we’re just going to make a quick and dirty plot for exploratory purposes (see figure 20.1).
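A quick and dirty version in base graphics might look like this (colours are recycled from the default palette and there is no legend):

```r
# value against time, one colour per probe level
plot(total$time, total$value, col = total$probe, pch = 16, cex = 0.3)
```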

Figure 20.1: The plot for the above exercise

Now we’re ready to subtract the background curves from the measured values. This is where the power of long-format data and dplyr comes into action!

Exercise 20.17 Calculate the mean for each probe at each time point. Save this as a new data frame called total.sum and plot it as before (see figure 20.2).
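A sketch using dplyr:

```r
library(dplyr)

# Average the three replicates for every probe at every time point
total.sum <- total %>%
  group_by(probe, time) %>%
  summarise(value = mean(value)) %>%
  ungroup()

plot(total.sum$time, total.sum$value, col = total.sum$probe, pch = 16, cex = 0.3)
```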

Figure 20.2: Exercise figure.

Exercise 20.18 Subtract the background value for each probe.
Exercise 20.19 Plot this as before. You may have to make sure that time is on a continuous scale, depending on how you normalised the data (see figure 20.3).
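One possible approach that stays in long format; it assumes the background probe is named "bg" after the separate() step, and it keeps the (now zero-valued) background rows so they can be removed later:

```r
library(dplyr)

# At every time point, subtract the averaged background trace from every probe
total.norm <- total.sum %>%
  group_by(time) %>%
  mutate(value = value - value[probe == "bg"]) %>%
  ungroup()

# time should still be numeric here; coerce it if your normalisation changed that
plot(total.norm$time, total.norm$value, col = total.norm$probe, pch = 16, cex = 0.3)
```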

Figure 20.3: The plot for exercise above.

20.8 Calculate and plot the depletion curves

Now we are ready to calculate depletion levels. To do this we’ll need to divide the average fluorescence level when the laser is on (29.0–29.9 s) by the average fluorescence level when the laser is off (4.5–5.4 s).

Exercise 20.20 To make the data easier to work with, filter for only the times of interest.
Exercise 20.21 Make another column in the data frame called laser that contains the values “off” and “on” for the appropriate time frames.
Exercise 20.22 Next, take the average measurement when the laser is on and when the laser is off for each probe.
Exercise 20.23 Make a new column called depletion that contains the ratio of the average “on” value divided by the average “off” value.
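A sketch covering Exercises 20.20–20.23 in one pipeline, using an on/off column called laser here and the time windows given above:

```r
library(dplyr)
library(tidyr)

depl <- total.norm %>%
  # keep only the laser-off (4.5-5.4 s) and laser-on (29.0-29.9 s) windows
  filter((time >= 4.5 & time <= 5.4) | (time >= 29.0 & time <= 29.9)) %>%
  # label each window
  mutate(laser = ifelse(time < 10, "off", "on")) %>%
  # average within each window for every probe
  group_by(probe, laser) %>%
  summarise(value = mean(value)) %>%
  ungroup() %>%
  # put "on" and "off" side by side, then take the ratio
  spread(laser, value) %>%
  mutate(depletion = on / off)
```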

Now we have the values for the binding curve (averaged depletion levels against ligand concentrations).

Exercise 20.24 First, remove the background probe since we don’t need that for plotting. Once the background is gone, all of our probes are numerical, which means we can use them as the x axis in our plot. Make sure they are classified as such, and not as characters.

Exercise 20.25 Draw the raw depletion curve: depletion versus probe. Just calling plot() won’t work, because the order is wrong, so you’ll have to arrange the data first (see figure 20.4).
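A sketch covering Exercises 20.24 and 20.25, assuming the probe names look like "conc0.031":

```r
library(dplyr)

binding <- depl %>%
  filter(probe != "bg") %>%                             # drop the background probe
  mutate(conc = as.numeric(sub("^conc", "",
                               as.character(probe)))) %>%  # "conc0.031" -> 0.031
  arrange(conc)                                         # plot() needs ordered x values

plot(binding$conc, binding$depletion, type = "b",
     xlab = "ligand concentration (nM)", ylab = "depletion")
```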

Figure 20.4: Raw depletion curve.

20.9 Transform and plot the depletion curve

Exercise 20.26 Shift the curve so that the baseline at the high ligand concentrations is at the zero level. To do this we'll just normalise so that the minimum value is at zero. Afterwards, draw the normalised depletion curve, and set the x axis to a log scale (figure 20.5).
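A minimal sketch of the shift and the log-scale plot:

```r
# Shift the curve so that its minimum sits at zero
binding$depletion_norm <- binding$depletion - min(binding$depletion)

# Normalised depletion curve on a log10 x axis
plot(binding$conc, binding$depletion_norm, type = "b", log = "x",
     xlab = "ligand concentration (nM)", ylab = "normalised depletion")
```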

The KD is the concentration on the x axis that corresponds to half the maximum value on the y axis.


Figure 20.5: The normalised depletion curve plotted on a log x scale.