Chapter 17 `Protein.df` Revisited

In the last exercises, including the previous chapter using tidy data, we worked on the protein.df data frame as a long list, not as actual tidy data. Let’s take a look at how we can maximize the use of our tidyverse functions.

17.1 Import and Clean Data

Exercise 17.1 (Import data) Fill in the blanks to:

Import in “Protein.txt”,
Convert it to a tibble,
Remove contaminants, and
Assign it to the object protein.df.

Use this as a template:

read.delim("___", stringsAsFactors = FALSE) %>% 
  as_tibble() %>% 
  filter(___ != ___) -> protein.df

17.2 Process Intensities

Exercise 17.2 Using the dplyr functions, fill in the blanks to:

Transform all Intensity.* columns to log10. (mutate_at())
Do additions of the H+M and M+L columns as we have done previously, save the new columns as H.M, M.L (mutate())
Select only the Uniprot, H.M and M.L. (select())
Make a tidy data set with three columns Uniprot, Ratio and Intensity (gather())
Assign this to an object called onlyInt

Use this as a template:

protein.df %>% 
  mutate_at(vars(starts_with(___)), ___) %>%    # Calculate log10
  mutate(H.M = ___ + ___,                       # Add log10s and rename columns
         M.L = ___ + ___) %>% 
  select(___, ___, ___) %>%                     # Take columns of interest
  gather(___, ___, -___) -> ___                 # gather and save

17.3 Process Ratios

Exercise 17.3 Using the dplyr functions, fill in the blanks to:

Using the dplyr functions, select Uniprot and all columns that begin with Rat, but that do not end in Sig. (select())
Make a tidy data set with three columns Uniprot, Ratio and Expression (gather())
Remove all observations where Ratio is Ratio.H.L. (filter())
Rename the levels in Ratio to be M.L and H.M. (mutate() and recode_factor())
Group according to Ratio.
Apply a transformation to log2 transform all Expression values and then shift all values to be centered on zero. (mutate()).
Assign this to onlyRatios.

Use this as a template:

protein.df %>% 
  select(Uniprot, starts_with(___), -ends_with(___)) %>%        # Calculate log2
  gather(___, ___, -___) %>%                                    # Gather
  filter(___ != ___) %>%                                        # Remove uninteresting H.L Ratio
  mutate(Ratio = recode_factor(___,                             # Relabel ratios to match the Int data frame
                               `___` = ___,
                               `___` = ___)) %>% 
  group_by(___) %>%                                             # group according to ratios (2 groups)
  mutate(Expression = log2(___),                                # log2 transform
         Expression = ___ - mean(___, na.rm = T)) -> ___       # Apply shift

17.4 Process Significance Values and Merge

Exercise 17.4 Using the dplyr functions, fill in the blanks to:

Select Uniprot and all columns that end in Sig. (select())
Make a tidy data set with three columns Uniprot, Ratio and Significance (gather())
Remove all observations where Ratio is Ratio.H.L. (filter())
Rename the levels in Ratio to be M.L and H.M. (mutate() and recode_factor())
In the same function, make a new variable SigCat that cuts up the Significance variable into groups according to c(-Inf, 1e-11, 1e-4, 0.05, Inf) and labels them as c("<1e-11", "<0.0001", "<0.05", "NS").
Merge all this with the onlyRatios data frame (full_join())
Merge all this with the onlyInt data frame (full_join())
Remove any incomplete observations (i.e. with an NA anywhere) and where Uniprot is empty. (filter() and complete.cases(.))
Arrange in descending order of Significance. (arrange() and desc())
Assin to the object allData

Use this as a template:

protein.df %>% 
  select(___, ends_with(___)) %>%                              # Take columns of interest
  gather(___, ___, -___) %>%                                   # Gather
  filter(___ != ___) %>%                                       # Remove uninteresting H.L Ratio
  mutate(Ratio = recode_factor(___,                            # Relabel ratios to match the Int data frame
                               `___` = ___,
                               `___` = ___),
         SigCat = cut(___,                                     # Make colour labels for sig values
                      c(-Inf, 1e-11, 1e-4, 0.05, Inf),
                      c("<1e-11", "<0.0001", "<0.05", "NS"))) %>%
  full_join(___) %>%                                           # Merge with the log2 ratios
  full_join(___) %>%                                           # Merge with the Intensities
  filter(complete.cases(.), ___ != ___) %>%                    # Take only observations that have complete data and non-empty Uniprot
  arrange(___(___)) -> ___                                     # Order according to sig so that low sig are plotted first

17.5 Make a plot

Fill in the details to produce the following plot:

ggplot(______, aes(______, ______, col = ______)) +
  geom_point(alpha = 0.5, shape = 16) +
  scale_colour_manual(limits = c("<1e-11", "<0.0001", "<0.05", "NS"),
                      values = c("red", "orange", "blue", "grey30")) +
  facet_grid(. ~ ______) +
  labs(x = "Log2 Ratios", y = "Intensity", col = "p-value")

Chapter 17 Protein.df Revisited

17.1 Import and Clean Data

17.2 Process Intensities

17.3 Process Ratios

17.4 Process Significance Values and Merge

17.5 Make a plot

Chapter 17 `Protein.df` Revisited