# Chapter 4 Case Study: Plant Growth

**Exercise 4.1 (R Syntax)**Start building a script by entering the commands into the text editor and then executing them in the console (press

`ctrl`

+ `enter`

). Commands will appear after the `>`

sign.
`PlantGrowth`

is a built-in data-set of plant dry-weight. If you execute `PlantGrowth`

in the console, you’ll see the entire data-set. The group column is a categorical variable, describing if dry-weight was measured under control (`ctrl`

) or one of two different treatment (`trt1`

and `trt2`

) conditions. This experimental set-up is probably familiar to you, even if you are not studying plant growth.

We already have access to the data set, in the form of a `data frame`

. To make it easier to print to the screen, we’re going to convert it to a *tibble*.

In many situations in statistics, *categorical variables* are called *factors*. Within a factor, each *category*, or *group*, is called a *level*. We can find out how many levels we have and what they are by using the `nlevels()`

and `levels()`

functions on the growth column:

`# [1] 3`

`# [1] "ctrl" "trt1" "trt2"`

Factors are very convenient because we can use them to split other columns. This is a very simple and powerful technique for handling data. To get a sense of the data’s distribution, we can draw a variety of plots, sorting each weight value according to the levels of the growth factor (i.e. the three growth conditions). The result is displayed in figure **??**.

R specializes in handling and visualising data, so making the boxplot is straight-forward in `base`

R. However, there are many new packages which have made things easier and more consistent, like `ggplot2`

, which is included in the `tidyverse`

suite. It offers fine control of making elegent statistical plots.

In the `ggplot()`

function below, `weight`

is `mapped`

on the `x`

scale, and `weight`

is *mapped* onto the `y`

scale. Collocualy, we would way, “*weight* against *group*”, or “*weight* as described by *group*”:

Here is our box plot

```
p +
geom_jitter(width = 0.2, shape = 16, alpha = 0.75) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), col = "red")
```

We can get descriptive statistics on all the values in the `weight`

column,

`# [1] 5.073`

but group-wise statistics are more appropriate. You can imagine that statisticians have been doing such things for a long time. Let’s use some functions that come with the `dplyr`

package, also included in the `tidyverse`

suite.^{6}

```
# # A tibble: 3 x 3
# group avg stdev
# <fct> <dbl> <dbl>
# 1 ctrl 5.03 0.583
# 2 trt1 4.66 0.794
# 3 trt2 5.53 0.443
```

As you can see from the example above, we can just keep adding to the `summarise`

function – that’s really convenient! `%>%`

is called the *pipe operator*, since we can *pipe* many functions together, as shown here. When you read this function in your head, pronounce it as such: "Take `PlantGrowth`

*and then* group it by the `group`

variable *and then* summarize it into two columns: `avg`

, which contains group-wise means and `stdev`

which contains group-wise standard deviations of the `weight`

values. {#sec:pipe}

In the above plot, `stat_summary()`

calculated and plotted the group-wise mean and standard deviation for us. Now that we’ve calculated those values ourselves, let’s plot them directly. First, we need to save the output from above. If we just print it to the screen, there is no object to work on! Here, we’ll use a little trick that works well in combination with the `%>%`

operator: the “reverse assign” operator `->`

. We only use this in the specific case of also using `%>%`

.

```
PlantGrowth %>%
group_by(group) %>%
summarise(avg = mean(weight),
stdev = sd(weight)) -> PlantGrowthSummary
```

Now I have an object that contains only my summary statistics. So let’s plot it!

```
ggplot(PlantGrowthSummary, aes(group, avg)) +
geom_pointrange(aes(ymin = avg - stdev, ymax = avg + stdev))
```

You can probably guess what the new parts of this plotting function are doing. I didn’t have to assign the *base* layers of the plot to an object, `p`

above, because I just used it once and then it was finished.

Notice that `PlantGrowthSummary`

only has three rows, since calculating the mean is an aggregration function, i.e. it *aggergrates* values to (typically) a single value. A Z-score is something quite different. This is a transformation, in which each value is *transformed* in some way (i.e. *normalized*). Again `dplyr`

offers a solution:

```
# # A tibble: 30 x 3
# # Groups: group [3]
# weight group Z_score[,1]
# <dbl> <fct> <dbl>
# 1 4.17 ctrl -1.48
# 2 5.58 ctrl 0.940
# 3 5.18 ctrl 0.254
# 4 6.11 ctrl 1.85
# 5 4.5 ctrl -0.912
# 6 4.61 ctrl -0.724
# 7 5.17 ctrl 0.237
# 8 4.53 ctrl -0.861
# 9 5.33 ctrl 0.511
# 10 5.14 ctrl 0.185
# # … with 20 more rows
```

Applying a linear model is also straight-forward. Consider a one-way ANOVA:

This function calculate an ANOVA table on the linear model of `weight`

described by `group`

and saves it as an object called `Plant.anova`

. If you type `Plant.anova`

into the console, you will get a results table.

Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|

group | 2 | 3.76634 | 1.8831700 | 4.846088 | 0.01591 |

Residuals | 27 | 10.49209 | 0.3885959 | NA | NA |

Since F (4.85) is significant (p-value = 0.016) at the 5% level on the F(2, 27) distribution, we can reject the null hypothesis. This short workflow has already demonstrated R’s three key strengths: data manipulation, statistical modelling, and data visualisation.

## 4.1 Saving your Workspace

The R workspace refers to all objects you have defined in your current session. Functions for saving and loading the workspace, referred to as an *image*, are given in the following table.

Area | Function | Description |
---|---|---|

Workspace | `ls()` |
List all objects in the workspace. |

`rm(list=ls())` |
Delete all objects in the workspace. | |

`save.image()` |
Save the whole workspace image to a file. | |

Commands | `history(n)` |
Display the last n entries in the history. |

`savehistory()` |
Save history as a file. | |

`loadhistory()` |
Load a file as history. | |

Objects | `saveRDS()` |
Save a single object to a file. |

`save()` |
Save multiple objects to a file. | |

`load()` |
Load a workspace image or R object. | |

`dput()` |
Save an ASCII version of an R object. | |

`dget()` |
Load an ASCII version of an R object. |

When saving your workspace, only objects in your environment are saved. The working directory, open graphics windows or loaded libraries (i.e. the current session state) are saved.

When you exit R, you will be asked to save the workspace image (i.e. environment) as `.RData`

. **Don’t do this!**

The best practice is to start fresh with each R session and use well-commented scripts or markdown documents for managing your data analysis workflow. You may forget what you were working on by the next time you start R and, in the worst-case-scenario, you could have objects hanging around your environment which you’re not even sure where they came from.

You’ll probably encounter some of the following functions when looking for help online or if you inherit scripts from colleagues. They are outdated and should be avoided!

Area | Function | Description |
---|---|---|

Working Directory | `getwd()` |
Check the working directory. |

`setwd()` |
Change the working directory. | |

Objects | `attach()` |
“Attach” an object to the environment. |

`detact()` |
“Detach” an object to the environment. |

You’ll still see lots of old-school methods, namely the

`apply`

family of functions, from your colleagues and in online help forums.↩︎