Chapter 4 Intro Case Study
ctrl
+ enter
). Commands will appear after the >
sign.
PlantGrowth
PlantGrowth
is a built-in data-set of plant dry-weight. If you execute PlantGrowth
in the console, you’ll see the entire data-set. The group column is a categorical variable, describing if dry-weight was measured under control (ctrl
) or one of two different treatment (trt1
and trt2
) conditions. This experimental set-up is probably familiar to you, even if you are not studying plant growth.
We already have access to the data set, in the form of a data frame
. To make it easier to print to the screen, we’re going to convert it to a tibble.
<- as_tibble(PlantGrowth) PlantGrowth
In many situations in statistics, categorical variables are called factors. Within a factor, each category, or group, is called a level. We can find out how many levels we have and what they are by using the nlevels()
and levels()
functions on the growth column:
nlevels(PlantGrowth$group)
## [1] 3
levels(PlantGrowth$group)
## [1] "ctrl" "trt1" "trt2"
Factors are very convenient because we can use them to split other columns. This is a very simple and powerful technique for handling data. To get a sense of the data’s distribution, we can draw a variety of plots, sorting each weight value according to the levels of the growth factor (i.e. the three growth conditions). The result is displayed in figure ??.
R specializes in handling and visualising data, so making the boxplot is straight-forward in base
R. However, there are many new packages which have made things easier and more consistent, like ggplot2
, which is included in the tidyverse
suite. It offers fine control of making elegent statistical plots.
In the ggplot()
function below, weight
is mapped
on the x
scale, and weight
is mapped onto the y
scale. Colloquially, we would way, “weight against group,” or “weight as described by group”:
<- ggplot(PlantGrowth, aes(group, weight)) p
Here is our box plot
+
p geom_boxplot()
+
p geom_jitter(width = 0.2, shape = 16, alpha = 0.75) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), col = "red")
## Warning: Computation failed in `stat_summary()`:
## The `Hmisc` package is required.
We can get descriptive statistics on all the values in the weight
column,
mean(PlantGrowth$weight)
## [1] 5.073
but group-wise statistics are more appropriate. You can imagine that statisticians have been doing such things for a long time. Let’s use some functions that come with the dplyr
package, also included in the tidyverse
suite.5
%>%
PlantGrowth group_by(group) %>%
summarise(avg = mean(weight),
stdev = sd(weight))
## # A tibble: 3 × 3
## group avg stdev
## <fct> <dbl> <dbl>
## 1 ctrl 5.03 0.583
## 2 trt1 4.66 0.794
## 3 trt2 5.53 0.443
As you can see from the example above, we can just keep adding to the summarise
function – that’s really convenient! %>%
is called the pipe operator, since we can pipe many functions together, as shown here. When you read this function in your head, pronounce it as such: "Take PlantGrowth
and then group it by the group
variable and then summarize it into two columns: avg
, which contains group-wise means and stdev
which contains group-wise standard deviations of the weight
values. {#sec:pipe}
In the above plot, stat_summary()
calculated and plotted the group-wise mean and standard deviation for us. Now that we’ve calculated those values ourselves, let’s plot them directly. First, we need to save the output from above. If we just print it to the screen, there is no object to work on! Here, we’ll use a little trick that works well in combination with the %>%
operator: the “reverse assign” operator ->
. We only use this in the specific case of also using %>%
.
%>%
PlantGrowth group_by(group) %>%
summarise(avg = mean(weight),
stdev = sd(weight)) -> PlantGrowthSummary
Now I have an object that contains only my summary statistics. So let’s plot it!
ggplot(PlantGrowthSummary, aes(group, avg)) +
geom_pointrange(aes(ymin = avg - stdev, ymax = avg + stdev))
You can probably guess what the new parts of this plotting function are doing. I didn’t have to assign the base layers of the plot to an object, p
above, because I just used it once and then it was finished.
Notice that PlantGrowthSummary
only has three rows, since calculating the mean is an aggregration function, i.e. it aggergrates values to (typically) a single value. A Z-score is something quite different. This is a transformation, in which each value is transformed in some way (i.e. normalized). Again dplyr
offers a solution:
%>%
PlantGrowth group_by(group) %>%
mutate(Z_score = scale(weight))
## # A tibble: 30 × 3
## # Groups: group [3]
## weight group Z_score[,1]
## <dbl> <fct> <dbl>
## 1 4.17 ctrl -1.48
## 2 5.58 ctrl 0.940
## 3 5.18 ctrl 0.254
## 4 6.11 ctrl 1.85
## 5 4.5 ctrl -0.912
## 6 4.61 ctrl -0.724
## 7 5.17 ctrl 0.237
## 8 4.53 ctrl -0.861
## 9 5.33 ctrl 0.511
## 10 5.14 ctrl 0.185
## # … with 20 more rows
Applying a linear model is also straight-forward. Consider a one-way ANOVA:
<- lm(weight ~ group, data=PlantGrowth)
Plant.lm <- anova(Plant.lm) Plant.anova
This function calculate an ANOVA table on the linear model of weight
described by group
and saves it as an object called Plant.anova
. If you type Plant.anova
into the console, you will get a results table.
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
group | 2 | 3.76634 | 1.8831700 | 4.846088 | 0.01591 |
Residuals | 27 | 10.49209 | 0.3885959 | NA | NA |
Since F (4.85) is significant (p-value = 0.016) at the 5% level on the F(2, 27) distribution, we can reject the null hypothesis. This short workflow has already demonstrated R’s three key strengths: data manipulation, statistical modelling, and data visualisation.
4.1 Saving your Workspace
The R workspace refers to all objects you have defined in your current session. Functions for saving and loading the workspace, referred to as an image, are given in the following table.
Area | Function | Description |
---|---|---|
Workspace | ls() |
List all objects in the workspace. |
rm(list=ls()) |
Delete all objects in the workspace. | |
save.image() |
Save the whole workspace image to a file. | |
Commands | history(n) |
Display the last n entries in the history. |
savehistory() |
Save history as a file. | |
loadhistory() |
Load a file as history. | |
Objects | saveRDS() |
Save a single object to a file. |
save() |
Save multiple objects to a file. | |
load() |
Load a workspace image or R object. | |
dput() |
Save an ASCII version of an R object. | |
dget() |
Load an ASCII version of an R object. |
When saving your workspace, only objects in your environment are saved. The working directory, open graphics windows or loaded libraries (i.e. the current session state) are saved.
When you exit R, you will be asked to save the workspace image (i.e. environment) as .RData
. Don’t do this!. If you already exited RStudio and did happen to save the workspace image, the easiest thing to do is open the RStudio project but selecting it from the recent projects in the project pull down menu in the upper right corner of the IDE inteface or by opening (i.e. double-clicking) on the *.Rproj file in your project folder. If you have objects in your environment and haven’t changed any of the default settings in RStudio, it came from the .Rdata
. You may not see this file in your OS file listings unless you change some setting, since it’s a hidden file (we can tell from the name, hidden files begin with a .
) But it should be visible in the RStudio Files pane (lower-right). Select this file by clicking on the adjacent check-box, and delete it by clicking on Delete in the menu bar at the top of the Files pane.
The best practice is to start fresh with each R session and use well-commented scripts or markdown documents for managing your data analysis workflow. You may forget what you were working on by the next time you start R and, in the worst-case-scenario, you could have objects hanging around your environment which you’re not even sure where they came from.
You’ll probably encounter some of the following functions when looking for help online or if you inherit scripts from colleagues. They are outdated and should be avoided!
Area | Function | Description |
---|---|---|
Working Directory | getwd() |
Check the working directory. |
setwd() |
Change the working directory. | |
Objects | attach() |
“Attach” an object to the environment. |
detact() |
“Detach” an object to the environment. |
You’ll still see lots of old-school methods, namely the
apply
family of functions, from your colleagues and in online help forums.↩︎