Chapter 5 Element 1: Reproducible Research

5.1 Learning Objectives

All the steps of your data analysis need to be documented in a human-readable format for transparent and reproducible research. In this section you’ll learn about:

  • R Markdown
  • Generating Word, HTML, and PDF documents

Reproducible research refers to academic research papers containing the full computational environment used to produce the results. This includes not only commentary, but also the data and code that can be used to reproduce the results and create new work based on the research.

The goal of reproducible research is to document your data analysis, making it easier for others to see exactly what you’ve done and to reproduce it.[1]

5.2 Scripting

In the previous section we conducted our analysis by generating a script. That’s simply a text document with a series of commands. The script acts like a function: it’s a piece of standard code which can be reused. Writing scripts serves two purposes. First, it allows your data analysis to be transparent and reproducible, and second, it allows you to save time by not repeating the same steps over and over again.

To make your scripts more flexible, especially when you will run the same script on different input files, refer to the functions in Table 5.1.

Table 5.1: Functions for scripting.

| Function | Description |
|---|---|
| source() | Run a series of commands contained in a plain text file as an R script. |
| list.files() | List all files in the working directory. |
| file.choose() | Ask user to choose a file through a graphical interface. |
| select.list() | Ask user to choose a value from a list. |
| sink() | Direct screen output to a file. |
| sink(file=NULL) | Redirect output to the screen instead of a file. |
| png() | Direct graphical output to a .png file. |
| pdf() | Direct graphical output to a .pdf file. |
| dev.off() | Stop directing graphical output to a file. |
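As a minimal sketch, several of the functions in Table 5.1 can be combined in a single script. The directory and file names below are assumptions made for illustration, and the built-in chickwts data set stands in for your own input. The sketch uses pdf() rather than png(), since the PDF device is available on every R installation.

```r
# Work in a temporary directory so the sketch leaves no files behind
# (directory and file names are illustrative assumptions).
out_dir <- file.path(tempdir(), "scripting_demo")
dir.create(out_dir, showWarnings = FALSE)

# source(): run the commands stored in a plain text file.
script_file <- file.path(out_dir, "analysis.R")
writeLines("mean_weight <- mean(chickwts$weight)", script_file)
source(script_file)

# sink(): divert printed output to a text file, then restore the screen.
log_file <- file.path(out_dir, "summary.txt")
sink(log_file)
print(summary(chickwts$weight))
sink()  # same effect as sink(file = NULL)

# pdf() ... dev.off(): divert graphical output to a file.
plot_file <- file.path(out_dir, "weights.pdf")
pdf(plot_file)
boxplot(weight ~ feed, data = chickwts)
invisible(dev.off())

# list.files(): check what the script produced.
produced <- list.files(out_dir)
print(produced)
```

Forgetting the closing sink() or dev.off() is a common mistake: output keeps going to the file until the diversion is switched off.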

5.3 R Markdown and the knitr Package

The gold standard for reproducible research[2] is interweaving the raw data, the analysis and your interpretation of the data into a single document.

Three tools are necessary for reproducible research of this type. First, R provides the foundation for analysing and visualising your data. In this way, all steps of the analysis are documented. Second, markdown, a lightweight markup language, is used for typesetting documents. For our purposes, we will use R Markdown, which serves as a gentle introduction to the topic. Third, the knitr package allows you to “knit” R code and its output together with text descriptions.

5.4 R Markdown

Exercise 5.1 (R Markdown) In the repository you'll find an HTML file that was compiled from an R markdown document (described below) using the commands from the previous chapter with the chickwt data-set. Your task is to reproduce this document using your own R Markdown file.

There are three components to an R markdown file:

  1. The YAML header
  2. Non-code commentary
  3. Code chunks
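To see how the three components fit together, here is a minimal sketch that writes a complete, if tiny, R Markdown file from R. The file name and the contents are assumptions made for illustration; each component is described in the sections that follow.

```r
# Assemble a minimal R Markdown file from its three components.
# File name and contents are illustrative assumptions.
fence <- strrep("`", 3)  # three backticks open and close a code chunk

yaml_header <- c("---", "output: html_document", "---")
commentary  <- c("", "# Chick weights", "",
                 "Some *commentary* about the analysis.", "")
code_chunk  <- c(paste0(fence, "{r}"), "summary(chickwts$weight)", fence)

rmd_file <- file.path(tempdir(), "minimal.Rmd")
writeLines(c(yaml_header, commentary, code_chunk), rmd_file)

# Inspect the result.
cat(readLines(rmd_file), sep = "\n")
```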

5.5 The YAML Header

Every R Markdown file should begin with a YAML header, which specifies how the document should be compiled.[3] The minimum YAML header contains:[4]

---
output: html_document
---

This simply means we want an HTML document (i.e. a stand-alone web-page), but you can also specify word_document or pdf_document[5] instead. In addition, you can include more information in the header, e.g.:

---
title: "Chick Weight Analysis"
author: "Rick Scavetta"
date: "03/09/2018"
output: html_document
---

These additional fields specify contents in the header of the output document, regardless of the output type (Word, HTML, PDF) specified.

Exercise 5.2 (R Syntax) Open a new blank document in RStudio and change the type in the lower right corner to R Markdown. Begin your document by typing in your YAML header.

5.6 Non-code Commentary

This is where markdown comes into play. Markdown is a simplified markup language. HTML and LaTeX, for example, are markup languages which can be tedious to write or difficult to learn. In contrast, markdown is easy both to write and to learn! Compare some common commands in Table 5.2.

Table 5.2: Common LaTeX and markdown formatting commands. $...$ denotes “math mode”, where you can enter equations in place of ...; this is the same in LaTeX as in markdown.

| LaTeX | Markdown | Result |
|---|---|---|
| \textit{} | *italics* or _italics_ | italics |
| \textbf{} | **bold** or __bold__ | bold |
| $E = mc^2$ | $E = mc^2$ | \(E = mc^{2}\) |
| $CO_2$ | $CO_2$ | \(CO_2\) |
| \\ | 2 spaces at the end of a line | Start a new line. |

Some commands are actually the same, but because LaTeX has so many more options, it can be overwhelming to learn. For example, a bulleted list such as:

  • item 1
  • item 2
  • item 3

is pretty straight-forward in markdown:

- item 1
- item 2
- item 3

But in LaTeX it would be written as:

\begin{itemize}
  \item item 1
  \item item 2
  \item item 3
\end{itemize}

In addition, you can use # to add section headings, where the number of # characters denotes the level of the heading. Note the difference from # as a comment character in R!

If you can’t figure out how to do something in markdown, you can still use LaTeX, but you shouldn’t need to. If you want to use only LaTeX, that is also possible. This book, for example, was written with bookdown, an R package that builds on R Markdown.

Exercise 5.3 (Headers) Add section headers and some text to your document as per the template document.

5.7 Code Chunks

R code appears in code chunks. All code chunks have the basic structure:[6]

```{r}
```

This is like a signal to tell the compiler that the non-code commentary is over, and this part should be processed as R code.[7]

When the document is compiled, each chunk is executed sequentially. For example:

```{r}
log2(8)
```

will produce:

# [1] 3

as output after compilation. There are a variety of chunk options, which control how each chunk will be handled. Table 5.3 lists some of the most common chunk options.

The chunk name is the first argument, and is not explicitly named. Character means any alpha-numeric combination. Logical means either TRUE or FALSE (see table 7.1).

Table 5.3: Most commonly used chunk options.

| Option | Type | Description |
|---|---|---|
| Position 1 | Unquoted character | Name of the chunk |
| echo | Logical | Display the code |
| eval | Logical | Execute the code |
| cache | Logical | Cache the results |
| message | Logical | Show regular messages |
| warning | Logical | Show warning messages |
| error | Logical | Show error messages |

For example, the following chunk is called calcLog and will show only the output:

```{r calcLog, echo = FALSE, eval = TRUE}
log2(8)
```

This chunk will only show the code, but won’t calculate anything:

```{r showLog, echo = TRUE, eval = FALSE}
log2(8)
```
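The effect of echo and eval can be checked from the console without creating a file. The sketch below assumes the knitr package is installed and uses knitr::knit() with its text argument, which returns the knitted markdown as a character string. The helper knit_chunk() is hypothetical and exists only for this illustration.

```r
# Requires the knitr package.
library(knitr)

fence <- strrep("`", 3)  # three backticks open and close a code chunk

# Hypothetical helper for this sketch: wrap log2(8) in a chunk with the
# given options and knit it to a markdown string.
knit_chunk <- function(options) {
  chunk <- c(paste0(fence, "{r, ", options, "}"), "log2(8)", fence)
  knit(text = chunk, quiet = TRUE)
}

output_only <- knit_chunk("echo = FALSE, eval = TRUE")
code_only   <- knit_chunk("echo = TRUE, eval = FALSE")

cat(output_only)  # contains the result, but not the call
cat(code_only)    # contains the call, but not the result
```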

Chunk options may be defined globally, which means that all chunks will have the same options set. This is done by calling

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

in your first chunk. This calls the function opts_chunk$set() within (i.e. ::) the knitr package and sets the argument echo to FALSE. That means for every chunk, the code will not be shown. Now

```{r}
log2(8)
```

will just produce the output and not show the code:

# [1] 3
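Global options can also be inspected and changed directly from the console. This is a minimal sketch, assuming the knitr package is installed; it reads the default value of echo, changes it as the setup chunk does, and then restores knitr's defaults.

```r
# Requires the knitr package.
library(knitr)

default_echo <- opts_chunk$get("echo")  # value before any change
opts_chunk$set(echo = FALSE)            # what the setup chunk does
global_echo <- opts_chunk$get("echo")   # value seen by later chunks

print(c(default = default_echo, global = global_echo))

# Undo the change so later code sees knitr's defaults again.
opts_chunk$restore()
```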

Chunk options can also be set locally. Local chunk options are specified in the chunk itself and override the global settings. For example

```{r, echo = FALSE}
log2(8)
```

will hide the commands, even if the global option is set to TRUE.
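The interplay of global and local options can be sketched in the opposite direction as well: a global echo = FALSE that one chunk overrides with a local echo = TRUE. The example assumes the knitr package is installed and builds a small document in memory; the second calculation, log2(16), is added only so the two chunks can be told apart in the output.

```r
# Requires the knitr package.
library(knitr)

fence <- strrep("`", 3)  # three backticks open and close a code chunk

doc <- c(
  paste0(fence, "{r setup, include = FALSE}"),
  "knitr::opts_chunk$set(echo = FALSE)",  # global setting
  fence,
  "",
  paste0(fence, "{r}"),                   # inherits echo = FALSE
  "log2(8)",
  fence,
  "",
  paste0(fence, "{r, echo = TRUE}"),      # local option wins
  "log2(16)",
  fence
)

knitted <- knit(text = doc, quiet = TRUE)
cat(knitted)
```

In the knitted text, both results appear, but only the call to log2(16) is shown as code.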

5.8 In-line Code

Code chunks, as defined above, are set off from the surrounding text by a blank line before and after the code. If you want to integrate in-line code, i.e. in the middle of a sentence, you can use the shortcut

`r ...`

replacing the … with your R code. This works best when the output is a vector (see the section on vectors).
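A minimal sketch of in-line code at work, assuming the knitr package is installed. The variable x and the sentence are made up for illustration. A hidden chunk defines x, and the in-line expression is replaced by its value when the text is knitted.

```r
# Requires the knitr package.
library(knitr)

fence <- strrep("`", 3)  # three backticks open and close a code chunk

doc <- c(
  paste0(fence, "{r, include = FALSE}"),
  "x <- c(2, 4, 6)",
  fence,
  "",
  "The mean of x is `r mean(x)`."
)

result <- knit(text = doc, quiet = TRUE)
cat(result)
```

Because the value is computed at compile time, the sentence stays correct whenever the data change.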

5.9 Tables

There are several functions that can convert data sets or the results of statistical tests into nicely formatted tables. The style sometimes depends on the output type you’ve specified in the YAML header. For LaTeX, I prefer xtable, which allows for subtle control. For markdown, the easiest entry points are the pander() function in the pander package, or the kable() function in the knitr package. For HTML, the DT package is the most flexible option.

Sometimes you’ll want to generate a table manually. For this, tablesgenerator.com allows you to format text into LaTeX or markdown: just copy and paste the results into your document.
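As a minimal sketch of the kable() route, assuming the knitr package is installed: the example summarises the built-in chickwts data set, which is assumed here to be the chick weight data used in this chapter, and prints the result as a markdown pipe table. The column names are chosen for illustration.

```r
# Requires the knitr package.
library(knitr)

# Mean weight per feed type from the built-in chickwts data set.
mean_by_feed <- aggregate(weight ~ feed, data = chickwts, FUN = mean)

# format = "pipe" gives a markdown pipe table; digits rounds the means.
tab <- kable(mean_by_feed, format = "pipe", digits = 1,
             col.names = c("Feed", "Mean weight"))
print(tab)
```

Inside a code chunk the format argument can usually be left out, since kable() picks a format that suits the output document.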

Exercise 5.4 (Produce a document) Add and complete code chunks to reproduce the document provided during the workshop.

5.10 The knitr package

R Markdown files are saved using the .Rmd extension.[8] knitr (think “knit-R”)[9] is the package which assembles the three parts of an R Markdown document. Running knitr::knit() on your file executes the code chunks and writes a plain markdown file; rmarkdown::render() goes one step further and converts the result to the output format named in the YAML header. The easiest and most common way, however, is to simply use the “knit” button in RStudio.
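A minimal sketch of compiling a file from the console instead of with the button. It assumes that the rmarkdown package and pandoc are installed, and it skips the rendering step if they are not. rmarkdown::render() runs knitr and then converts the result to the format named in the YAML header. The file name is hypothetical.

```r
# Write a tiny R Markdown file to a temporary location
# (the file name is an illustrative assumption).
fence <- strrep("`", 3)  # three backticks open and close a code chunk
rmd_file <- file.path(tempdir(), "render_demo.Rmd")
writeLines(c("---", "output: html_document", "---", "",
             paste0(fence, "{r}"), "log2(8)", fence), rmd_file)

# Rendering needs both the rmarkdown package and pandoc.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # render() returns the path of the compiled output file.
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)
} else {
  message("rmarkdown or pandoc not available; skipping render")
}
```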

As a last note, you can produce an output file from a regular R script using knitr::spin() or using the “notebook” button in RStudio.
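A minimal sketch of knitr::spin(), assuming the knitr package is installed. The script content is made up for illustration. Lines starting with #' are treated as commentary and everything else as code; with knit = FALSE, spin() only converts the script to R Markdown and does not compile it.

```r
# Requires the knitr package.
library(knitr)

# A regular R script, given here as text rather than as a file.
script <- c(
  "#' # Chick weights",
  "#' Lines starting with #' become commentary.",
  "summary(chickwts$weight)"
)

# Convert the script to R Markdown without compiling it.
rmd <- spin(text = script, knit = FALSE)
cat(rmd, sep = "\n")
```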

The following web-sites provide a good starting point for using knitr and R Markdown:

  • knitr: Elegant, flexible and fast dynamic report generation with R
  • Using R Markdown with RStudio

  1. As the number of biological studies using large data-sets continues to rise, not only is transparent and reproducible data analysis increasingly important, but it is also becoming obligatory.

  2. For example, this reference book is written in markdown and rendered in HTML so that it can be viewed as an interactive web-page.

  3. By compiled we mean the merging of non-code commentary and code chunks in a single output file.

  4. The name YAML is actually meaningless. Originally it was an acronym for Yet Another Markup Language, but it then became the recursive acronym YAML Ain’t Markup Language, to distinguish its purpose as data-oriented, as opposed to document markup.

  5. You will need to have LaTeX installed to compile markdown to a PDF document.

  6. R code is sandwiched between these codes.

  7. The r inside the opening of the code chunk is necessary, since it’s possible to use other languages within the same document.

  8. If using LaTeX, it will be an R Sweave file with a .Rnw extension.

  9. knitr is a newer and easier-to-use version of Sweave, which has been around since the early days of R. S is the language from which R is derived, hence we have S-weave and knit-R.