Chapter 2 Introduction

2.1 Learning Objectives

Learning a programming language is like learning any new language. Here, our goal is to:

  • develop a solid understanding of the basic grammar and vocabulary, and
  • develop your skills in a goal-oriented way, i.e. by understanding the question your analysis needs to answer.

Before starting, it is worth considering what we mean by data analysis.

2.2 What is Data Analysis?

Biology is no longer a “soft science”. Given the sheer amount of data, and the ease with which it is obtained, bench biologists must become proficient at analysing their own data.

We define data analysis as extracting knowledge from information.

Although information is not knowledge, knowledge can be obtained from information through data analysis. Scientists must not only ask the right questions, but they also need the tools to answer those questions. Importantly, familiarity with data analysis tools allows scientists to ask new and interesting questions, because they know how to answer them.

2.3 What is Reproducible Research?

In its simplest form, reproducible research means that your data analysis workflow can be repeated by other scientists. This relies on using transparent data analysis methods so that your workflow can be reproduced, understood and verified.[1] R is a powerful tool for reproducible research because it is text-based, so the exact steps in your workflow can be documented with comments. We’ll begin the workshop with a simple scripting case study and demonstrate how to make our workflow available to others (see section 5).
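As a minimal sketch of what a text-based, commented workflow might look like, the short script below records each step alongside an explanation. The measurements and the object names (`weights`, `summary_stats`) are invented for illustration and are not part of the workshop material.

```r
# Hypothetical example: every step of the analysis is written down,
# so anyone can rerun the script and repeat the same workflow.

# Step 1: enter the raw measurements (made-up values, in grams)
weights <- c(4.2, 5.1, 4.8, 5.5, 4.9)

# Step 2: summarise the measurements with the mean and standard deviation
summary_stats <- c(mean = mean(weights), sd = sd(weights))

# Step 3: report the summary, rounded to two decimal places
print(round(summary_stats, 2))
```

Because the script is plain text, it can be shared, rerun and checked line by line, which is much harder to do with a sequence of mouse clicks.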

2.4 I already use Excel, why should I use R?

Microsoft Excel is suitable for analysing small data sets, using simple statistics, and for formatting page layouts. However, Excel has many shortcomings when dealing with large data sets (either many observations or many variables) or even small data sets that must be sorted, filtered, grouped or compared. Many of the data analysis techniques you will learn in this workshop would be very difficult, if not impossible, to reproduce in Excel. That being said, if you need to carry out a quick-and-dirty analysis and you can do it easily in Excel, go for it! Use the tool that gets you to a result fastest. Hopefully, you’ll realise that for more complex analyses, that tool will not be Excel.
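To give a rough idea of what filtering, sorting, grouping and comparing might look like in R, here is a small sketch using only base R. The data frame `expr`, its columns and the threshold of 5 are all hypothetical, chosen purely for illustration.

```r
# Hypothetical data: measurements of an invented gene in two genotypes
expr <- data.frame(
  genotype = c("wt", "wt", "wt", "mut", "mut", "mut"),
  value    = c(10.2, 9.8, 10.5, 14.1, 13.7, 2.0)
)

# Filter: keep only observations above an arbitrary threshold of 5
filtered <- expr[expr$value > 5, ]

# Sort: order the remaining rows from the largest to the smallest value
sorted <- filtered[order(filtered$value, decreasing = TRUE), ]

# Group and compare: calculate the mean value for each genotype
group_means <- aggregate(value ~ genotype, data = sorted, FUN = mean)
print(group_means)
```

Each operation is a single, readable line that can be rerun unchanged on a data set with six rows or six million.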

2.5 The Elements of Data Analysis

The focus of this workshop is to teach R to bench biologists by getting them to think like computational biologists. The main reason for this approach is that bench biologists and computational biologists often seem to speak different languages. By learning some of the tools that computational biologists use, you can bridge this communication gap while at the same time empowering yourself to do your own analysis.

The 10 key elements of this Data Analysis workshop are:

Part 1 The Basic Vocabulary and Grammar

  1. Reproducible Research
  2. Functions
  3. Objects

Part 2 Building Sentences

  1. Logical Expressions
  2. Indexing

Part 3 Paragraphs

  1. Factor Variables
  2. Tidy Data
  3. Split-Apply-Combine

Part 4 The Details

  1. Regular Expressions
  2. Control Structures

[1] Scripting is the first step towards fully-fledged Reproducible Research, which integrates reporting and analysis. For a thorough list of resources see the Reproducible Research task view on CRAN and tips for getting started in section 5.