Chapter 22 Case Study: Peak-calling in a BAM File
22.1 Learning Objectives
By the end of this chapter you should be able to:
- Read in insert sizes from a BAM file.
- Identify features and create a generalised function.
22.2 Distribution of insert sizes
In this case study we’ll investigate the size distribution of sequenced RNA fragments using a BAM file as input. Each read is represented as a separate entry with information on its mapped genomic coordinates, name and position of the mate (a read coming from the opposite side of the DNA fragment) and inferred size of the entire fragment. We are only interested in the size (“isize”, standing for “insert size”).
We will create a generic plotting function to help us analyse this data.
22.3 Read in the data
To read in a BAM file, we’ll use the Rsamtools
package, which is part of the large repository of bioinformatics-related packages stored in the Bioconductor repository. Execute these commands to install the package, then initialize it in the usual way, with library()
.
Rsamtools::scanBam()
allows us to specify what parameters we are interested in, so that we don’t have to read in the whole BAM file, which is very large. The sample.bam
file is available for download here. Download the file into a new project directory and execute the following commands to read in the file.
# Define the parameters to import
parameters <- ScanBamParam(what = "isize")
# Load only the specified parameters from the BAM file
data <- scanBam("sample.bam", param = parameters)
as.vector()
.
In this data set there are 60336066 raw observations, 30172861 are less than 0 and 30163205 are greater than or equal to 0.
raw_size
for only the sizes which are above zero. Confirm that it is the expected length (see above).
Exercise 22.3 The plyr::count()
function summarises the unique values in a vector (or data frame). Use this function on size
to get the size distribution and save this as an object called dist
.
Note the distinction to dplyr
! If you load both dplyr
and the older plyr
in the same environment, you may encounter conflicts. It’s best to just access specific useful functions in plyr
using the ::
notation.
dist
is a data.frame
with two columns: x and freq. These will be automatically taken as the input for the x and y scales when calling the base package plot()
function, as shown in fig. ??.
Exercise 22.5 Define a new custom function (myPlot()
) that takes three arguments (1 essential, and 2 optional): x, xRange = c(0,1000), lines = NULL
. Your function should do three things:
- Filter
x
according to the providedxRange
. We will usedist
forx
when we call the function. - Plot
x
according to the plot type specified intype
(e.g. “peaks”, “troughs”). - Add vertical lines and labels for their positions if a vector is provided under the
lines
argument.
Once you have established your custom function. You should be able to produce the following plots.