Chapter 10 Element 5: Indexing
10.1 Learning Objectives
By the end of this chapter you should be familiar with:
- Indexing using [], and
- Base package plotting functions
We already encountered a type of indexing when we used the subset() function and logical expressions. Here, we will go one step further and perform indexing without using any unnecessary functions. We can filter our data set according to:
- Equations (i.e. using logical expressions, see section 9),
- Position (i.e. using row & column number, discussed here), or
- Text matching (i.e. using regular expressions).
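Text matching is the one approach in this list that the rest of the chapter does not demonstrate. As a minimal sketch, assuming a hypothetical character vector `tissues` (not an object defined in the text), base R's `grepl()` returns a logical vector that can be used as an index in the same way as any other logical expression:

```r
# Hypothetical example vector, for illustration only
tissues <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")

# grepl() returns TRUE wherever the regular expression matches;
# "s$" matches strings that end in a lowercase s
ends_in_s <- grepl("s$", tissues)

# Use the logical vector as an index to keep only the matches
tissues[ends_in_s]
```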
10.2 Indexing vectors (1D)
The beauty of indexing is that we can use [] notation for all of these types of questions. We can limit our investigation to specific data points by using the [x] notation to select the \(x^{\text{th}}\) data point in a vector. Let’s return to the simple foo1 example.
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99
# [1] 36
# [1] 6
# [1] 36
# [1] 15 22 29 36
# [1] 15 22 29 36 43 50 57 64 71 78 85 92 99
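The commands that produced the output above are not shown. The following sketch gives commands that are consistent with it, assuming `foo1` is the sequence from 1 to 99 in steps of 7 and that the position is stored in an object called `p` (both are assumptions based on the printed values):

```r
# Assumption: foo1 is a sequence from 1 to 99 in steps of 7 (15 values)
foo1 <- seq(1, 99, by = 7)

foo1                  # print all values
foo1[6]               # the 6th value

p <- 6                # a position can be stored in an object ...
p
foo1[p]               # ... and that object used as the index

foo1[3:p]             # the 3rd to the 6th value
foo1[3:length(foo1)]  # the 3rd to the last value
```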
This is convenient, but becomes very powerful when we start mining our data by combining position with logical expressions. Recall that the result of a logical expression is a TRUE/FALSE answer, which can be used as an index.
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# [11] FALSE FALSE FALSE FALSE FALSE
# Save the logical expression as an object and use it
# as an index to report only the TRUE results
m <- foo1 < 50
foo1[m]
# [1] 1 8 15 22 29 36 43
# [1] 1 8 15 22 29 36 43
# Combine logical expressions to find values:
# Either less than or equal to 22 or greater than 71
foo1 <= 22 | foo1 > 71
# [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [11] FALSE TRUE TRUE TRUE TRUE
# [1] 8
# [1] 1 8 15 22 78 85 92 99
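Some of the outputs above appear without the command that produced them. A minimal sketch of the whole pattern, again assuming `foo1` is the sequence from 1 to 99 in steps of 7 and using a hypothetical object name `keep`:

```r
# Assumption: foo1 is a sequence from 1 to 99 in steps of 7
foo1 <- seq(1, 99, by = 7)

# A logical expression can be placed directly inside []
foo1[foo1 < 50]

# Store a combined logical expression (keep is a hypothetical name)
keep <- foo1 <= 22 | foo1 > 71

# TRUE counts as 1, so sum() reports how many values match
sum(keep)

# Use the logical vector as an index to see which values match
foo1[keep]
```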
10.3 Indexing Dataframes (2D)
Recall that each column in a data frame is the same as a vector. Returning to the foo.df data frame we defined earlier, let’s try to extract the same information using logical expressions instead of the subset() function.
# healthy tissue quantity
# 1 TRUE Liver 1
# 2 FALSE Brain 7
# 3 FALSE Testes 13
# 4 TRUE Muscle 19
# 5 TRUE Intestine 25
# 6 FALSE Heart 31
# [1] Liver Brain Testes Muscle Intestine Heart
# Levels: Brain Heart Intestine Liver Muscle Testes
# [1] TRUE FALSE FALSE FALSE FALSE FALSE
# [1] 1
# How many Liver samples are there?
# TRUE values = 1, so take the sum.
sum(foo.df$tissue == "Liver")
# [1] 1
Notice that we can use an index built from one column (i.e. foo.df$healthy == TRUE) to extract values from another (foo.df$quantity).
# [1] 1 19 25
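The command behind this output is not shown. One form that would give it is sketched below, assuming `foo.df` is rebuilt by hand from the values printed earlier:

```r
# Assumption: foo.df reconstructed from the printed data frame
foo.df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")),
  quantity = c(1, 7, 13, 19, 25, 31)
)

# Index the quantity column with a logical expression on the healthy column
foo.df$quantity[foo.df$healthy == TRUE]

# healthy is already a logical vector, so the comparison can be dropped
foo.df$quantity[foo.df$healthy]
```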
To really make use of indexing with data frames, we can use [x,y] notation to select the \(x^{\text{th}}\) row (i.e. observation) of the \(y^{\text{th}}\) column (i.e. variable).
# healthy tissue quantity
# 4 TRUE Muscle 19
# [1] Muscle
# Levels: Brain Heart Intestine Liver Muscle Testes
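The two outputs above are consistent with selecting the 4th row and then the value in its 2nd column. A sketch under the same assumption as before (`foo.df` rebuilt from the printed values), which also shows that a logical expression can take the place of the row number:

```r
# Assumption: foo.df reconstructed from the printed data frame
foo.df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")),
  quantity = c(1, 7, 13, 19, 25, 31)
)

foo.df[4, ]          # 4th row, all columns (note the trailing comma)
foo.df[4, 2]         # 4th row, 2nd column
foo.df[4, "tissue"]  # the same value, selecting the column by name

# A logical expression in the row position keeps only matching rows
foo.df[foo.df$healthy, c("tissue", "quantity")]
```

Leaving the row or column position empty selects all rows or all columns, respectively.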
10.4 Exercises for Indexing
With the tools we have introduced so far, you will be able to start data mining. Let’s begin by extracting some interesting information from the protein.df data frame. First, let’s return to the exercises from the previous chapters and see if you can answer these questions in an easier way now:
Exercise 10.1 (Find protein values) Given a list of Uniprot IDs:
- GOGA7
- PSA6
- S10AB

Do not use filter() or subset(); use [] instead.
Exercise 10.3 (Find extreme values) For the H/M ratio column, create a new data frame containing only proteins that have a \(\log_{2}\) ratio above 2.0 or below -2.0. Again, try to determine this without creating a new data set or using the subset() function.
Exercise 10.4 (Find top 20 values) Which proteins (i.e. Uniprot IDs) have the 20 highest \(\log_{2}\) H/M and M/L ratios?
The following sections will help you to answer the previous exercises.
10.5 Ordering functions
There are a couple of different ways to think about sorting data. Don’t confuse the following functions:
Function | Description |
---|---|
sort() | Returns a sorted vector (ascending or descending). Calls order() under the hood. |
order() | Returns an index (integer vector) of the positions of the ordered values (ascending or descending). Use this for data frames. Allows ordering on multiple vectors. |
rank() | Returns the ranks of values in a vector, e.g. for use in non-parametric tests. |
arrange() | Part of the tidyverse. Rearranges the rows of a data frame according to one or more variables. |
# [1] "Liver" "Brain" "Testes" "Muscle" "Intestine"
# [6] "Heart"
# [1] "Brain" "Heart" "Intestine" "Liver" "Muscle"
# [6] "Testes"
# [1] 2 6 5 1 4 3
# [1] "Brain" "Heart" "Intestine" "Liver" "Muscle"
# [6] "Testes"
# healthy tissue quantity
# 1 TRUE Liver 1
# 2 FALSE Brain 7
# 3 FALSE Testes 13
# 4 TRUE Muscle 19
# 5 TRUE Intestine 25
# 6 FALSE Heart 31
# healthy tissue quantity
# 1 FALSE Brain 7
# 2 FALSE Heart 31
# 3 TRUE Intestine 25
# 4 TRUE Liver 1
# 5 TRUE Muscle 19
# 6 FALSE Testes 13
# healthy tissue quantity
# 1 FALSE Testes 13
# 2 TRUE Muscle 19
# 3 TRUE Liver 1
# 4 TRUE Intestine 25
# 5 FALSE Heart 31
# 6 FALSE Brain 7
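The commands behind the output above are not shown. The sketch below uses base R only and assumes a hypothetical character vector `foo3` holding the tissue names, plus `foo.df` rebuilt from the printed values. The renumbered row names in the printed data frames appear consistent with arrange(); with base indexing, each row keeps its original row name.

```r
# Assumptions: foo3 is a hypothetical name for the tissue vector,
# and foo.df is reconstructed from the printed data frame
foo3 <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")
foo.df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = foo3,
  quantity = c(1, 7, 13, 19, 25, 31)
)

sort(foo3)         # the values themselves, in alphabetical order
order(foo3)        # the positions that would put the values in order
foo3[order(foo3)]  # indexing with order() gives the same result as sort()

# Use order() in the row position to sort a whole data frame
foo.df[order(foo.df$tissue), ]

# Reverse the order with decreasing = TRUE
foo.df[order(foo.df$tissue, decreasing = TRUE), ]
```

Because order() returns positions rather than values, the same index can be reused to reorder several objects of the same length, which is why it is the function to reach for with data frames.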
10.6 Intersection functions
To answer the last exercise you’ll need to know a bit about combining data. We saw merge functions in the section on data frames (section 7.4). Here is another very useful set of functions: the intersect family. Given vectors x and y:
Function | Description |
---|---|
intersect(x, y) | Values in both vectors x and y. |
setdiff(x, y) | Values in vector x, which are not in y. |
setdiff(y, x) | Values in vector y, which are not in x. |
union(x, y) | Set of all unique values in vectors x and y. |
# [1] 5 6 7 8
# [1] 1 2 3 4
# [1] 9 10 11 12
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
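The four outputs above are consistent with the following commands, assuming `x` holds the integers 1 to 8 and `y` the integers 5 to 12 (the definitions are not shown in the text):

```r
# Assumption: x and y are inferred from the printed results
x <- 1:8
y <- 5:12

intersect(x, y)  # values found in both x and y
setdiff(x, y)    # values in x that are not in y
setdiff(y, x)    # values in y that are not in x
union(x, y)      # all unique values from x and y
```

Note that the argument order matters for setdiff() but not for the content of intersect() or union().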