Chapter 10 Element 5: Indexing

10.1 Learning Objectives

By the end of this chapter you should be familiar with:

Indexing using [], and
Base package plotting functions

We already encountered a type of indexing when we used the subset() function and logical expressions. Here, we will go one step further and perform indexing without using any unnecessary functions. We can filter our data-set according to:

Equations (i.e. using logical expressions, see section 9
Position (i.e. using row & column number, discussed here.), or
Text matching (i.e. using regular expressions).

10.2 Indexing vectors (1D)

The beauty of indexing is that we can use [] notation for all of these types of questions. We can limit our investigation to specific data points by using the [x] notation to select the $x{}^{th}$ data point in a vector. Let’s return to the simple foo1 example.

foo1         # Our data-set

#  [1]  1  8 15 22 29 36 43 50 57 64 71 78 85 92 99

foo1[6]    # The value at position 6

# [1] 36

p            # An object to use for indexing

# [1] 6

foo1[p]    # The value at position p

# [1] 36

foo1[3:p]  # The values between position 3 and p

# [1] 15 22 29 36

foo1[3:length(foo1)] # Position 3 to the end of foo1

#  [1] 15 22 29 36 43 50 57 64 71 78 85 92 99

This is convenient, but becomes very powerful when we start mining our data by combining position with logical expressions. Recall that the result of a logical expression is a TRUE/FALSE answer, which can be used as an index.

# For every value in foo1, is it less than 50?
foo1 < 50

#  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# [11] FALSE FALSE FALSE FALSE FALSE

# Save the logical expression as an object and use it
# as an index to report only the TRUE results
m <- foo1 < 50
foo1[m]

# [1]  1  8 15 22 29 36 43

# Use the logical expression as an index to report
# only the TRUE results
foo1[foo1<50]

# [1]  1  8 15 22 29 36 43

# Combine logical expressions to find values:
# Either less than or equal to 22 or greater than 71
foo1 <= 22 | foo1 > 71

#  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [11] FALSE  TRUE  TRUE  TRUE  TRUE

# How many are there:
sum(foo1 <= 22 | foo1 > 71)

# [1] 8

# What are they?
foo1[foo1 <= 22 | foo1 > 71]

# [1]  1  8 15 22 78 85 92 99

10.3 Indexing Dataframes (2D)

Recall that each column in a data frame is the same as a vector. Returning to the foo.df data frame we defined earlier, let’s try to extract the same information using logical expressions instead of the subset() function.

# Our data frame
foo.df

#   healthy    tissue quantity
# 1    TRUE     Liver        1
# 2   FALSE     Brain        7
# 3   FALSE    Testes       13
# 4    TRUE    Muscle       19
# 5    TRUE Intestine       25
# 6   FALSE     Heart       31

# The tissue variable (i.e. column, vector)
foo.df$tissue

# [1] Liver     Brain     Testes    Muscle    Intestine Heart    
# Levels: Brain Heart Intestine Liver Muscle Testes

# Which rows specify Liver?
# First make an index
foo.df$tissue == "Liver"

# [1]  TRUE FALSE FALSE FALSE FALSE FALSE

# Which values in the index are TRUE
which(foo.df$tissue == "Liver")

# [1] 1

# How many Liver samples are there?
# TRUE values = 1, so take the sum.
sum(foo.df$tissue == "Liver")

# [1] 1

Notice that we can combine an index of one column (i.e. foo.df\$healthy == TRUE) to extract values from another (foo.df\$quantity).

# The quantity values of only healthy observations
foo.df$quantity[foo.df$healthy == TRUE]

# [1]  1 19 25

To really make use of indexing with data frames, we can use [x,y] notation to select the $x{}^{th}$ row (i.e. observation) of the $y{}^{th}$ column (i.e. variable).

# The fourth row (i.e. observation)
foo.df[4,]

#   healthy tissue quantity
# 4    TRUE Muscle       19

# The fourth row, second column (i.e. variable)
foo.df[4,2]

# [1] Muscle
# Levels: Brain Heart Intestine Liver Muscle Testes

10.4 Exercises for Indexing

With the tools we have introduced so far, you will be able to start data mining. Let’s begin by extracting some interesting information from the protein.df data frame. First, let’s return to the exercises from the previous chapters and see if you can answer these questions in an easier way now:

Exercise 10.1 (Find protein values) Given a list of Uniprot IDs:

GOGA7
PSA6
S10AB

Find the corresponding $log_{2}$ ratios for each of the three conditions (H/M, M/L, H/L). Don’t use filter() or subset(), use [] instead.

Exercise 10.2 (Find significant hits) For the H/M ratio column, create a new data frame containing only proteins that have a p-value less than 0.05. Don’t use filter() or subset(), use [] instead.

Exercise 10.3 (Find extreme values) For the H/M ratio column, create a new data frame containing only proteins that have a $log_{2}$ ratio above 2.0 or below -2.0. Again, try to determine this without creating a new data set or using the subset() function.

Exercise 10.4 (Find top 20 values) Which proteins (i.e. Uniprot IDs) have the 20 highest $log_{2}$ H/M and M/L ratios?

Exercise 10.5 (Find intersections) Which proteins appear in the top twenty lists of both HM and ML?

The following sections will help you to answer the previous exercises

10.5 Ordering functions

There are a couple different ways to think about sorting data. Don’t confuse the following functions:

Table 10.1: Examples of some simple and frequently used functions for reordering data.
Function	Description
`sort()`	Returns a sorted vector (ascending or descending). Calls `order()` under-the-hood.
`order()`	Returns an index (integer vector) of the position of the ordered values (ascending or descending). Use this for data frames. Allows ordering on multiple vectors.
`rank()`	Returns the ranks of values in a vector, e.g. in non-parametric tests.
`arrange()`	Part of the `tidyverse`. Rearranges a variable

# A vector (or a variable in a data frame):
foo3

# [1] "Liver"     "Brain"     "Testes"    "Muscle"    "Intestine"
# [6] "Heart"

sort(foo3) # values after ordering

# [1] "Brain"     "Heart"     "Intestine" "Liver"     "Muscle"   
# [6] "Testes"

order(foo3) # index position after ordering

# [1] 2 6 5 1 4 3

foo3[order(foo3)] # i.e. sort()

# [1] "Brain"     "Heart"     "Intestine" "Liver"     "Muscle"   
# [6] "Testes"

foo.df

#   healthy    tissue quantity
# 1    TRUE     Liver        1
# 2   FALSE     Brain        7
# 3   FALSE    Testes       13
# 4    TRUE    Muscle       19
# 5    TRUE Intestine       25
# 6   FALSE     Heart       31

foo.df %>% 
  arrange(tissue)

#   healthy    tissue quantity
# 1   FALSE     Brain        7
# 2   FALSE     Heart       31
# 3    TRUE Intestine       25
# 4    TRUE     Liver        1
# 5    TRUE    Muscle       19
# 6   FALSE    Testes       13

foo.df %>% 
  arrange(desc(tissue))

#   healthy    tissue quantity
# 1   FALSE    Testes       13
# 2    TRUE    Muscle       19
# 3    TRUE     Liver        1
# 4    TRUE Intestine       25
# 5   FALSE     Heart       31
# 6   FALSE     Brain        7

10.6 Intersection functions

To answer the last exercise you’ll need to know a bit about combining data. We saw merge functions in the section on data frames (page 7.4). Here, is another set of functions which is very useful: the intersect family. Given vectors x and y:

Table 10.2: The `intersect()` family of functions.
Function	Description
`intersect(x, y)`	Values in both vectors `x` and `y`.
`setdiff(x, y)`	Values in vector `x`, which are not in `y`.
`setdiff(y, x)`	Values in vector `x`, which are not in `y`.
`union(x, y)`	Set of all unique values in vectors `x` and `y`.

xx <- 1:8
yy <- 5:12

intersect(xx, yy)

# [1] 5 6 7 8

setdiff(xx, yy)

# [1] 1 2 3 4

setdiff(yy, xx)

# [1]  9 10 11 12

union(xx, yy)

#  [1]  1  2  3  4  5  6  7  8  9 10 11 12