Chapter 8 Indexing

8.1 Learning Objectives

By the end of this chapter you should be familiar with:

Indexing using [].

We already encountered a type of indexing when we used the subset() function and logical expressions. Here, we go one step further and index directly, without helper functions such as subset(). We can filter our data set according to:

Equations (i.e. using logical expressions, covered earlier),
Position (i.e. using row and column numbers, discussed here), or
Text matching (i.e. using regular expressions).
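
As a rough preview, the three kinds of filter can be sketched side by side. This is only an illustration under stated assumptions: foo1 is rebuilt with seq() from the values printed later in this chapter, tissues is a stand-alone copy of the tissue names used later, and grepl() is just one of several base R functions for text matching.

```r
# Assumption: foo1 is rebuilt from its printed values (1, 8, ..., 99)
foo1 <- seq(1, 99, by = 7)

# Assumption: a stand-alone copy of the tissue names used later on
tissues <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")

# 1. Logical expression: keep values greater than 90
by_logic <- foo1[foo1 > 90]

# 2. Position: keep the first two values
by_position <- foo1[1:2]

# 3. Text matching: keep names that contain the letters "in"
by_text <- tissues[grepl("in", tissues)]

print(by_logic)
print(by_position)
print(by_text)
```

All three use the same [] notation; only the kind of index inside the brackets changes.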

8.2 Indexing vectors (1D)

The beauty of indexing is that the [] notation handles all of these cases. We can limit our investigation to specific data points by using the [x] notation to select the $x^{\text{th}}$ data point in a vector. Let’s return to the simple foo1 example.

foo1         # Our data set

##  [1]  1  8 15 22 29 36 43 50 57 64 71 78 85 92 99

foo1[6]    # The value at position 6

## [1] 36

p            # An object to use for indexing

## [1] 6

foo1[p]    # The value at position p

## [1] 36

foo1[3:p]  # The values between position 3 and p

## [1] 15 22 29 36

foo1[3:length(foo1)] # Position 3 to the end of foo1

##  [1] 15 22 29 36 43 50 57 64 71 78 85 92 99
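
Position indexes do not have to be a single number or a continuous range. The sketch below, which assumes foo1 can be rebuilt with seq() from the values printed above, shows two further base R variations: a vector of positions built with c(), and negative positions, which drop elements instead of selecting them.

```r
# Assumption: foo1 is rebuilt from its printed values (1, 8, ..., 99)
foo1 <- seq(1, 99, by = 7)

# Several separate positions at once
picked <- foo1[c(1, 6, 15)]

# Negative positions remove elements: drop the first 12 values
remaining <- foo1[-(1:12)]

# The last value, whatever the length of the vector
last_value <- foo1[length(foo1)]

print(picked)
print(remaining)
print(last_value)
```

Note the parentheses in -(1:12): without them, -1:12 would create a sequence running from -1 to 12, which is not a valid index.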

This is convenient, but indexing becomes very powerful when we start mining our data by combining position with logical expressions. Recall that the result of a logical expression is a vector of TRUE/FALSE values, one per data point. Used as an index, it returns only the data points where the value is TRUE.

# For every value in foo1, is it less than 50?
foo1 < 50

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE

# Save the logical expression as an object and use it
# as an index to report only the TRUE results
m <- foo1 < 50
foo1[m]

## [1]  1  8 15 22 29 36 43

# Or use the logical expression directly as an index,
# without saving it as an object first
foo1[foo1 < 50]

## [1]  1  8 15 22 29 36 43

# Combine logical expressions to find values:
# Either less than or equal to 22 or greater than 71
foo1 <= 22 | foo1 > 71

##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE

# How many are there?
sum(foo1 <= 22 | foo1 > 71)

## [1] 8

# What are they?
foo1[foo1 <= 22 | foo1 > 71]

## [1]  1  8 15 22 78 85 92 99
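
The same pattern works with the other logical operators. As a minimal sketch, again assuming foo1 is rebuilt from its printed values, the following uses & (AND) to select the values the OR expression above left out, ! (NOT) to invert a condition, and which() to report positions instead of values.

```r
# Assumption: foo1 is rebuilt from its printed values (1, 8, ..., 99)
foo1 <- seq(1, 99, by = 7)

# AND: greater than 22 and, at the same time, at most 71
between <- foo1[foo1 > 22 & foo1 <= 71]

# NOT: everything that is not less than 50
not_small <- foo1[!(foo1 < 50)]

# Positions, not values, of the matching elements
positions <- which(foo1 > 22 & foo1 <= 71)

print(between)
print(not_small)
print(positions)
```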

8.3 Indexing data frames (2D)

Recall that each column in a data frame is a vector, so everything we did with foo1 also works on a single column. Returning to the foo_df data frame we defined earlier, let’s try to extract the same information using logical expressions instead of the subset() function.

# Our data frame
foo_df

## # A tibble: 6 × 3
##   healthy tissue    quantity
##   <lgl>   <chr>        <dbl>
## 1 TRUE    Liver            1
## 2 FALSE   Brain            7
## 3 FALSE   Testes          13
## 4 TRUE    Muscle          19
## 5 TRUE    Intestine       25
## 6 FALSE   Heart           31

# The tissue variable (i.e. column, vector)
foo_df$tissue

## [1] "Liver"     "Brain"     "Testes"    "Muscle"    "Intestine" "Heart"

# Which rows specify Liver?
# First make an index
foo_df$tissue == "Liver"

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

# Which positions in the index are TRUE?
which(foo_df$tissue == "Liver")

## [1] 1

# How many Liver samples are there?
# Each TRUE counts as 1 (and each FALSE as 0), so take the sum.
sum(foo_df$tissue == "Liver")

## [1] 1

Notice that we can use an index built from one column (i.e. foo_df$healthy == TRUE) to extract values from another (foo_df$quantity).

# The quantity values of only healthy observations
foo_df$quantity[foo_df$healthy == TRUE]

## [1]  1 19 25
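
Indexes built from several columns can also be combined. The sketch below is an illustration only: it rebuilds foo_df as a base R data.frame from the values printed above (the original is a tibble), and it relies on the fact that a logical column such as healthy can be used as an index without the == TRUE comparison.

```r
# Assumption: foo_df rebuilt as a base data.frame from its printed values
foo_df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"),
  quantity = seq(1, 31, by = 6),
  stringsAsFactors = FALSE
)

# healthy is already TRUE/FALSE, so it can serve as the index directly
healthy_quantity <- foo_df$quantity[foo_df$healthy]

# Two columns combined: not healthy AND quantity above 10
unhealthy_high <- foo_df$tissue[!foo_df$healthy & foo_df$quantity > 10]

print(healthy_quantity)
print(unhealthy_high)
```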

To really make use of indexing with data frames, we can use [x,y] notation to select the $x^{\text{th}}$ row (i.e. observation) of the $y^{\text{th}}$ column (i.e. variable). Leaving x or y empty selects all rows or all columns, respectively.

# The fourth row (i.e. observation)
foo_df[4,]

## # A tibble: 1 × 3
##   healthy tissue quantity
##   <lgl>   <chr>     <dbl>
## 1 TRUE    Muscle       19

# The fourth row, second column (i.e. variable)
foo_df[4,2]

## # A tibble: 1 × 1
##   tissue
##   <chr> 
## 1 Muscle
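
The row and column positions in [x,y] can themselves be logical expressions or column names. The following sketch assumes foo_df is rebuilt as a base R data.frame from the values printed above. One difference from the tibble output shown in this chapter: a base data.frame simplifies a single cell to a plain vector unless drop = FALSE is given.

```r
# Assumption: foo_df rebuilt as a base data.frame from its printed values
foo_df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"),
  quantity = seq(1, 31, by = 6),
  stringsAsFactors = FALSE
)

# Logical index for rows, all columns
healthy_rows <- foo_df[foo_df$healthy, ]

# Logical index for rows, column names for columns
high_quantity <- foo_df[foo_df$quantity > 15, c("tissue", "quantity")]

# A single cell: base data.frames return a plain vector here
cell <- foo_df[4, 2]

# drop = FALSE keeps the result as a data frame
cell_df <- foo_df[4, 2, drop = FALSE]

print(healthy_rows)
print(high_quantity)
print(cell)
print(cell_df)
```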

8.4 Exercises for Indexing

Complete the exercises on indexing in the diamonds data set chapter.