Chapter 8 Indexing
8.1 Learning Objectives
By the end of this chapter you should be familiar with:
- Indexing using []
We already encountered a type of indexing when we used the subset() function and logical expressions. Here, we will go one step further and index directly, without any helper functions. We can filter our data-set according to:
- Equations (i.e. using logical expressions, see section ??),
- Position (i.e. using row and column numbers, discussed here), or
- Text matching (i.e. using regular expressions).
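As a rough sketch of these three approaches side by side, the example below uses a small character vector named tissues (a hypothetical name, introduced only for illustration) and base R's grepl() for the regular-expression match. It is meant as an orientation, not as the only way to do each step.

```r
# Hypothetical example vector, introduced only for this illustration
tissues <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")

# 1. Logical expression: keep the elements equal to "Brain"
by_logic <- tissues[tissues == "Brain"]

# 2. Position: keep the second element
by_position <- tissues[2]

# 3. Text matching: keep the elements that start with "L" or "M"
by_text <- tissues[grepl("^[LM]", tissues)]

print(by_logic)
print(by_position)
print(by_text)
```

In all three cases the work happens inside the square brackets; only the kind of index (logical, numeric, or the logical result of a pattern match) changes.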
8.2 Indexing vectors (1D)
The beauty of indexing is that we can use [] notation for all of these types of questions. We can limit our investigation to specific data points by using the [x] notation to select the \(x^{\text{th}}\) data point in a vector. Let’s return to the simple foo1 example.
foo1 # Our data-set
## [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99
foo1[6] # The value at position 6
## [1] 36
p # An object to use for indexing
## [1] 6
foo1[p] # The value at position p
## [1] 36
foo1[3:p] # The values between position 3 and p
## [1] 15 22 29 36
foo1[3:length(foo1)] # Position 3 to the end of foo1
## [1] 15 22 29 36 43 50 57 64 71 78 85 92 99
This is convenient, but it becomes very powerful when we start mining our data by combining position with logical expressions. Recall that a logical expression returns a vector of TRUE/FALSE values, one per element, and that this vector can itself be used as an index: only the positions marked TRUE are returned.
# For every value in foo1, is it less than 50?
foo1 < 50
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE
# Save the logical expression as an object and use it
# as an index to report only the TRUE results
m <- foo1 < 50
foo1[m]
## [1] 1 8 15 22 29 36 43
# Use the logical expression as an index to report
# only the TRUE results
foo1[foo1 < 50]
## [1] 1 8 15 22 29 36 43
# Combine logical expressions to find values:
# Either less than or equal to 22 or greater than 71
foo1 <= 22 | foo1 > 71
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [13] TRUE TRUE TRUE
# How many are there?
sum(foo1 <= 22 | foo1 > 71)
## [1] 8
# What are they?
foo1[foo1 <= 22 | foo1 > 71]
## [1] 1 8 15 22 78 85 92 99
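The same pattern extends to the other logical operators, such as & (and) and ! (not). The sketch below assumes that foo1 can be recreated with seq(1, 99, by = 7); this reproduces the values printed in this chapter, but it may not be how foo1 was originally defined. The names middle and outside are chosen here for illustration.

```r
# Assumption: foo1 is recreated to match the values shown in this chapter
foo1 <- seq(1, 99, by = 7)

# AND: values greater than 22 and, at the same time, at most 71
middle <- foo1[foo1 > 22 & foo1 <= 71]

# NOT: negate the whole index to keep everything outside that range
outside <- foo1[!(foo1 > 22 & foo1 <= 71)]

print(middle)
print(outside)
```

Negating the "middle" condition should return the same values as the "either/or" expression used above, since the two conditions are complements of each other.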
8.3 Indexing Dataframes (2D)
Recall that each column in a data frame is a vector. Returning to the foo_df data frame we defined earlier, let’s try to extract the same information using logical expressions instead of the subset() function.
# Our data frame
foo_df
## # A tibble: 6 × 3
##   healthy tissue    quantity
##   <lgl>   <chr>        <dbl>
## 1 TRUE    Liver            1
## 2 FALSE   Brain            7
## 3 FALSE   Testes          13
## 4 TRUE    Muscle          19
## 5 TRUE    Intestine       25
## 6 FALSE   Heart           31
# The tissue variable (i.e. column, vector)
foo_df$tissue
## [1] "Liver" "Brain" "Testes" "Muscle" "Intestine" "Heart"
# Which rows specify Liver?
# First make an index
foo_df$tissue == "Liver"
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
# Which values in the index are TRUE
which(foo_df$tissue == "Liver")
## [1] 1
# How many Liver samples are there?
# TRUE values = 1, so take the sum.
sum(foo_df$tissue == "Liver")
## [1] 1
Notice that we can use an index built from one column (i.e. foo_df$healthy == TRUE) to extract values from another (foo_df$quantity).
# The quantity values of only healthy observations
foo_df$quantity[foo_df$healthy == TRUE]
## [1] 1 19 25
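Conditions on several columns can be combined in the same way. This is a minimal sketch, assuming foo_df is rebuilt as a base R data.frame with the values printed above. The chapter's object is a tibble, but $ and logical indexing of a single column behave the same for this purpose. The result names healthy_quantity and healthy_high are hypothetical.

```r
# Assumption: foo_df rebuilt as a base data.frame from the printed values
foo_df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"),
  quantity = c(1, 7, 13, 19, 25, 31),
  stringsAsFactors = FALSE
)

# A logical column is already a valid index, so "== TRUE" is optional
healthy_quantity <- foo_df$quantity[foo_df$healthy]

# Tissues of healthy observations whose quantity is above 10
healthy_high <- foo_df$tissue[foo_df$healthy & foo_df$quantity > 10]

print(healthy_quantity)
print(healthy_high)
```

Because every column has the same length, an index built from any combination of columns lines up with every other column.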
To really make use of indexing with data frames, we can use [x,y] notation to select the \(x^{\text{th}}\) row (i.e. observation) of the \(y^{\text{th}}\) column (i.e. variable).
# The fourth row (i.e. observation)
foo_df[4,]
## # A tibble: 1 × 3
##   healthy tissue quantity
##   <lgl>   <chr>     <dbl>
## 1 TRUE    Muscle       19
# The fourth row, second column (i.e. variable)
foo_df[4,2]
## # A tibble: 1 × 1
##   tissue
##   <chr>
## 1 Muscle
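The row position in [x,y] notation does not have to be a number: a logical index works there too, and columns can be selected by name. The sketch below shows how this can stand in for subset(). It again assumes foo_df is rebuilt as a base R data.frame from the printed values, and the result names are hypothetical.

```r
# Assumption: foo_df rebuilt as a base data.frame from the printed values
foo_df <- data.frame(
  healthy  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  tissue   = c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"),
  quantity = c(1, 7, 13, 19, 25, 31),
  stringsAsFactors = FALSE
)

# Rows where healthy is TRUE, all columns
healthy_rows <- foo_df[foo_df$healthy, ]

# The same rows, but only two columns, selected by name
healthy_subset <- foo_df[foo_df$healthy, c("tissue", "quantity")]

# For comparison: the same selection written with subset()
via_subset <- subset(foo_df, healthy, select = c(tissue, quantity))

print(healthy_rows)
print(healthy_subset)
print(via_subset)
```

One difference to keep in mind: with a base data.frame, selecting a single column such as foo_df[4, 2] returns a plain vector, whereas the tibble used in this chapter returns a 1 × 1 tibble, as shown in the output above.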
8.4 Exercises for Indexing
Complete the exercises on indexing in the diamonds data set chapter.