11 Factor Variables
Learning Objectives
In this chapter you’ll learn:
- How R encodes categorical variables
Factor variables are explanatory, categorical variables. Knowing how to set up and use factors in R is an important aspect of data analysis.
11.1 The Structure of Factors
A factor is a class that can be assigned to any of the atomic vector types. When this occurs, the underlying vector is converted to an integer vector, and its unique values are saved as a character vector that describes the levels.
Recall the tissue column in foo_df:
# Our original character vector
foo3
#> [1] "Liver" "Brain" "Testes" "Muscle"
#> [5] "Intestine" "Heart"
typeof(foo3)
#> [1] "character"
class(foo3)
#> [1] "character"
# We used foo3 to make foo_df$tissue
foo_df$tissue
#> [1] "Liver" "Brain" "Testes" "Muscle"
#> [5] "Intestine" "Heart"
typeof(foo_df$tissue)
#> [1] "character"
# It is not a factor yet; the class is still character:
class(foo_df$tissue)
#> [1] "character"
Storing labels as a factor can be a major space-saving mechanism. Imagine saving the same few labels thousands of times over in a very large data frame: that takes up a lot of memory. With a factor, R only needs to save a single integer for each entry and stores each label as a character string only once.
In R, all factors are simple integer vectors assigned a class of factor. This means that they have an associated character vector, the levels attribute, to describe the levels.
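The integer-plus-levels structure can be inspected directly. The following is a minimal sketch; `tissue_chr` and `tissue_fct` are hypothetical names used only for illustration and are not objects from the earlier chapters.

```r
# Hypothetical character vector with repeated labels
tissue_chr <- c("Liver", "Brain", "Liver", "Heart", "Brain", "Liver")
tissue_fct <- factor(tissue_chr)

typeof(tissue_fct)      # storage type of the underlying vector
#> [1] "integer"
class(tissue_fct)       # the class attribute that makes it a factor
#> [1] "factor"
levels(tissue_fct)      # each label is stored once, sorted alphabetically
#> [1] "Brain" "Heart" "Liver"
as.integer(tissue_fct)  # the integer codes, which index into the levels
#> [1] 3 1 3 2 1 3
```

Each integer code is simply the position of that element's label in the levels vector.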
11.2 Factors and Defining Linear Models
Let’s take a look at how factors can affect how our statistics are performed. Recall that for the PlantGrowth data set we used:
class(PlantGrowth$group)
#> [1] "factor"
levels(PlantGrowth$group)
#> [1] "ctrl" "trt1" "trt2"
# tapply(PlantGrowth$weight, PlantGrowth$group, mean)
summary(lm(weight ~ group, data = PlantGrowth))
#>
#> Call:
#> lm(formula = weight ~ group, data = PlantGrowth)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.071 -0.418 -0.006 0.263 1.369
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.032 0.197 25.53 <2e-16 ***
#> grouptrt1 -0.371 0.279 -1.33 0.194
#> grouptrt2 0.494 0.279 1.77 0.088 .
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.62 on 27 degrees of freedom
#> Multiple R-squared: 0.264, Adjusted R-squared: 0.21
#> F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
# Reorder the levels
PlantGrowth$group <- factor(PlantGrowth$group,
levels = c("trt2", "trt1", "ctrl"))
levels(PlantGrowth$group)
#> [1] "trt2" "trt1" "ctrl"
summary(lm(weight ~ group, data = PlantGrowth))
#>
#> Call:
#> lm(formula = weight ~ group, data = PlantGrowth)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.071 -0.418 -0.006 0.263 1.369
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.526 0.197 28.03 <2e-16 ***
#> grouptrt1 -0.865 0.279 -3.10 0.0045 **
#> groupctrl -0.494 0.279 -1.77 0.0877 .
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.62 on 27 degrees of freedom
#> Multiple R-squared: 0.264, Adjusted R-squared: 0.21
#> F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
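The coefficient estimates are different because the order of the factor levels was different! The first level is used as the reference: the intercept is the mean of that level, and every other coefficient is a difference from it. The overall fit (residual standard error, R-squared and F-statistic) is unchanged.
Let’s consider what happens when group is a character and not a factor: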
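If only the reference level needs to change, base R's `relevel()` can move a single level to the front without retyping the others. The sketch below works on a copy, here called `pg` (a hypothetical name), so the built-in data set is left untouched.

```r
# Work on a copy of the built-in data set
pg <- datasets::PlantGrowth
pg$group <- factor(pg$group, levels = c("ctrl", "trt1", "trt2"))
fit_ctrl <- lm(weight ~ group, data = pg)

# relevel() makes "trt2" the first (reference) level
pg$group <- relevel(pg$group, ref = "trt2")
fit_trt2 <- lm(weight ~ group, data = pg)

levels(pg$group)
#> [1] "trt2" "ctrl" "trt1"

# The coefficient names show which level is the reference in each model
names(coef(fit_ctrl))
names(coef(fit_trt2))

# The F-statistic does not depend on the choice of reference level
all.equal(summary(fit_ctrl)$fstatistic, summary(fit_trt2)$fstatistic)
#> [1] TRUE
```

Only the parameterisation of the model changes with the reference level, not the quality of the fit.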
myDF <- PlantGrowth
myDF$group <- as.character(myDF$group)
str(myDF)
#> tibble [30 × 2] (S3: tbl_df/tbl/data.frame)
#> $ weight: num [1:30] 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
#> $ group : chr [1:30] "ctrl" "ctrl" "ctrl" "ctrl" ...
# tapply(myDF$weight, myDF$group, mean)
summary(lm(weight ~ group, data = myDF))
#>
#> Call:
#> lm(formula = weight ~ group, data = myDF)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.071 -0.418 -0.006 0.263 1.369
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.032 0.197 25.53 <2e-16 ***
#> grouptrt1 -0.371 0.279 -1.33 0.194
#> grouptrt2 0.494 0.279 1.77 0.088 .
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.62 on 27 degrees of freedom
#> Multiple R-squared: 0.264, Adjusted R-squared: 0.21
#> F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
We get the same results as in the first model! Under the hood, R has quietly converted our character vector to a factor. However, here we don’t have any say in the order of the levels: they are sorted alphabetically, so ctrl becomes the reference again. It is always advantageous to establish our factors manually.
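The difference between the default and an explicit level order can be seen on a small example. This is a minimal sketch; the vector `dose_chr` is made up for illustration.

```r
# Hypothetical dose labels, deliberately not in alphabetical order
dose_chr <- c("low", "high", "medium", "low", "high")

# Default: the levels are the sorted unique values
levels(factor(dose_chr))
#> [1] "high" "low" "medium"

# Explicit: we decide the order, and with it the reference level
dose_fct <- factor(dose_chr, levels = c("low", "medium", "high"))
levels(dose_fct)
#> [1] "low" "medium" "high"
```

With the default order, "high" would silently become the reference level in a model.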
11.3 Factors and Importing Data
Review the section on the stringsAsFactors argument in the importing data section. When data is imported into R, each variable is automatically assigned a type. If a column contains characters, R v4.0 and above will keep it as a character vector when using the read.*() functions. Earlier versions will think it’s a categorical variable and classify it as a factor, unless you set stringsAsFactors = FALSE.
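Setting the argument explicitly gives the same result in every R version. The sketch below reads a tiny, made-up CSV supplied inline through the `text` argument of `read.csv()` instead of a file on disk.

```r
# A small, invented data set; no file is needed
csv_text <- "tissue,weight
Liver,1.2
Brain,0.9
Liver,1.4"

# Being explicit avoids depending on the version-specific default
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$tissue)
#> [1] "character"
class(as_fct$tissue)
#> [1] "factor"
levels(as_fct$tissue)
#> [1] "Brain" "Liver"
```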
11.4 Converting from Factors to other Classes
Converting between factor and other classes can be troublesome. If the labels are numerical, you cannot simply call as.numeric(), because that returns the underlying integer codes rather than the labels. The factor must first be converted to character and then to numeric:
foo5 <- 21:30
typeof(foo5)
#> [1] "integer"
class(foo5) # An integer vector
#> [1] "integer"
foo5
#> [1] 21 22 23 24 25 26 27 28 29 30
foo5 <- as.factor(foo5)
typeof(foo5)
#> [1] "integer"
class(foo5) # An integer vector of factor class
#> [1] "factor"
foo5
#> [1] 21 22 23 24 25 26 27 28 29 30
#> Levels: 21 22 23 24 25 26 27 28 29 30
# What is actually stored: the integer codes (like an ID).
as.integer(foo5)
#> [1] 1 2 3 4 5 6 7 8 9 10
# The names of the levels that those codes index, in order
levels(foo5)
#> [1] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
foo5 <- as.integer(foo5) # or also use as.numeric()
typeof(foo5)
#> [1] "integer"
class(foo5) # An integer vector, but...
#> [1] "integer"
# This returns the integer codes, not the level names
foo5 # The values are gone!! Bad Move!!
#> [1] 1 2 3 4 5 6 7 8 9 10
# Do this instead:
foo5 <- as.factor(21:30) # re-establish factor
foo5 <- as.numeric(as.character(foo5))
typeof(foo5)
#> [1] "double"
class(foo5) # A numeric vector with the proper values
#> [1] "numeric"
foo5
#> [1] 21 22 23 24 25 26 27 28 29 30
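The R help page for `factor` also mentions a second idiom, `as.numeric(levels(f))[f]`, which converts each level only once and then indexes the result with the factor. A minimal sketch, using a made-up factor `dose` with repeated numeric labels:

```r
# Hypothetical factor whose labels are numbers, with repeats
dose <- factor(c("10", "2.5", "10", "40"))

# Integer codes, not the values
as.numeric(dose)
#> [1] 1 2 1 3

# Two ways to recover the actual values
vals_a <- as.numeric(as.character(dose))  # via character, element by element
vals_b <- as.numeric(levels(dose))[dose]  # convert the levels once, then index

identical(vals_a, vals_b)
#> [1] TRUE
```

Both approaches give the same values; the second can be slightly more efficient when a factor has many elements but few levels.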
11.5 Ordered Factors
Recall that there are two types of categorical variables, nominal and ordinal. So far we have only been dealing with nominal variables, but we can also add order to a factor, as in:
foo6 <- c("small", "large", "small", "medium", "small", "medium")
factor(foo6) # improper order
#> [1] small large small medium small medium
#> Levels: large medium small
# none of these levels occur in the data, so every value becomes NA
factor(foo6, levels = c("low", "middle", "high"))
#> [1] <NA> <NA> <NA> <NA> <NA> <NA>
#> Levels: low middle high
# proper order
factor(foo6, levels = c("small", "medium", "large"))
#> [1] small large small medium small medium
#> Levels: small medium large
# defined proper order. factor(x, ordered=TRUE) is also possible.
ordered(foo6, levels = c("small", "medium", "large"))
#> [1] small large small medium small medium
#> Levels: small < medium < large
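One practical consequence of an ordered factor is that comparison operators and functions such as `max()` respect the level order. A minimal sketch, using a hypothetical vector `sizes` built from the same values as above:

```r
sizes_chr <- c("small", "large", "small", "medium", "small", "medium")
sizes <- factor(sizes_chr,
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# Comparisons follow the level order, not the alphabet
sizes < "large"
#> [1] TRUE FALSE TRUE TRUE TRUE TRUE

# Largest category present in the data
as.character(max(sizes))
#> [1] "large"
```

The same comparison on an unordered factor returns NA with a warning, because R has no order to compare against.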
11.6 Adding & Removing Factor Levels
To add new information to a factor, we first have to add the new level. This is because we have already defined all the possible categories for the categorical variable in R.
# create a 6-element long factor
foo7 <- factor(foo6,
levels = c("small", "medium", "large"))
levels(foo7)
#> [1] "small" "medium" "large"
foo7[7] <- "extra large"
#> Warning in `[<-.factor`(`*tmp*`, 7, value = "extra large"):
#> invalid factor level, NA generated
foo7
#> [1] small large small medium small medium <NA>
#> Levels: small medium large
levels(foo7)
#> [1] "small" "medium" "large"
foo7 <- factor(foo7, levels = c(levels(foo7), "X_large"))
foo7[7] <- "X_large"
foo7
#> [1] small large small medium small medium X_large
#> Levels: small medium large X_large
If we want to drop a level after removing all occurrences of a group, we just have to reestablish our factor.
foo7 <- foo7[foo7 != "small"]
# Old levels remain in the character vector.
# We need to reinitialize the factor to get rid of them.
foo7
#> [1] large medium medium X_large
#> Levels: small medium large X_large
foo7 <- factor(foo7)
foo7
#> [1] large medium medium X_large
#> Levels: medium large X_large
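Base R also offers shortcuts for both steps: assigning to `levels()` appends a level without rebuilding the factor, and `droplevels()` removes unused levels. The sketch below uses a fresh, hypothetical factor `sizes` so that it runs on its own.

```r
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))

# Append a new level, then use it
levels(sizes) <- c(levels(sizes), "X_large")
sizes[4] <- "X_large"

# Remove all "small" entries and drop the now-unused level in one step
sizes <- droplevels(sizes[sizes != "small"])
levels(sizes)
#> [1] "medium" "large" "X_large"
```

`droplevels()` also accepts a whole data frame, in which case it drops unused levels from every factor column.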
11.7 Relabeling Factor Levels
A factor’s levels are just a character vector, like the column names of a data frame. This means that we only have to change a label in that character vector to change all occurrences of that label in the factor.
# The levels for the tissue factor
foo_df$tissue <- as.factor(foo_df$tissue)
foo_df$tissue
levels(foo_df$tissue)
The levels are stored as a character vector, so we can just change a level directly:
levels(foo_df$tissue)[5] <- "Tongue"
foo_df$tissue
Or we can specify it using a regular expression in the grep() function. "^B" means look for anything that begins with a capital B: