Chapter 11 Factor Variables
Learning Objectives
In this chapter you’ll learn:
- How R encodes categorical variables
Factor variables are explanatory, categorical variables. Knowing how to set up and use factors in R is an important aspect of data analysis.
11.1 The Structure of Factors
A factor is a class that can be created from any of the atomic vector types. When this occurs, the values are replaced by integer codes, and the unique values are saved once as a character vector that describes the levels.
Recall foo_df:
# Our original character vector
foo3
## [1] "Liver" "Brain" "Testes" "Muscle" "Intestine" "Heart"
typeof(foo3)
## [1] "character"
class(foo3)
## [1] "character"
# We used foo3 to make foo_df$tissue
foo_df$tissue
## [1] "Liver" "Brain" "Testes" "Muscle" "Intestine" "Heart"
typeof(foo_df$tissue)
## [1] "character"
# It was not converted to a factor; it is still a character vector:
class(foo_df$tissue)
## [1] "character"
Factors can be a major space-saving mechanism. Imagine saving many labels thousands of times over in a very large data frame: it takes up a lot of memory. With a factor, R just needs to save a single integer for each value and stores each label as a character only once.
In R, every factor is simply an integer vector with the class factor and a levels attribute, which is the character vector that describes the levels.
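To make this structure concrete, here is a minimal sketch that takes a factor apart. The vector tissue is a made-up example and not part of the chapter's data.

```r
# Hypothetical example vector (not from the chapter's data)
tissue <- factor(c("Liver", "Brain", "Liver", "Heart", "Brain"))

typeof(tissue)      # "integer": the underlying storage type
class(tissue)       # "factor"
levels(tissue)      # "Brain" "Heart" "Liver": each label is stored once
as.integer(tissue)  # 3 1 3 2 1: the integer code for each element
attributes(tissue)  # the levels and class attributes attached to the integers
```

Each integer code is simply the position of that element's label in levels(tissue).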
11.2 Factors and Defining Linear Models
Let’s take a look at how factors can affect how our statistics are performed. Recall that we used the PlantGrowth data set:
class(PlantGrowth$group)
## [1] "factor"
levels(PlantGrowth$group)
## [1] "ctrl" "trt1" "trt2"
# tapply(PlantGrowth$weight, PlantGrowth$group, mean)
summary(lm(weight ~ group, data = PlantGrowth))
##
## Call:
## lm(formula = weight ~ group, data = PlantGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.071 -0.418 -0.006 0.263 1.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.032 0.197 25.53 <2e-16 ***
## grouptrt1 -0.371 0.279 -1.33 0.194
## grouptrt2 0.494 0.279 1.77 0.088 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.62 on 27 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.21
## F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
# Reorder the levels
PlantGrowth$group <- factor(PlantGrowth$group,
                            levels = c("trt2", "trt1", "ctrl"))
levels(PlantGrowth$group)
## [1] "trt2" "trt1" "ctrl"
summary(lm(weight ~ group, data = PlantGrowth))
##
## Call:
## lm(formula = weight ~ group, data = PlantGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.071 -0.418 -0.006 0.263 1.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.526 0.197 28.03 <2e-16 ***
## grouptrt1 -0.865 0.279 -3.10 0.0045 **
## groupctrl -0.494 0.279 -1.77 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.62 on 27 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.21
## F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
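The coefficient tables are different because the order of the factor levels was different! The first level is used as the reference, so the intercept is the mean of that group and every other coefficient is a difference from it. The overall fit (residual standard error, R-squared and F-statistic) is unchanged.
Let’s consider what happens when group is a character and not a factor: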
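If only the reference level needs to change, relevel() from the stats package is a shorter alternative to rebuilding the whole factor. The sketch below works on a fresh copy of the built-in data set; the names pg and fit are chosen here for illustration.

```r
# Work on a fresh copy of the built-in data set (pg is a hypothetical name)
pg <- datasets::PlantGrowth
levels(pg$group)    # "ctrl" "trt1" "trt2"

# Make trt2 the reference level; the remaining levels keep their order
pg$group <- relevel(pg$group, ref = "trt2")
levels(pg$group)    # "trt2" "ctrl" "trt1"

# The intercept is now the mean of the trt2 group
fit <- lm(weight ~ group, data = pg)
coef(fit)
```

Unlike factor(..., levels = ...), relevel() only moves one level to the front, so it is mainly useful for choosing a baseline.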
myDF <- PlantGrowth
myDF$group <- as.character(myDF$group)
str(myDF)
## tibble [30 × 2] (S3: tbl_df/tbl/data.frame)
## $ weight: num [1:30] 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : chr [1:30] "ctrl" "ctrl" "ctrl" "ctrl" ...
# tapply(myDF$weight, myDF$group, mean)
summary(lm(weight ~ group, data = myDF))
##
## Call:
## lm(formula = weight ~ group, data = myDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.071 -0.418 -0.006 0.263 1.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.032 0.197 25.53 <2e-16 ***
## grouptrt1 -0.371 0.279 -1.33 0.194
## grouptrt2 0.494 0.279 1.77 0.088 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.62 on 27 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.21
## F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
We get the same results as in the first model! Under the hood, R has quietly converted our character vector to a factor, with the levels sorted alphabetically. However, here we don’t have any say in the order in which the levels are used. It is generally safer to establish our factors manually.
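One way to check which levels a model actually used is the xlevels component of the fitted lm object. This is a small sketch on a copy of the built-in data; chr_df, fit_chr and fit_fct are hypothetical names.

```r
# chr_df is a hypothetical copy with group stored as a character vector
chr_df <- datasets::PlantGrowth
chr_df$group <- as.character(chr_df$group)

fit_chr <- lm(weight ~ group, data = chr_df)
fit_chr$xlevels   # the levels lm() derived from the character column

# Defining the factor ourselves fixes the reference level explicitly
chr_df$group <- factor(chr_df$group, levels = c("trt2", "trt1", "ctrl"))
fit_fct <- lm(weight ~ group, data = chr_df)
fit_fct$xlevels   # now in the order we specified
```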
11.3 Factors and Importing Data
Review the section on the stringsAsFactors argument in the importing data section. When data is imported into R, each variable is automatically assigned a type. If a column contains characters, R v4.0 and above will keep it as a character vector when using the read.*() functions. Earlier versions will think it’s a categorical variable and classify it as a factor, unless you set stringsAsFactors = FALSE.
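A minimal sketch of the difference, assuming a small inline CSV (the data and the names csv_text, as_chr and as_fct are invented for this example) so that no external file is needed:

```r
# Hypothetical inline CSV so the example needs no external file
csv_text <- "tissue,weight
Liver,1.2
Brain,1.4
Liver,1.1"

# Passing the argument explicitly gives the same result in every R version
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$tissue)   # "character"
class(as_fct$tissue)   # "factor"
levels(as_fct$tissue)  # "Brain" "Liver"
```

Setting the argument explicitly makes a script behave the same way regardless of the R version it runs on.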
11.4 Converting from Factors to other Classes
Converting between factor and other classes can be troublesome. If the levels are numerical, you cannot simply call as.numeric(). They must first be converted to characters and then to numbers:
foo5 <- 21:30
typeof(foo5)
## [1] "integer"
class(foo5) # An integer vector
## [1] "integer"
foo5
## [1] 21 22 23 24 25 26 27 28 29 30
foo5 <- as.factor(foo5)
typeof(foo5)
## [1] "integer"
class(foo5) # An integer vector of factor class
## [1] "factor"
foo5
## [1] 21 22 23 24 25 26 27 28 29 30
## Levels: 21 22 23 24 25 26 27 28 29 30
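# What is actually stored (like an ID).
# Note: labels() returns the element positions, which match the stored
# codes here only because each value occurs once, in order;
# as.integer(foo5) returns the codes in general.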
labels(foo5)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
# The names of the levels assigned to each label, in order
levels(foo5)
## [1] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
foo5 <- as.integer(foo5) # or also use as.numeric()
typeof(foo5)
## [1] "integer"
class(foo5) # An integer vector, but...
## [1] "integer"
# This takes the labels, not the levels assigned to those labels
# The values are gone!! Bad Move!!
foo5
## [1] 1 2 3 4 5 6 7 8 9 10
# Do this instead:
foo5 <- as.factor(21:30) # re-establish factor
foo5 <- as.numeric(as.character(foo5))
typeof(foo5)
## [1] "double"
class(foo5) # A numeric vector with the proper values
## [1] "numeric"
foo5
## [1] 21 22 23 24 25 26 27 28 29 30
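The same conversion can be sketched on a factor whose values repeat and are out of order, which makes the difference between codes and values easier to see. The vector f is a made-up example. The last line shows an alternative idiom that converts each level once and then indexes by the factor.

```r
# Hypothetical factor whose codes differ from both positions and values
f <- factor(c("30", "10", "20", "10"))

as.integer(f)                # 3 1 2 1: the internal codes, not the values
as.numeric(as.character(f))  # 30 10 20 10: the approach shown above
as.numeric(levels(f))[f]     # 30 10 20 10: convert each level once, then index
```

Both correct approaches give the same result; the second does less work on long factors with few levels because only the levels are converted.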
11.5 Ordered Factors
Recall that there are two types of categorical variables, nominal and ordinal. So far we have only been dealing with nominal variables, but we can also add order to a factor, as in:
foo6 <- c("small", "large", "small", "medium", "small", "medium")
factor(foo6) # improper order
## [1] small large small medium small medium
## Levels: large medium small
# these levels don't occur in the data, so every value becomes NA!
factor(foo6, levels = c("low", "middle", "high"))
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: low middle high
# proper order
factor(foo6, levels = c("small", "medium", "large"))
## [1] small large small medium small medium
## Levels: small medium large
# defined proper order. factor(x, ordered=TRUE) is also possible.
ordered(foo6, levels = c("small", "medium", "large"))
## [1] small large small medium small medium
## Levels: small < medium < large
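What the ordering buys us is that comparison operators become meaningful. A short sketch, using a hypothetical vector sizes with the same values as foo6:

```r
# sizes is a hypothetical name; the values mirror foo6 above
sizes <- ordered(c("small", "large", "small", "medium", "small", "medium"),
                 levels = c("small", "medium", "large"))

sizes < "large"           # element-wise comparison against a level
sizes[sizes >= "medium"]  # keep medium and large only
max(sizes)                # the largest level present
```

On an unordered factor these comparisons return NA with a warning, because R has no way to rank the levels.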
11.6 Adding & Removing Factor Levels
To add new information to a factor, we first have to add the new level. This is because we have already defined all the possible categories for the categorical variable in R.
# create a 6-element long factor
foo7 <- factor(foo6,
               levels = c("small", "medium", "large"))
levels(foo7)
## [1] "small" "medium" "large"
foo7[7] <- "extra large"
## Warning in `[<-.factor`(`*tmp*`, 7, value = "extra large"): invalid factor
## level, NA generated
foo7
## [1] small large small medium small medium <NA>
## Levels: small medium large
levels(foo7)
## [1] "small" "medium" "large"
foo7 <- factor(foo7, levels = c(levels(foo7), "X_large"))
foo7[7] <- "X_large"
foo7
## [1] small large small medium small medium X_large
## Levels: small medium large X_large
If we want to drop a level after removing all occurrences of a group, we just have to reestablish our factor.
foo7 <- foo7[foo7 != "small"]
# Old levels remain in the levels attribute.
# We need to reinitialize the factor to get rid of them.
foo7
## [1] large medium medium X_large
## Levels: small medium large X_large
foo7 <- factor(foo7)
foo7
## [1] large medium medium X_large
## Levels: medium large X_large
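Base R also offers shortcuts for both steps: assigning to levels() extends the set of levels in place, and droplevels() removes levels that no longer occur. The following is a sketch on a hypothetical factor f that mirrors foo7:

```r
# f is a hypothetical factor mirroring foo7 above
f <- factor(c("small", "large", "small", "medium"),
            levels = c("small", "medium", "large"))

# Add a level by extending the levels attribute, then assign the new value
levels(f) <- c(levels(f), "X_large")
f[5] <- "X_large"

# Remove all "small" values, then drop the unused level
f <- droplevels(f[f != "small"])
levels(f)  # "medium" "large" "X_large"
```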
11.7 Relabeling Factor Levels
Factor levels are just character vectors, like what we saw for the column names in a data frame. This means that we only have to change the label in the character vector to change all occurrences of that label in the factor.
# The levels for the tissue factor
foo_df$tissue <- as.factor(foo_df$tissue)
foo_df$tissue
levels(foo_df$tissue)
The levels are stored as a character vector, so we can just change a level directly:
levels(foo_df$tissue)[5] <- "Tongue"
foo_df$tissue
Or we can specify it using a regular expression in the grep() function. "^B" means look for anything that begins with a capital B:
levels(foo_df$tissue)[grep("^B", levels(foo_df$tissue))] <- "Forebrain"
foo_df$tissue
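Relabeling by position is fragile, because the position depends on how the levels happen to be sorted. The sketch below relabels by name instead and also shows that giving two levels the same label merges them. The factor tissue is a hypothetical stand-in built from the values of foo3, and the label "Digestive" is invented for illustration.

```r
# tissue is a hypothetical factor built from the values in foo3
tissue <- factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"))

# Relabel by name rather than by position
levels(tissue)[levels(tissue) == "Muscle"] <- "Tongue"

# Assigning the same label to two levels merges them into one
levels(tissue)[levels(tissue) %in% c("Liver", "Intestine")] <- "Digestive"

tissue
levels(tissue)
```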