Chapter 12 Element 6: Factor Variables

Learning Objectives

In this chapter you’ll learn:

  • How R encodes categorical variables

Factor variables are explanatory, categorical variables (see note 1 at the end of this chapter). Knowing how to set up and use factors in R is an important aspect of data analysis.

12.1 The Structure of Factors

A factor is a class that can be assigned to any of the atomic vector types (see note 2 at the end of this chapter). When this occurs, the underlying vector is converted to integer codes, and its unique values are saved as a character vector that describes the levels.

Recall in foo.df:

# [1] "Liver"     "Brain"     "Testes"    "Muscle"    "Intestine"
# [6] "Heart"
# [1] "character"
# [1] "character"
# [1] Liver     Brain     Testes    Muscle    Intestine Heart    
# Levels: Brain Heart Intestine Liver Muscle Testes
# [1] "integer"
# [1] "factor"
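The code that produced this output is not shown, so here is a minimal sketch of the same idea. It assumes a standalone character vector named tissue rather than the actual column of foo.df, whose name is not given here; tissue and tissue_f are hypothetical names.

```r
# Hypothetical stand-in for the tissue column of foo.df
tissue <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")

typeof(tissue)   # storage type of the plain vector
class(tissue)

# Convert to a factor: the unique values become the levels,
# and each element is stored as an integer code into those levels
tissue_f <- factor(tissue)
tissue_f

typeof(tissue_f) # the underlying storage is now integer
class(tissue_f)

levels(tissue_f)     # character vector of labels
as.integer(tissue_f) # integer codes pointing into levels(tissue_f)
```

The last two calls show the two halves of a factor separately: the labels, and the codes that index them.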

This can be a major space-saving mechanism. Imagine saving many labels thousands of times over in a very large data frame: it takes up a lot of memory. R just needs to save a single integer for each label and stores the label as a character only once.

In R, every factor is simply an integer vector that has been assigned the class factor. Attached to it is a character vector, the levels attribute, which describes the levels.
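One rough way to check the size difference is to compare object.size() for the same labels stored both ways. This is only a sketch: the vector names are hypothetical, and the exact numbers depend on the R version and platform. R also caches repeated strings internally, so on a typical 64-bit build the saving comes mostly from storing a 4-byte integer per element instead of an 8-byte pointer.

```r
# Hypothetical example data: three labels repeated a million times
labels <- c("Liver", "Brain", "Heart")
set.seed(1)
as_chr <- sample(labels, 1e6, replace = TRUE)
as_fct <- factor(as_chr)

# Compare the reported sizes of the two representations
print(object.size(as_chr), units = "MB")
print(object.size(as_fct), units = "MB")
```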

12.2 Factors and Defining Linear Models

Let’s take a look at how factors can affect how our statistics are performed. Recall that for the PlantGrowth data set we used:

# [1] "factor"
# ctrl trt1 trt2 
#  5.0  4.7  5.5
# 
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.032      0.197   25.53   <2e-16 ***
# grouptrt1     -0.371      0.279   -1.33    0.194    
# grouptrt2      0.494      0.279    1.77    0.088 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159
# [1] "ctrl" "trt1" "trt2"
# 
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.526      0.197   28.03   <2e-16 ***
# grouptrt1     -0.865      0.279   -3.10   0.0045 ** 
# groupctrl     -0.494      0.279   -1.77   0.0877 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159

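The coefficient tables are different because the order of the factor levels was different! The first level is used as the reference, so the intercept is the mean of that group and every other coefficient is a difference from it. The overall fit (residual standard error, R-squared and F-statistic) is unchanged.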
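The two model summaries can be reproduced along the following lines. This sketch assumes the levels were reordered with factor(); the Call in the output suggests PlantGrowth was modified in place, whereas here a copy with the hypothetical name pg is used so the built-in data set stays untouched.

```r
# PlantGrowth ships with base R (datasets package)
class(PlantGrowth$group)
levels(PlantGrowth$group)   # ctrl is first, so it is the reference

# Group means, for comparison with the coefficients
tapply(PlantGrowth$weight, PlantGrowth$group, mean)

fit_ctrl <- lm(weight ~ group, data = PlantGrowth)
summary(fit_ctrl)

# Reorder the levels so that trt2 becomes the reference level
pg <- PlantGrowth
pg$group <- factor(pg$group, levels = c("trt2", "trt1", "ctrl"))

fit_trt2 <- lm(weight ~ group, data = pg)
summary(fit_trt2)
```

If only the reference level needs to change, relevel(pg$group, ref = "trt2") is an alternative that moves one level to the front and leaves the rest in place.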

Let’s consider what happens when group is a character and not a factor:

# Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  2 variables:
#  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
#  $ group : chr  "ctrl" "ctrl" "ctrl" "ctrl" ...
# ctrl trt1 trt2 
#  5.0  4.7  5.5
# 
# Call:
# lm(formula = weight ~ group, data = myDF)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.032      0.197   25.53   <2e-16 ***
# grouptrt1     -0.371      0.279   -1.33    0.194    
# grouptrt2      0.494      0.279    1.77    0.088 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159

We get the same results! Under the hood, R has quietly converted our character vector to a factor. However, here we have no say in the order of the levels: they are sorted alphabetically, and whichever level comes first becomes the reference. It is generally safer to establish our factors manually.
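A sketch of this comparison is given below. The output above shows myDF as a tibble; here a plain data frame is assumed so that no extra packages are needed.

```r
# Copy PlantGrowth and turn the grouping variable into plain character
myDF <- PlantGrowth
myDF$group <- as.character(myDF$group)
str(myDF)

# lm() converts the character column to a factor internally
fit_chr <- lm(weight ~ group, data = myDF)
fit_fct <- lm(weight ~ group, data = PlantGrowth)

# Same coefficients as with the original factor
all.equal(coef(fit_chr), coef(fit_fct))

# The implicit conversion sorts the levels, as factor() does by default
levels(factor(myDF$group))
```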

12.3 Factors and Importing Data

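When data is imported into R, each variable is automatically assigned a class. In versions of R before 4.0.0, read.table() and read.csv() treated any column containing characters as a categorical variable and classified it as a factor by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, which leaves such columns as character. Either default may not be what you want, so it’s good practice to confirm your data frame’s structure before working on it.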

# [1] "integer"
# [1] "factor"
# [1] "character"
# [1] "character"
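The effect of the import setting can be seen without a file on disk by passing the data as text. The column names and values below are invented for illustration.

```r
# Hypothetical CSV content, read from a string instead of a file
csv_text <- "id,tissue\n1,Liver\n2,Brain\n3,Heart"

# Character columns converted to factors on import
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
typeof(as_fct$tissue)
class(as_fct$tissue)

# Character columns left as they are
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
typeof(as_chr$tissue)
class(as_chr$tissue)

# Always worth a look before starting an analysis
str(as_chr)
```

Setting stringsAsFactors explicitly makes the script behave the same way regardless of which R version runs it.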

12.4 Converting from Factors to other Classes

Converting between factors and other classes can be troublesome. If the labels are numerical, you cannot simply call as.numeric(), because that returns the underlying integer codes rather than the labels. The factor must first be converted to character and then to numeric:

# [1] "integer"
# [1] "integer"
#  [1] 21 22 23 24 25 26 27 28 29 30
# [1] "integer"
# [1] "factor"
#  [1] 21 22 23 24 25 26 27 28 29 30
# Levels: 21 22 23 24 25 26 27 28 29 30
#  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
#  [1] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
# [1] "integer"
# [1] "integer"
#  [1]  1  2  3  4  5  6  7  8  9 10
# [1] "double"
# [1] "numeric"
#  [1] 21 22 23 24 25 26 27 28 29 30
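The following sketch walks through the same pitfall with a hypothetical vector x holding the values 21 to 30. It is not necessarily the exact code behind the output above.

```r
x <- 21:30
typeof(x)
class(x)

# As a factor, the labels are "21" ... "30" but the codes are 1 ... 10
x_f <- factor(x)
x_f

as.integer(x_f)    # the codes, not the original values
as.character(x_f)  # the labels, as text

# Go via character to recover the original numbers
x_back <- as.numeric(as.character(x_f))
typeof(x_back)
x_back

# Equivalent idiom: convert the levels once, then index by the codes
as.numeric(levels(x_f))[x_f]
```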

12.5 Ordered Factors

Recall that there are two types of categorical variables: nominal and ordinal. So far we have only been dealing with nominal variables, but we can also add order to a factor, as in:

# [1] small  large  small  medium small  medium
# Levels: large medium small
# [1] <NA> <NA> <NA> <NA> <NA> <NA>
# Levels: low middle high
# [1] small  large  small  medium small  medium
# Levels: small medium large
# [1] small  large  small  medium small  medium
# Levels: small < medium < large
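A sketch of the four cases shown above, using a hypothetical vector named size:

```r
size <- c("small", "large", "small", "medium", "small", "medium")

# Default: levels are sorted alphabetically
factor(size)

# Levels that do not match the data turn every value into NA
factor(size, levels = c("low", "middle", "high"))

# Levels given explicitly, in a meaningful order (still nominal)
size_f <- factor(size, levels = c("small", "medium", "large"))
size_f

# An ordered factor: the levels now have a rank
size_o <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
size_o

# Comparisons are only meaningful for ordered factors
size_o < "large"
```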

12.6 Adding & Removing Factor Levels

To add new information to a factor, we first have to add the new level. This is because we have already defined all the possible categories for the categorical variable in R.

# [1] "small"  "medium" "large"
# [1] small  large  small  medium small  medium <NA>  
# Levels: small medium large
# [1] "small"  "medium" "large"
# [1] small   large   small   medium  small   medium  X_large
# Levels: small medium large X_large
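The sequence above can be sketched as follows, again with hypothetical object names. Assigning a value that is not an existing level produces NA together with a warning, which is suppressed here to keep the output clean.

```r
size_f <- factor(c("small", "large", "small", "medium", "small", "medium"),
                 levels = c("small", "medium", "large"))
levels(size_f)

# "X_large" is not a known level, so the new element becomes NA
suppressWarnings(size_f[7] <- "X_large")
size_f

# Add the level first, then assign the value
levels(size_f) <- c(levels(size_f), "X_large")
size_f[7] <- "X_large"
size_f
```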

If we want to drop a level after removing all occurrences of a group, we just have to reestablish our factor by calling factor() on it again.

# [1] large   medium  medium  X_large
# Levels: small medium large X_large
# [1] large   medium  medium  X_large
# Levels: medium large X_large
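A sketch of dropping an unused level, assuming the "small" observations are removed by subsetting. The object names are hypothetical.

```r
size_f <- factor(c("small", "large", "small", "medium", "small", "medium", "X_large"),
                 levels = c("small", "medium", "large", "X_large"))

# Remove every "small" observation; the level itself is still defined
no_small <- size_f[size_f != "small"]
no_small

# Re-create the factor to keep only the levels that are in use
factor(no_small)

# droplevels() gives the same result and states the intent more directly
droplevels(no_small)
```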

12.7 Relabelling Factor Levels

The levels of a factor are stored as a character vector, just like the column names of a data frame. This means that we only have to change the label in that character vector to change all occurrences of the label in the factor.

# [1] Liver     Brain     Testes    Muscle    Intestine Heart    
# Levels: Brain Heart Intestine Liver Muscle Testes
# [1] "Brain"     "Heart"     "Intestine" "Liver"     "Muscle"   
# [6] "Testes"
# [1] Liver     Brain     Testes    Tongue    Intestine Heart    
# Levels: Brain Heart Intestine Liver Tongue Testes
# [1] Liver     Forebrain Testes    Tongue    Intestine Heart    
# Levels: Forebrain Heart Intestine Liver Tongue Testes
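The relabelling shown above can be sketched in two ways: by position in the levels vector, or by matching the old label. The factor is rebuilt here under the hypothetical name tissue_f.

```r
tissue_f <- factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"))
tissue_f
levels(tissue_f)

# Relabel by position: the fifth level is "Muscle"
levels(tissue_f)[5] <- "Tongue"
tissue_f

# Relabel by name, which does not rely on knowing the position
levels(tissue_f)[levels(tissue_f) == "Brain"] <- "Forebrain"
tissue_f
```

Matching by name is usually the more robust choice, since the position of a level changes whenever levels are added, dropped, or reordered.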

  1. Explanatory variables are used when comparing and summarizing measured values of different groups, such as control versus treatment. The measured value is a response variable to the explanatory variable.

  2. See ?? for a review of the different atomic vector types.