Chapter 12 Element 6: Factor Variables

Learning Objectives

In this chapter you’ll learn:

How R encodes categorical variables

Factor variables are explanatory, categorical variables. Knowing how to set up and use factors in R is an important part of data analysis.⁴⁴

12.1 The Structure of Factors

A factor is a class that can be assigned to any of the atomic vector types.⁴⁵ When this occurs, the underlying vector is converted to an integer vector, and its unique values are saved as a character vector that describes the levels.

Recall in foo.df:

# Our original character vector
foo3

# [1] "Liver"     "Brain"     "Testes"    "Muscle"    "Intestine"
# [6] "Heart"

typeof(foo3)

# [1] "character"

class(foo3)

# [1] "character"

# We used foo3 to make foo.df$tissue
foo.df$tissue

# [1] Liver     Brain     Testes    Muscle    Intestine Heart    
# Levels: Brain Heart Intestine Liver Muscle Testes

typeof(foo.df$tissue)

# [1] "integer"

# Which was converted to a factor:
class(foo.df$tissue)

# [1] "factor"

This can be a major space-saving mechanism. Imagine saving the same labels thousands of times over in a very large data frame: stored as text, they take up a lot of memory. R just needs to save a single integer for each observation and stores each label as a character string only once.

In R, factors are simply integer vectors assigned a class of factor, together with an associated character vector that describes the levels.
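As a minimal sketch of this structure, the snippet below rebuilds the tissue labels by hand from the two parts of a factor. The name tissue_f is hypothetical; the values are the ones shown for foo3 above.

```r
# Hypothetical stand-alone copy of the tissue labels shown above
tissue_f <- factor(c("Liver", "Brain", "Testes",
                     "Muscle", "Intestine", "Heart"))

# The integer codes: one small integer per observation
codes <- as.integer(tissue_f)
codes
# [1] 4 1 6 5 3 2

# The levels: each label stored once, sorted alphabetically by default
lev <- levels(tissue_f)

# Indexing the levels by the codes recovers the original labels
rebuilt <- lev[codes]
identical(rebuilt, as.character(tissue_f))
# [1] TRUE
```

Printing a factor performs this lookup for us, which is why it looks like a character vector even though typeof() reports "integer".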

12.2 Factors and defining Linear Models

Let’s take a look at how factors can affect how our statistics are performed. Recall that for the PlantGrowth data set we used:

class(PlantGrowth$group)

# [1] "factor"

tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# ctrl trt1 trt2 
#  5.0  4.7  5.5

summary(lm(weight ~ group, data = PlantGrowth))

# 
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.032      0.197   25.53   <2e-16 ***
# grouptrt1     -0.371      0.279   -1.33    0.194    
# grouptrt2      0.494      0.279    1.77    0.088 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159

levels(PlantGrowth$group)

# [1] "ctrl" "trt1" "trt2"

# Reorder the levels
PlantGrowth$group <- factor(PlantGrowth$group,
                            levels = c("trt2", "trt1", "ctrl"))
summary(lm(weight ~ group, data = PlantGrowth))

# 
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.526      0.197   28.03   <2e-16 ***
# grouptrt1     -0.865      0.279   -3.10   0.0045 ** 
# groupctrl     -0.494      0.279   -1.77   0.0877 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159

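The coefficient tables are different because the order of the factor levels was different! The first level is used as the reference, so the intercept is the mean of that level and every other coefficient is a difference from it. The overall fit (residuals, R-squared and F-statistic) is unchanged.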
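If only the reference level needs to change, base R's relevel() is one alternative to retyping every level. The sketch below works on a copy (the name pg is hypothetical) and checks that the intercept matches the mean of the new reference group.

```r
# Work on a copy so the built-in data set is left untouched
pg <- datasets::PlantGrowth

# Move "trt2" to the front; the remaining levels keep their order
pg$group <- relevel(pg$group, ref = "trt2")
levels(pg$group)
# [1] "trt2" "ctrl" "trt1"

fit <- lm(weight ~ group, data = pg)

# The intercept is now the mean of the reference group, trt2
trt2_mean <- mean(pg$weight[pg$group == "trt2"])
all.equal(unname(coef(fit)["(Intercept)"]), trt2_mean)
# [1] TRUE
```

Note that relevel() only moves one level to the front, so the resulting order (trt2, ctrl, trt1) differs from the fully specified order used above.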

Let’s consider what happens when group is a character and not a factor:

myDF <- PlantGrowth
myDF$group <- as.character(myDF$group)
str(myDF)

# Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  2 variables:
#  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
#  $ group : chr  "ctrl" "ctrl" "ctrl" "ctrl" ...

tapply(myDF$weight, myDF$group, mean)

# ctrl trt1 trt2 
#  5.0  4.7  5.5

summary(lm(weight ~ group, data = myDF))

# 
# Call:
# lm(formula = weight ~ group, data = myDF)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -1.071 -0.418 -0.006  0.263  1.369 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    5.032      0.197   25.53   <2e-16 ***
# grouptrt1     -0.371      0.279   -1.33    0.194    
# grouptrt2      0.494      0.279    1.77    0.088 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared:  0.264,   Adjusted R-squared:  0.21 
# F-statistic: 4.85 on 2 and 27 DF,  p-value: 0.0159

We get the same results! Under the hood, R has quietly converted our character vector to a factor. However, here we have no say over the order of the levels: R sorts them alphabetically, which just happens to put ctrl first. It is always advantageous to establish our factors manually.
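The difference between the default and an explicit level order can be sketched with a small made-up vector (grp and the derived names are hypothetical and not part of PlantGrowth):

```r
grp <- c("trt2", "ctrl", "trt1", "ctrl")  # made-up values for illustration

# Default: levels are sorted, regardless of the order of appearance
grp_default <- factor(grp)
levels(grp_default)
# [1] "ctrl" "trt1" "trt2"

# Explicit: levels are used in exactly the order given
grp_manual <- factor(grp, levels = c("trt2", "trt1", "ctrl"))
levels(grp_manual)
# [1] "trt2" "trt1" "ctrl"
```

Both factors hold the same values; only the level order, and therefore the reference level a model would use, differs.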

12.3 Factors and Importing Data

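When data is imported into R, each variable is automatically assigned a class. In versions of R before 4.0.0, if a column contains characters, R will by default think it’s a categorical variable and classify it as a factor. Since R 4.0.0 the default is stringsAsFactors = FALSE, so such columns stay as characters. Either way, the automatic choice is often not what you want, so it’s good practice to confirm your data frame’s structure before working on it.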

# Description and Uniprot are wrongly labelled as factors
typeof(protein.df$Description)

# [1] "integer"

class(protein.df$Description)

# [1] "factor"

# We can reformat them:
protein.df$Description <-
  as.character(protein.df$Description)
class(protein.df$Description)

# [1] "character"

# But we can also prevent this when importing:
protein.df <- read.delim("data/Protein.txt",
                         stringsAsFactors = FALSE)
class(protein.df$Description)

# [1] "character"
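To see the effect of stringsAsFactors without needing the Protein.txt file, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back both ways. The file contents and column values are made up purely for illustration.

```r
# Write a small made-up tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("Uniprot\tDescription\tIntensity",
             "ID_A\tExample protein A\t1.5",
             "ID_B\tExample protein B\t2.5"),
           tmp)

# Being explicit gives the same behaviour in every version of R
as_chr <- read.delim(tmp, stringsAsFactors = FALSE)
as_fct <- read.delim(tmp, stringsAsFactors = TRUE)

sapply(as_chr, class)
#     Uniprot Description   Intensity 
# "character" "character"   "numeric" 

sapply(as_fct, class)
#     Uniprot Description   Intensity 
#    "factor"    "factor"   "numeric" 

unlink(tmp)  # remove the temporary file
```

Numeric columns are unaffected by the argument; only text columns change class.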

12.4 Converting from Factors to other Classes

Converting between factor and other classes can be troublesome. If the labels are numerical, you cannot simply call as.numeric(). They must first be converted to characters and then to numbers:

foo5 <- 21:30
typeof(foo5)

# [1] "integer"

class(foo5) # An integer vector

# [1] "integer"

foo5

#  [1] 21 22 23 24 25 26 27 28 29 30

foo5 <- as.factor(foo5)
typeof(foo5)

# [1] "integer"

class(foo5) # An integer vector of factor class

# [1] "factor"

foo5

#  [1] 21 22 23 24 25 26 27 28 29 30
# Levels: 21 22 23 24 25 26 27 28 29 30

# What is actually stored: the integer codes (like an ID).
as.integer(foo5)

#  [1]  1  2  3  4  5  6  7  8  9 10

# The names of the levels assigned to each code, in order
levels(foo5)

#  [1] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"

foo5 <- as.integer(foo5) # or also use as.numeric()
typeof(foo5)

# [1] "integer"

class(foo5) # An integer vector, but...

# [1] "integer"

# This takes the integer codes, not the levels assigned to those codes
foo5 # The values are gone!! Bad Move!!

#  [1]  1  2  3  4  5  6  7  8  9 10

# Do this instead:
foo5 <- as.factor(21:30) # re-establish factor
foo5 <- as.numeric(as.character(foo5))
typeof(foo5)

# [1] "double"

class(foo5) # A numeric vector with the proper values

# [1] "numeric"

foo5

#  [1] 21 22 23 24 25 26 27 28 29 30
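The pitfall is easier to spot when the values are not consecutive. The sketch below uses a made-up factor (f is a hypothetical name) and compares the wrong conversion with two correct ones; the second, as.numeric(levels(f))[f], is the form suggested in the ?factor help page.

```r
f <- factor(c("10", "2.5", "10", "7"))  # made-up numeric labels

# Wrong: returns the integer codes, not the values
wrong <- as.integer(f)

# Route used above: factor -> character -> numeric
via_chr <- as.numeric(as.character(f))
via_chr
# [1] 10.0  2.5 10.0  7.0

# Alternative: convert the levels once, then index by the factor
via_lev <- as.numeric(levels(f))[f]
identical(via_chr, via_lev)
# [1] TRUE
```

The alternative converts each level only once rather than each element, which can matter for long vectors with few levels.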

12.5 Ordered Factors

Recall that there are two types of categorical variables, nominal and ordinal. So far we have only been dealing with nominal variables, whose levels have no inherent order, but we can also add order to a factor, as in:

foo6 <- c("small", "large", "small", "medium", "small", "medium")
factor(foo6) # improper order

# [1] small  large  small  medium small  medium
# Levels: large medium small

# these levels don't match any value in foo6, so every element becomes NA
factor(foo6, levels = c("low", "middle", "high"))

# [1] <NA> <NA> <NA> <NA> <NA> <NA>
# Levels: low middle high

# proper order
factor(foo6, levels = c("small", "medium", "large"))

# [1] small  large  small  medium small  medium
# Levels: small medium large

# defined proper order. factor(x, ordered=TRUE) is also possible.
ordered(foo6, levels = c("small", "medium", "large"))

# [1] small  large  small  medium small  medium
# Levels: small < medium < large
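One practical consequence of declaring the order is that comparisons and functions such as max() become meaningful. A minimal sketch, using a stand-alone copy of the sizes above (the names sizes and size_ord are hypothetical):

```r
sizes <- c("small", "large", "small", "medium", "small", "medium")
size_ord <- factor(sizes,
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
smaller <- size_ord < "large"
smaller
# [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

# The largest category present
largest <- max(size_ord)
as.character(largest)
# [1] "large"
```

With an unordered factor, the same comparison returns NA with a warning, because R has no basis for ranking the levels.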

12.6 Adding & Removing Factor Levels

To add new information to a factor, we first have to add the new level. This is because the levels already define all the possible categories for the categorical variable in R; assigning a value that is not among them produces NA, with a warning.

# create a 6-element long factor
foo7 <- factor(foo6,
               levels = c("small", "medium", "large"))
levels(foo7)

# [1] "small"  "medium" "large"

foo7[7] <- "extra large"
foo7

# [1] small  large  small  medium small  medium <NA>  
# Levels: small medium large

levels(foo7)

# [1] "small"  "medium" "large"

foo7 <- factor(foo7, levels = c(levels(foo7), "X_large"))
foo7[7] <- "X_large"
foo7

# [1] small   large   small   medium  small   medium  X_large
# Levels: small medium large X_large
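An alternative sketch for adding a level is to extend the levels in place with the levels<- replacement function, which avoids rebuilding the factor. The name size_f is hypothetical.

```r
size_f <- factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))

# Append the new level to the existing ones
levels(size_f) <- c(levels(size_f), "X_large")

# Now the assignment succeeds instead of producing NA
size_f[4] <- "X_large"
as.character(size_f)
# [1] "small"   "large"   "medium"  "X_large"

levels(size_f)
# [1] "small"   "medium"  "large"   "X_large"
```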

If we want to drop a level after removing all occurrences of a group, we just have to reestablish our factor.

foo7 <- foo7[foo7 != "small"]
# Old levels remain in the character vector.
# We need to reinitialize the factor to get rid of them.
foo7

# [1] large   medium  medium  X_large
# Levels: small medium large X_large

foo7 <- factor(foo7)
foo7

# [1] large   medium  medium  X_large
# Levels: medium large X_large
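Base R also provides droplevels(), which removes unused levels while keeping the remaining ones in their existing order. A minimal sketch with hypothetical names size_f and kept:

```r
size_f <- factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large"))

# Subsetting removes the values but keeps the now-empty level
kept <- size_f[size_f != "small"]
levels(kept)
# [1] "small"  "medium" "large"

# droplevels() discards levels that no longer occur
kept <- droplevels(kept)
levels(kept)
# [1] "medium" "large"
```

droplevels() also accepts a whole data frame, in which case it is applied to every factor column.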

12.7 Relabelling Factor Levels

The levels of a factor are just a character vector, like the column names of a data frame. This means that we only have to change a label in that character vector to change all occurrences of that label in the factor.

# The levels for the tissue factor
foo.df$tissue

# [1] Liver     Brain     Testes    Muscle    Intestine Heart    
# Levels: Brain Heart Intestine Liver Muscle Testes

levels(foo.df$tissue)

# [1] "Brain"     "Heart"     "Intestine" "Liver"     "Muscle"   
# [6] "Testes"

# We can change a level directly by position (here the fifth, "Muscle")
levels(foo.df$tissue)[5] <- "Tongue"
foo.df$tissue

# [1] Liver     Brain     Testes    Tongue    Intestine Heart    
# Levels: Brain Heart Intestine Liver Tongue Testes

# Or specify using regular expressions
levels(foo.df$tissue)[grep("^B",
    levels(foo.df$tissue))] <- "Forebrain"
foo.df$tissue

# [1] Liver     Forebrain Testes    Tongue    Intestine Heart    
# Levels: Forebrain Heart Intestine Liver Tongue Testes
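Positions are easy to get wrong, so it can be safer to select the level by name. The sketch below also shows that giving two levels the same label merges them. The name tissue_f and the label "Muscular" are hypothetical, chosen only for illustration.

```r
tissue_f <- factor(c("Liver", "Brain", "Testes",
                     "Muscle", "Intestine", "Heart"))

# Select the level by name rather than by position
levels(tissue_f)[levels(tissue_f) == "Muscle"] <- "Tongue"

# Assigning one label to several levels merges them into a single level
levels(tissue_f)[levels(tissue_f) %in% c("Heart", "Tongue")] <- "Muscular"

levels(tissue_f)
# [1] "Brain"     "Muscular"  "Intestine" "Liver"     "Testes"

as.character(tissue_f)
# [1] "Liver"     "Brain"     "Testes"    "Muscular"  "Intestine" "Muscular"
```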

⁴⁴ Explanatory variables are used when comparing and summarizing measured values of different groups, such as control versus treatment. The measured value is a response variable to the explanatory variable.
⁴⁵ See ?? for a review of the different atomic vector types.