# Chapter 11 Element 5: Factor Variables

Learning Objectives

In this chapter you’ll learn:

- How R encodes categorical variables

Factor variables are explanatory, categorical variables. Knowing how to set up and use factors in R is an important aspect of data analysis.^{44}

## 11.1 The Structure of Factors

A factor is a class that can be assigned to any of the atomic vector types.^{45} When this occurs, the underlying vector is converted to an integer vector, and its unique values are saved as a character vector that describes the levels.

Recall the `tissue` column in `foo_df`:

```
# [1] "Liver" "Brain" "Testes" "Muscle" "Intestine"
# [6] "Heart"
```

`# [1] "character"`

`# [1] "character"`

This can be a major space-saving mechanism. Imagine saving many labels thousands of times over in a very large data frame: it takes up a lot of memory. With a factor, R just needs to save a single integer for each occurrence of a label and stores the label itself as a character only once.

In R, all factors are simple integer vectors that have been assigned the class `factor`. Each one has an associated character vector, its levels, that describes what each integer stands for.
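The structure described above can be inspected directly. The following is a minimal sketch: the vector `tissue` is rebuilt here from the labels shown earlier rather than taken from `foo_df`, and the name `tissue_fct` is made up for illustration.

```r
# Rebuild the tissue labels as a plain character vector (illustrative copy)
tissue <- c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart")

# Convert to a factor: by default the levels are the sorted unique values
tissue_fct <- factor(tissue)

typeof(tissue_fct)      # storage type of the underlying vector
class(tissue_fct)       # the class attribute that makes it a factor
levels(tissue_fct)      # character vector of labels, each stored once
as.integer(tissue_fct)  # integer codes that index into levels()
attributes(tissue_fct)  # levels and class are both stored as attributes
```

Indexing the levels with the integer codes, `levels(tissue_fct)[as.integer(tissue_fct)]`, gives back the original labels.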

## 11.2 Factors and Defining Linear Models

Let’s take a look at how factors can affect how our statistics are performed. Recall that for the `PlantGrowth` data set we used:

`# [1] "factor"`

`# [1] "ctrl" "trt1" "trt2"`

```
# tapply(PlantGrowth$weight, PlantGrowth$group, mean)
summary(lm(weight ~ group, data = PlantGrowth))
```

```
#
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.071 -0.418 -0.006 0.263 1.369
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.032 0.197 25.53 <2e-16 ***
# grouptrt1 -0.371 0.279 -1.33 0.194
# grouptrt2 0.494 0.279 1.77 0.088 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared: 0.264, Adjusted R-squared: 0.21
# F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
```

```
# Reorder the levels
PlantGrowth$group <- factor(PlantGrowth$group,
                            levels = c("trt2", "trt1", "ctrl"))
levels(PlantGrowth$group)
```

`# [1] "trt2" "trt1" "ctrl"`

```
#
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.071 -0.418 -0.006 0.263 1.369
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.526 0.197 28.03 <2e-16 ***
# grouptrt1 -0.865 0.279 -3.10 0.0045 **
# groupctrl -0.494 0.279 -1.77 0.0877 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared: 0.264, Adjusted R-squared: 0.21
# F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
```

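The coefficient tables are different because the order of the factor levels was different! The first level is used as the reference, so the intercept is the mean of that level and every other coefficient is a difference from it. The overall fit (residual standard error, R-squared and F-statistic) is the same in both models.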
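One way to check this is to fit the model with two different reference levels and compare the results. This sketch works on a copy of the data (`pg`, a name chosen here) and uses `relevel()` from base R, an alternative to rebuilding the factor with `factor(..., levels = ...)`.

```r
# Work on a copy of the original data set so PlantGrowth is left untouched
pg <- datasets::PlantGrowth

# Default level order: "ctrl" is first, so it is the reference
fit_ctrl <- lm(weight ~ group, data = pg)

# Move "trt2" to the front; the other levels keep their relative order
pg$group <- relevel(pg$group, ref = "trt2")
fit_trt2 <- lm(weight ~ group, data = pg)

coef(fit_ctrl)  # intercept is the mean of ctrl
coef(fit_trt2)  # intercept is the mean of trt2

# The overall F-statistic does not depend on the reference level
summary(fit_ctrl)$fstatistic
summary(fit_trt2)$fstatistic
```

Note that `relevel()` only moves the chosen level to the front, so the level order here is `trt2`, `ctrl`, `trt1`, which differs from the full reordering used above.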

Let’s consider what happens when `group` is a character and not a factor:

```
# tibble [30 × 2] (S3: tbl_df/tbl/data.frame)
# $ weight: num [1:30] 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
# $ group : chr [1:30] "ctrl" "ctrl" "ctrl" "ctrl" ...
```

```
#
# Call:
# lm(formula = weight ~ group, data = myDF)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.071 -0.418 -0.006 0.263 1.369
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.032 0.197 25.53 <2e-16 ***
# grouptrt1 -0.371 0.279 -1.33 0.194
# grouptrt2 0.494 0.279 1.77 0.088 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared: 0.264, Adjusted R-squared: 0.21
# F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
```

We get the same results as with the default level order! Under the hood, R has quietly converted our character vector to a factor. However, here we don’t have any say in the order of the levels: they are taken in sorted order, so `ctrl` becomes the reference. It is usually safer to establish our factors manually.
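The silent conversion can be made visible with a short comparison. This is a sketch under the assumption that the original `PlantGrowth` data is used; `pg_chr`, `fit_fct` and `fit_chr` are names invented for the example.

```r
# Character copy of the grouping variable, taken from the original data set
pg_chr <- datasets::PlantGrowth
pg_chr$group <- as.character(pg_chr$group)

fit_fct <- lm(weight ~ group, data = datasets::PlantGrowth)
fit_chr <- lm(weight ~ group, data = pg_chr)

all.equal(coef(fit_fct), coef(fit_chr))  # the coefficients agree
fit_chr$xlevels$group                    # levels lm() derived from the characters

# Defining the factor ourselves makes the reference level explicit
pg_chr$group <- factor(pg_chr$group, levels = c("trt2", "trt1", "ctrl"))
levels(pg_chr$group)
```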

## 11.3 Factors and Importing Data

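When data is imported into R, each variable is automatically assigned a class. In versions of R before 4.0.0, if a column contained characters, R assumed it was a categorical variable and classified it as a factor. Since R 4.0.0, `read.table()` and related functions default to `stringsAsFactors = FALSE` and keep such columns as characters. Automatic conversion is often *not* what you want, so it’s good practice to confirm your data frame’s structure before working on it.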

`# [1] "character"`

`# [1] "character"`

```
# If a column was imported as a factor, we can convert it:
protein_df$Description <-
  as.character(protein_df$Description)
class(protein_df$Description)
```

`# [1] "character"`

```
# But we can also prevent this when importing:
protein_df <- read.delim("data/Protein.txt",
                         stringsAsFactors = FALSE)
class(protein_df$Description)
```

`# [1] "character"`
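The effect of `stringsAsFactors` can be tried without a file on disk. The sketch below reads a small tab-delimited table from a string via the `text` argument. The table contents and the names `txt`, `as_chr` and `as_fct` are hypothetical and are not taken from `data/Protein.txt`.

```r
# A small tab-delimited table held in a string (made-up values)
txt <- "Description\tIntensity\nkinase\t1.5\nphosphatase\t2.0\nkinase\t3.1"

# Keep character columns as characters
as_chr <- read.delim(text = txt, stringsAsFactors = FALSE)

# Convert character columns to factors on import
as_fct <- read.delim(text = txt, stringsAsFactors = TRUE)

class(as_chr$Description)
class(as_fct$Description)

# str() confirms the structure of the whole data frame at a glance
str(as_fct)
```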

## 11.4 Converting from Factors to other Classes

Converting between factors and other classes can be troublesome. If the labels are numerical, you cannot simply call `as.numeric()`, because that returns the underlying integer codes rather than the labels. The labels must first be converted to characters and then to numbers:

`# [1] "integer"`

`# [1] "integer"`

`# [1] 21 22 23 24 25 26 27 28 29 30`

`# [1] "integer"`

`# [1] "factor"`

```
# [1] 21 22 23 24 25 26 27 28 29 30
# Levels: 21 22 23 24 25 26 27 28 29 30
```

`# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"`

`# [1] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"`

`# [1] "integer"`

`# [1] "integer"`

```
# This takes the underlying integer codes, not the labels attached to them
foo5 # The values are gone!! Bad Move!!
```

`# [1] 1 2 3 4 5 6 7 8 9 10`

```
# Do this instead:
foo5 <- as.factor(21:30) # re-establish factor
foo5 <- as.numeric(as.character(foo5))
typeof(foo5)
```

`# [1] "double"`

`# [1] "numeric"`

`# [1] 21 22 23 24 25 26 27 28 29 30`
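The three conversions can be compared side by side. This is a small sketch using a throwaway factor `f` (a name chosen here). The last form, `as.numeric(levels(f))[f]`, is the variant suggested in the R help page for `factor`: it converts each level once and then indexes by the codes.

```r
f <- factor(21:30)

# Direct conversion returns the integer codes 1 to 10
codes <- as.numeric(f)

# Going through character recovers the labels as numbers
via_chr <- as.numeric(as.character(f))

# Convert the levels once, then index them with the factor's codes
via_lev <- as.numeric(levels(f))[f]

codes
via_chr
via_lev
```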

## 11.5 Ordered Factors

Recall that there are two types of categorical variables, nominal and ordinal. So far we have only been dealing with nominal variables, but we can also add order to a factor, as in:

```
# [1] small large small medium small medium
# Levels: large medium small
```

```
# [1] <NA> <NA> <NA> <NA> <NA> <NA>
# Levels: low middle high
```

```
# [1] small large small medium small medium
# Levels: small medium large
```

```
# Define the proper order. factor(x, ordered = TRUE) is also possible.
ordered(foo6, levels = c("small", "medium", "large"))
```

```
# [1] small large small medium small medium
# Levels: small < medium < large
```
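The outputs above can be reproduced with a sketch along the following lines. The vector `sizes` stands in for `foo6`, whose definition is assumed here from the printed values, and the other names are illustrative.

```r
sizes <- c("small", "large", "small", "medium", "small", "medium")

# Default: levels are sorted alphabetically, which is not the natural order
sizes_fct <- factor(sizes)
levels(sizes_fct)

# Levels that match none of the values turn every element into NA
sizes_bad <- factor(sizes, levels = c("low", "middle", "high"))

# Explicit levels plus ordered = TRUE give an ordinal variable
sizes_ord <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

# Ordered factors support comparisons and range summaries
sizes_ord < "large"
max(sizes_ord)
```

Unordered factors do not support `<` or `max()`, which is one practical reason to declare the order when the variable is ordinal.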

## 11.6 Adding & Removing Factor Levels

To add a new value to a factor, we first have to add the new level. This is because the levels already define all the possible categories for the categorical variable in R; a value that is not among them is stored as `NA`.

```
# create a 6-element long factor
foo7 <- factor(foo6,
               levels = c("small", "medium", "large"))
levels(foo7)
```

`# [1] "small" "medium" "large"`

```
# [1] small large small medium small medium <NA>
# Levels: small medium large
```

`# [1] "small" "medium" "large"`

```
# [1] small large small medium small medium X_large
# Levels: small medium large X_large
```
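A sketch of the two attempts shown in the output above: assigning a value that is not yet a level, and then extending the levels first. The names `sizes7` and `attempt` are hypothetical stand-ins for the factor used in this section.

```r
sizes7 <- factor(c("small", "large", "small", "medium", "small", "medium"),
                 levels = c("small", "medium", "large"))

# Assigning an unknown label yields NA (R also issues a warning,
# suppressed here so the sketch runs quietly)
attempt <- sizes7
suppressWarnings(attempt[7] <- "X_large")
attempt

# Extend the levels first, then the assignment works as intended
levels(sizes7) <- c(levels(sizes7), "X_large")
sizes7[7] <- "X_large"
sizes7
```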

If we want to drop a level after removing all occurrences of a group, we just have to reestablish our factor.

```
foo7 <- foo7[foo7 != "small"]
# Old levels remain in the character vector.
# We need to reinitialize the factor to get rid of them.
foo7
```

```
# [1] large medium medium X_large
# Levels: small medium large X_large
```

```
# [1] large medium medium X_large
# Levels: medium large X_large
```
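Dropping the unused level can be sketched in two equivalent ways: calling `factor()` again, or using `droplevels()` from base R. The factor `sizes` below is an illustrative stand-in built from the values printed above.

```r
sizes <- factor(c("small", "large", "small", "medium", "small", "medium", "X_large"),
                levels = c("small", "medium", "large", "X_large"))

# Subsetting removes the values but keeps every level
kept <- sizes[sizes != "small"]
levels(kept)
table(kept)  # "small" is still listed, with a count of zero

# Either of these rebuilds the factor with only the levels in use
levels(factor(kept))
levels(droplevels(kept))
```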

## 11.7 Relabeling Factor Levels

The levels of a factor are just a character vector, like the column names of a data frame. This means that we only have to change a label once in that character vector to change all occurrences of that label in the factor.

```
# The levels for the tissue factor
foo_df$tissue <- as.factor(foo_df$tissue)
foo_df$tissue
levels(foo_df$tissue)
```

The levels are stored as a character vector, so we can just change a level directly:
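As a minimal sketch of a direct change, the example below rebuilds the tissue factor as a standalone vector and renames one level. The replacement label `"Cardiac"` and the name `tissue_fct` are made up for illustration.

```r
tissue_fct <- factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"))
levels(tissue_fct)

# Replace one entry of the levels vector; every matching element follows
levels(tissue_fct)[levels(tissue_fct) == "Heart"] <- "Cardiac"
tissue_fct
```

Selecting the level by its name rather than by its position avoids depending on the alphabetical order of the levels.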

Or we can specify the level using a regular expression in the `grep()` function. `"^B"` means *look for anything that begins with a capital B*:
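A sketch of the regular-expression approach, again on a standalone copy of the tissue labels. The new label `"Cerebrum"` and the names `tissue_fct` and `idx` are assumptions made for the example.

```r
tissue_fct <- factor(c("Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"))

# grep() returns the positions of the levels that match the pattern
idx <- grep("^B", levels(tissue_fct))
levels(tissue_fct)[idx]

# Assign a new label at those positions
levels(tissue_fct)[idx] <- "Cerebrum"
tissue_fct
```

If the pattern matches more than one level, all of them receive the same new label and are merged into a single level, so it is worth checking `levels(tissue_fct)[idx]` before assigning.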