Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance. They have become a very popular “out-of-the-box” or “off-the-shelf” learning algorithm that enjoys good predictive performance with relatively little hyperparameter tuning. Many modern implementations of random forests exist; however, Leo Breiman’s algorithm (Breiman 2001) has largely become the authoritative procedure. This module will cover the fundamentals of random forests.

Learning objectives

By the end of this module you will know:

  • How to implement a random forest model and which hyperparameters are commonly tuned in these algorithms.
  • Multiple strategies for performing a grid search.
  • How to identify influential features and their effects on the response variable.

Prerequisites

# Helper packages
import numpy as np
import pandas as pd
from plotnine import *
from scipy.stats import uniform
from scipy.stats import randint

# Modeling packages
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.inspection import partial_dependence
from sklearn.pipeline import Pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train["Sale_Price"]

# Helper packages
library(tidyverse)   # for data wrangling & plotting

# Modeling packages
library(tidymodels) 

# Model interpretability packages
library(vip)         # for variable importance
library(pdp)         # for variable relationships
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

Extending bagging

Random forests are built using the same fundamental principles as decision trees and bagging. Bagging trees introduces a random component into the tree building process by building many trees on bootstrapped copies of the training data. Bagging then aggregates the predictions across all the trees; this aggregation reduces the variance of the overall procedure and results in improved predictive performance. However, as we saw in the last module, simply bagging trees results in tree correlation that limits the effect of variance reduction.

Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process.1 More specifically, while growing a decision tree during the bagging process, random forests perform split-variable randomization where each time a split is to be performed, the search for the split variable is limited to a random subset of \(m_{try}\) of the original \(p\) features. Typical default values are \(m_{try} = \frac{p}{3}\) (regression) and \(m_{try} = \sqrt{p}\) (classification) but this should be considered a tuning parameter.

The basic algorithm for a regression or classification random forest can be generalized as follows:

1.  Given a training data set
2.  Select number of trees to build (n_trees)
3.  for i = 1 to n_trees do
4.  |  Generate a bootstrap sample of the original data
5.  |  Grow a regression/classification tree to the bootstrapped data
6.  |  for each split do
7.  |  | Select m_try variables at random from all p variables
8.  |  | Pick the best variable/split-point among the m_try
9.  |  | Split the node into two child nodes
10. |  end
11. | Use typical tree model stopping criteria to determine when a 
    | tree is complete (but do not prune)
12. end
13. Output ensemble of trees 

When \(m_{try} = p\), the algorithm is equivalent to bagging decision trees.

Since the algorithm randomly selects a bootstrap sample to train on and a random sample of features to use at each split, a more diverse set of trees is produced, which tends to lessen tree correlation beyond bagged trees and often dramatically increases predictive power.
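
In scikit-learn, \(m_{try}\) is exposed through RandomForestRegressor's max_features argument. A minimal sketch of the two extremes (the object names are illustrative):

# searching all p features at each split makes the forest equivalent to bagged trees
bagging_equivalent = RandomForestRegressor(max_features=None)

# split-variable randomization: consider roughly a third of the features at each
# split, the common default heuristic for regression problems
rf_split_randomized = RandomForestRegressor(max_features=1/3)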

Out-of-the-box performance

Random forests have become popular because they tend to provide very good out-of-the-box performance. Although they have several hyperparameters that can be tuned, the default values tend to produce good results. Moreover, Probst, Bischl, and Boulesteix (2018) illustrated that among the more popular machine learning algorithms, random forests have the least variability in their prediction accuracy when tuning.

For example, if we train a random forest model with all hyperparameters set to their default values, we get RMSEs comparable to some of the best models we’ve run thus far (without any tuning).

In Python we use RandomForestRegressor and RandomForestClassifier to fit random forest models. Similar to the bagging example in the last module, we are going to ordinal encode our Quality/Condition features (i.e., Overall_Qual, Garage_Qual, Kitchen_Qual) and, as usual, one-hot encode our remaining nominal features.

# Ordinal encode our quality-based features 
ord_cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical", 
       "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in ord_cols]
ord_encoder = OrdinalEncoder(cols=ord_cols, mapping=category_mapping)

# one hot encode remaining nominal features
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

# combine into a pre-processing pipeline
preprocessor = ColumnTransformer(
  remainder="passthrough",
  transformers=[
   ("ord_encode", ord_encoder, ord_cols),
   ("one-hot", encoder, selector(dtype_include="object")),
   ]
  )
# create random forest estimator
rf_mod = RandomForestRegressor()

# create modeling pipeline
model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("rf_mod", rf_mod),
])

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# fit model with 5-fold CV
results = cross_val_score(model_pipeline, X_train, y_train, cv=kfold, scoring=loss)

np.abs(np.mean(results))
## 27738.17557939341

In R we will want to use the ranger package as our random forest engine.

# create model recipe with all features
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )

# create random forest model object
# with default hyperparameters
rf_mod <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

# create resampling procedure
set.seed(13)
kfold <- vfold_cv(ames_train, v = 5)

# train model
results <- fit_resamples(rf_mod, model_recipe, kfold)

# model results
collect_metrics(results)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n    std_err .config             
##   <chr>   <chr>          <dbl> <int>      <dbl> <chr>               
## 1 rmse    standard   26981.        5 1119.      Preprocessor1_Model1
## 2 rsq     standard       0.895     5    0.00862 Preprocessor1_Model1

Hyperparameters

Although random forests perform well out-of-the-box, there are several tunable hyperparameters that we should consider when training a model. Although we briefly discuss the main hyperparameters, Probst, Wright, and Boulesteix (2019) provide a much more thorough discussion. The main hyperparameters to consider include:

  1. The number of trees in the forest
  2. The number of features to consider at any given split: \(m_{try}\)
  3. The complexity of each tree
  4. The sampling scheme
  5. The splitting rule to use during tree construction

From this list, (1) and (2) typically have the largest impact on predictive accuracy and should always be tuned. (3) and (4) tend to have a marginal impact on predictive accuracy but are still worth exploring; they can also influence computational efficiency. (5) tends to have the smallest impact on predictive accuracy and is used primarily to increase computational efficiency.

Number of trees

The first consideration is the number of trees within your random forest. Although not technically a hyperparameter, the number of trees needs to be sufficiently large to stabilize the error rate. A good rule of thumb is to start with 10 times the number of features, as illustrated below; however, as you adjust other hyperparameters such as \(m_{try}\) and node size, more or fewer trees may be required. More trees provide more robust and stable error estimates and variable importance measures; however, computation time increases linearly with the number of trees.

Start with \(p \times 10\) trees and adjust as necessary.

Figure: The Ames data has 80 features and starting with 10 times the number of features typically ensures the error estimate converges.
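
For example, with the Ames training data this rule of thumb translates to roughly 800 trees; a minimal sketch (the object name is illustrative):

# rule of thumb: start with 10 trees per feature and adjust as other
# hyperparameters (e.g., m_try, node size) change
p = X_train.shape[1]
rf_ntrees = RandomForestRegressor(n_estimators=10 * p)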

\(m_{try}\)

The hyperparameter that controls the split-variable randomization feature of random forests is often referred to as \(m_{try}\) and it helps to balance low tree correlation with reasonable predictive strength. With regression problems the default value is often \(m_{try} = \frac{p}{3}\) and for classification \(m_{try} = \sqrt{p}\). However, when there are fewer relevant predictors (e.g., noisy data) a higher value of \(m_{try}\) tends to perform better because it makes it more likely to select those features with the strongest signal. When there are many relevant predictors, a lower \(m_{try}\) might perform better.

Start with five evenly spaced values of \(m_{try}\) across the range 2–\(p\), centered at the recommended default, as illustrated below.

Figure: For the Ames data, an mtry value slightly lower (21) than the default (26) improves performance.
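
A minimal sketch of generating such a grid for the Ames data (the variable names are illustrative); the resulting counts can be passed to RandomForestRegressor via max_features or to ranger via mtry:

# five evenly spaced m_try candidates spanning 2 to p
p = X_train.shape[1]
mtry_grid = np.linspace(2, p, num=5, dtype=int)   # e.g., array([ 2, 21, 41, 60, 80]) when p = 80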

Tree complexity

Random forests are built on individual decision trees; consequently, most random forest implementations have one or more hyperparameters that allow us to control the depth and complexity of the individual trees. This will often include hyperparameters such as node size, max depth, max number of terminal nodes, or the required node size to allow additional splits. Node size is probably the most common hyperparameter to control tree complexity and most implementations use the default values of one for classification and five for regression as these values tend to produce good results (Dı́az-Uriarte and De Andres 2006; Goldstein, Polley, and Briggs 2011). However, Segal (2004) showed that if your data has many noisy predictors and higher \(m_{try}\) values are performing best, then performance may improve by increasing node size (i.e., decreasing tree depth and complexity). Moreover, if computation time is a concern then you can often decrease run time substantially by increasing the node size and have only marginal impacts to your error estimate as illustrated below.

When adjusting node size, start with three values between 1–10 and adjust depending on the impact to accuracy and run time.

Figure: Increasing node size to reduce tree complexity will often have a larger impact on computation speed (right) than on your error estimate.
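
Node size roughly corresponds to min_samples_leaf in scikit-learn and min.node.size in ranger. A minimal sketch (names are illustrative):

# three candidate node sizes; larger values grow shallower trees and train faster
node_size_candidates = [1, 5, 10]
rf_shallow = RandomForestRegressor(min_samples_leaf=10)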

Sampling scheme

The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement (in other words, each bootstrap copy has the same size as the original training data); however, we can adjust both the sample size and whether to sample with or without replacement. The sample size parameter determines how many observations are drawn for the training of each tree. Decreasing the sample size leads to more diverse trees and thereby lower between-tree correlation, which can have a positive effect on the prediction accuracy. Consequently, if there are a few dominating features in your data set, reducing the sample size can also help to minimize between-tree correlation.

Also, when you have many categorical features with a varying number of levels, sampling with replacement can lead to biased variable split selection (Janitza, Binder, and Boulesteix 2016; Strobl et al. 2007). Consequently, if you have categories that are not balanced, sampling without replacement provides a less biased use of all levels across the trees in the random forest.

Assess 3–4 values of sample sizes ranging from 25%–100%, and if you have unbalanced categorical features, try sampling without replacement.

Figure: The Ames data has several imbalanced categorical features such as neighborhood, zoning, overall quality, and more. Consequently, sampling without replacement appears to improve performance as it leads to less biased split variable selection and more uncorrelated trees.
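
In scikit-learn the per-tree sample size is controlled by max_samples (a fraction of the training data) and only applies when bootstrap=True; the with/without replacement choice maps to ranger's replace argument in R. A minimal sketch (names are illustrative):

# candidate sample fractions for each tree
sample_fraction_candidates = [0.25, 0.50, 0.75, 1.0]

# e.g., draw 50% of the observations (with replacement) for each tree
rf_subsample = RandomForestRegressor(bootstrap=True, max_samples=0.5)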

Split rule

Recall that the default splitting rule during random forest tree building consists of selecting, out of all splits of the (randomly selected \(m_{try}\)) candidate variables, the split that minimizes the Gini impurity (in the case of classification) or the SSE (in the case of regression). However, Strobl et al. (2007) illustrated that these default splitting rules favor the selection of features with many possible splits (e.g., continuous variables or categorical variables with many categories) over variables with fewer splits (the extreme case being binary variables, which have only one possible split). Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) implement an alternative splitting mechanism that helps to reduce this variable selection bias.2 However, ensembling conditional inference trees has yet to be proven superior with regards to predictive accuracy, and they take a lot longer to train.

To increase computational efficiency, splitting rules can be randomized so that only a random subset of possible splitting values is considered for a variable (Geurts, Ernst, and Wehenkel 2006). If only a single splitting value is randomly drawn for each candidate variable, we call this procedure extremely randomized trees. Due to the added randomness of split points, this method tends to offer no improvement in predictive accuracy and can often have a negative impact.

Regarding runtime, extremely randomized trees are the fastest as the cutpoints are drawn completely randomly, followed by the classical random forest, while for conditional inference forests the runtime is the largest (Probst, Wright, and Boulesteix 2019).

If you need to decrease computation time significantly, try extremely randomized trees; however, be sure to compare predictive accuracy against traditional split rules, as this approach often has a negative impact on your loss function.
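
In scikit-learn, extremely randomized trees are available through ExtraTreesRegressor (and ExtraTreesClassifier). A minimal sketch that swaps it into the preprocessing pipeline from earlier so its cross-validated RMSE can be compared against the random forest baseline (the pipeline name is illustrative):

from sklearn.ensemble import ExtraTreesRegressor

# extremely randomized trees: a split point is drawn at random for each candidate feature
xrt_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("xrt_mod", ExtraTreesRegressor()),
])

xrt_results = cross_val_score(xrt_pipeline, X_train, y_train, cv=kfold, scoring=loss)
np.abs(np.mean(xrt_results))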

Tuning strategies
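
The hyperparameters above can be tuned with a full Cartesian grid search (GridSearchCV in Python, tune_grid() with a regular grid in R) or with a randomized search that samples a fixed number of candidate models from the grid. As a minimal sketch, a randomized search over the modeling pipeline defined earlier might look like the following; the parameter ranges and number of candidates are illustrative starting points rather than the grid used to produce the results later in this module:

# candidate distributions for the main hyperparameters (illustrative ranges)
param_distributions = {
  "rf_mod__max_features": uniform(0.1, 0.7),       # fraction of features per split
  "rf_mod__min_samples_leaf": randint(1, 10),      # node size / tree complexity
  "rf_mod__max_samples": uniform(0.5, 0.45),       # sample fraction per tree
}

# randomly sample 20 candidate models and evaluate each with 5-fold CV
random_search = RandomizedSearchCV(
  model_pipeline,
  param_distributions=param_distributions,
  n_iter=20,
  scoring=loss,
  cv=kfold,
  random_state=123,
)

random_search_results = random_search.fit(X_train, y_train)

# best cross-validated RMSE and the hyperparameters that produced it
np.abs(random_search_results.best_score_)
random_search_results.best_params_

A full Cartesian search works the same way with GridSearchCV and an explicit list of values for each hyperparameter, at the cost of evaluating every combination.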

Feature interpretation

Computing feature importance and feature effects for random forests follows the same procedure as discussed in the bagging module. However, in addition to the impurity-based measure of feature importance, where we base feature importance on the average total reduction of the loss function for a given feature across all trees, random forests also typically include a permutation-based importance measure. In the permutation-based approach, for each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values for each variable (one at a time) are randomly permuted and the accuracy is again computed. The decrease in accuracy as a result of this random shuffling of feature values is averaged over all the trees for each predictor. The variables with the largest average decrease in accuracy are considered most important.

Once you’ve identified the optimal parameter values from the grid search, you will want to re-run your model with these hyperparameter values and crank up the number of trees, which will help create more stable estimates of variable importance.

The following shows how to plot the default feature importance values provided by RandomForestRegressor, which are based on averaging the decrease in impurity over all the trees in the random forest model.

For a deep dive into the different ways to compute feature importance for random forests in Python check out this article.

# create final model object
X_encoded = preprocessor.fit_transform(X_train)
final_model = RandomForestRegressor(
  n_estimators=1000,
  max_features=0.21,
  max_samples=0.65,
  min_samples_leaf=1,
  bootstrap=True  # bootstrap must be True for max_samples to take effect in scikit-learn
)
final_model_fit = final_model.fit(X_encoded, y_train)

# extract feature importances
vi = pd.DataFrame({'feature': preprocessor.get_feature_names(),
                   'importance': final_model_fit.feature_importances_})

# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')

# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
 + geom_point()
 + labs(y=None))
## <ggplot: (317426902)>
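
scikit-learn also provides a permutation-based measure through sklearn.inspection.permutation_importance. Unlike the OOB-based approach described above, it permutes each feature on a data set you supply (here the training data), but the interpretation is similar. A minimal sketch (object names are illustrative):

from sklearn.inspection import permutation_importance

# permutation importance computed on the supplied data rather than OOB samples
perm_results = permutation_importance(
  final_model_fit, X_encoded, y_train,
  scoring="neg_root_mean_squared_error",
  n_repeats=5,
  random_state=123,
)

# average increase in RMSE caused by shuffling each feature
perm_vi = pd.DataFrame({
  'feature': preprocessor.get_feature_names(),
  'importance': perm_results.importances_mean,
}).nlargest(20, 'importance')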

And as we have done in the past we can build upon this to assess how some of our important features influence the predicted values. For example, the following looks at the third most influential feature (Gr_Liv_Area) and produces a partial dependence plot.

X_encoded = pd.DataFrame(X_encoded, columns=preprocessor.get_feature_names())
pd_results = partial_dependence(
  final_model_fit, X_encoded, "Gr_Liv_Area", kind='average',
  percentiles=(0, 1)) 
  
pd_output = pd.DataFrame({'Gr_Liv_Area': pd_results['values'][0],
                          'yhat': pd_results['average'][0]})
                          
(ggplot(pd_output, aes('Gr_Liv_Area', 'yhat'))
  + geom_line())
## <ggplot: (320190787)>

Once we have our (near) optimal model, we can use those parameters and re-run our model. For simplicity, we can re-run our model using ranger() from the ranger package. This is the same engine we were using with tidymodels, but since we are simply trying to understand variable importance, we don’t need to worry about cross-validation.

# re-run model with impurity-based variable importance
rf_impurity <- ranger(
  formula = Sale_Price ~ ., 
  data = ames_train, 
  num.trees = 2000,
  mtry = 26,
  min.node.size = 1,
  sample.fraction = .5,
  replace = FALSE,
  importance = "impurity",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)

# re-run model with permutation-based variable importance
rf_permutation <- ranger(
  formula = Sale_Price ~ ., 
  data = ames_train, 
  num.trees = 2000,
  mtry = 26,
  min.node.size = 1,
  sample.fraction = .5,
  replace = FALSE,
  importance = "permutation",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)

The resulting VIPs are displayed below. Typically, you will not see the same variable importance order between the two options; however, you will often see similar variables at the top of the plots (and also the bottom). Consequently, in this example, we can comfortably state that there appears to be enough evidence to suggest that three variables stand out as most influential:

  • Overall_Qual
  • Gr_Liv_Area
  • Neighborhood

Looking at the next ~10 variables in both plots, you will also see some commonality in influential variables (e.g., Garage_Cars, Exter_Qual, Total_Bsmt_SF, and Year_Built).

p1 <- vip::vip(rf_impurity, num_features = 25, bar = FALSE)
p2 <- vip::vip(rf_permutation, num_features = 25, bar = FALSE)

gridExtra::grid.arrange(p1, p2, nrow = 1)

Figure: Top 25 most important variables based on impurity (left) and permutation (right).

Final thoughts

Random forests provide a very powerful out-of-the-box algorithm that often has great predictive accuracy. They come with all the benefits of decision trees (with the exception of surrogate splits) and bagging but greatly reduce instability and between-tree correlation. And due to the added split variable selection attribute, random forests are also faster than bagging as they have a smaller feature search space at each tree split. However, random forests will still suffer from slow computational speed as your data sets get larger but, similar to bagging, the algorithm is built upon independent steps, and most modern implementations allow for parallelization to improve training time.

Exercises

Using the Boston housing data set, where the response feature is the median value of homes within a census tract (cmedv):

  1. Apply a default random forest model with the same features you used in the bagging module. How does the out-of-the-box random forest model perform compared to the bagging module?
  2. Assess the number of trees in your random forest model.
    • How many trees are applied?
    • Was it enough to stabilize the loss function or do you need to add more?
  3. Perform a full Cartesian grid search across various values of:
    • \(m_{try}\)
    • tree complexity (i.e. max depth, node size)
    • sampling scheme
  4. How long did the above grid search take? Which model gave the best performance?
  5. Now run a random grid search across the same hyperparameter grid but restrict the time or number of models to run to 50% of the models ran in the full cartesian. How does the random grid search results compare?
  6. Pick your best random forest model. Which 10 features are considered most influential? Are these the same features that have been influential in previous models?
  7. Create partial dependence plots for the top two most influential features. Explain the relationship between the feature and the predicted values.
  8. Now perform exercises 1–7 on the Attrition data set, which is a classification problem rather than a regression problem.


References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Dı́az-Uriarte, Ramón, and Sara Alvarez De Andres. 2006. “Gene Selection and Classification of Microarray Data Using Random Forest.” BMC Bioinformatics 7 (1): 3.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.
Geurts, Pierre, Damien Ernst, and Louis Wehenkel. 2006. “Extremely Randomized Trees.” Machine Learning 63 (1): 3–42.
Goldstein, Benjamin A, Eric C Polley, and Farren BS Briggs. 2011. “Random Forests for Genetic Association Studies.” Statistical Applications in Genetics and Molecular Biology 10 (1).
Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3): 651–74.
Hothorn, Torsten, and Achim Zeileis. 2015. “Partykit: A Modular Toolkit for Recursive Partytioning in R.” The Journal of Machine Learning Research 16 (1): 3905–9.
Janitza, Silke, Harald Binder, and Anne-Laure Boulesteix. 2016. “Pitfalls of Hypothesis Tests and Model Selection on Bootstrap Samples: Causes and Consequences in Biometrical Applications.” Biometrical Journal 58 (3): 447–73.
Probst, Philipp, Bernd Bischl, and Anne-Laure Boulesteix. 2018. “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” arXiv Preprint arXiv:1802.09596.
Probst, Philipp, Marvin N Wright, and Anne-Laure Boulesteix. 2019. “Hyperparameters and Tuning Strategies for Random Forest.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1301.
Segal, Mark R. 2004. “Machine Learning Benchmarks and Random Forest Regression.” UCSF: Center for Bioinformatics and Molecular Biostatistics.
Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1): 25.

  1. See Friedman, Hastie, and Tibshirani (2001) for a mathematical explanation of the tree correlation phenomenon.↩︎

  2. Conditional inference trees are available in the partykit (Hothorn and Zeileis 2015) and ranger packages in R; however, there is not a current implementation available in Python.↩︎