Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance. They have become a very popular “out-of-the-box” or “off-the-shelf” learning algorithm that enjoys good predictive performance with relatively little hyperparameter tuning. Many modern implementations of random forests exist; however, Leo Breiman’s algorithm (Breiman 2001) has largely become the authoritative procedure. This module will cover the fundamentals of random forests.
By the end of this module you will know:
# Helper packages
import numpy as np
import pandas as pd
from plotnine import *
from scipy.stats import uniform
from scipy.stats import randint
# Modeling packages
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.inspection import partial_dependence
from sklearn.pipeline import Pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]
# Helper packages
library(tidyverse) # for data wrangling & plotting
# Modeling packages
library(tidymodels)
# Model interpretability packages
library(vip) # for variable importance
library(pdp) # for variable relationships
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test <- rsample::testing(split)
Random forests are built using the same fundamental principles as decision trees and bagging. Bagging trees introduces a random component into the tree building process by building many trees on bootstrapped copies of the training data. Bagging then aggregates the predictions across all the trees; this aggregation reduces the variance of the overall procedure and results in improved predictive performance. However, as we saw in the last module, simply bagging trees results in tree correlation that limits the effect of variance reduction.
Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process.1 More specifically, while growing a decision tree during the bagging process, random forests perform split-variable randomization where each time a split is to be performed, the search for the split variable is limited to a random subset of \(m_{try}\) of the original \(p\) features. Typical default values are \(m_{try} = \frac{p}{3}\) (regression) and \(m_{try} = \sqrt{p}\) (classification) but this should be considered a tuning parameter.
The basic algorithm for a regression or classification random forest can be generalized as follows:
1. Given a training data set
2. Select number of trees to build (n_trees)
3. for i = 1 to n_trees do
4. | Generate a bootstrap sample of the original data
5. | Grow a regression/classification tree to the bootstrapped data
6. | for each split do
7. | | Select m_try variables at random from all p variables
8. | | Pick the best variable/split-point among the m_try
9. | | Split the node into two child nodes
10. | end
11. | Use typical tree model stopping criteria to determine when a
| tree is complete (but do not prune)
12. end
13. Output ensemble of trees
When \(m_{try} = p\), the algorithm is equivalent to bagging decision trees.
Since the algorithm randomly selects a bootstrap sample to train on and a random sample of features to use at each split, a more diverse set of trees is produced which tends to lessen tree correlation beyond bagged trees and often dramatically increase predictive power.
Random forests have become popular because they tend to provide very good out-of-the-box performance. Although they have several hyperparameters that can be tuned, the default values tend to produce good results. Moreover, Probst, Bischl, and Boulesteix (2018) illustrated that among the more popular machine learning algorithms, random forests have the least variability in their prediction accuracy when tuning.
For example, if we train a random forest model with all hyperparameters set to their default values, we get RMSEs comparable to some of the best model’s we’ve run thus far (without any tuning).
In Python we use RandomForestRegressor and RandomForestClassifier to performing random forest models. Similar to the bagging example in the last module we are going to ordinal encode our Quality/Condition features (i.e. Overall_Qual, Garage_Qual, Kitchen_Qual) and, as usually, we need to one-hot encode our remaining nominal features.
# Ordinal encode our quality-based features
ord_cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical",
"Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in ord_cols]
ord_encoder = OrdinalEncoder(cols=ord_cols, mapping=category_mapping)
# one hot encode remaining nominal features
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
# combine into a pre-processing pipeline
preprocessor = ColumnTransformer(
remainder="passthrough",
transformers=[
("ord_encode", ord_encoder, ord_cols),
("one-hot", encoder, selector(dtype_include="object")),
]
)
# create random forest estimator
rf_mod = RandomForestRegressor()
# create modeling pipeline
model_pipeline = Pipeline(steps=[
("preprocessor", preprocessor),
("rf_mod", rf_mod),
])
# define loss function
loss = 'neg_root_mean_squared_error'
# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)
# fit model with 5-fold CV
results = cross_val_score(model_pipeline, X_train, y_train, cv=kfold, scoring=loss)
np.abs(np.mean(results))
## 27738.17557939341
In R we will want to use the ranger package as our random forest engine.
# create model recipe with all features
model_recipe <- recipe(
Sale_Price ~ .,
data = ames_train
)
# create bagged CART model object and
# start with 5 bagged trees
rf_mod <- rand_forest(mode = "regression") %>%
set_engine("ranger")
# create resampling procedure
set.seed(13)
kfold <- vfold_cv(ames_train, v = 5)
# train model
results <- fit_resamples(rf_mod, model_recipe, kfold)
# model results
collect_metrics(results)
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 26981. 5 1119. Preprocessor1_Model1
## 2 rsq standard 0.895 5 0.00862 Preprocessor1_Model1
Although random forests perform well out-of-the-box, there are several tunable hyperparameters that we should consider when training a model. Although we briefly discuss the main hyperparameters, Probst, Wright, and Boulesteix (2019) provide a much more thorough discussion. The main hyperparameters to consider include:
From this list (1) and (2) typically have the largest impact on predictive accuracy and should always be tuned. (3) and (4) tend to have marginal impact on predictive accuracy but are still worth exploring. They also have the ability to influence computational efficiency. (5) tends to have the smallest impact on predictive accuracy and is used primarily to increase computational efficiency.
The first consideration is the number of trees within your random forest. Although not technically a hyperparameter, the number of trees needs to be sufficiently large to stabilize the error rate. A good rule of thumb is to start with 10 times the number of features as illustrated below); however, as you adjust other hyperparameters such as \(m_{try}\) and node size, more or fewer trees may be required. More trees provide more robust and stable error estimates and variable importance measures; however, the impact on computation time increases linearly with the number of trees.
Start with \(p \times 10\) trees and adjust as necessary.
The hyperparameter that controls the split-variable randomization feature of random forests is often referred to as \(m_{try}\) and it helps to balance low tree correlation with reasonable predictive strength. With regression problems the default value is often \(m_{try} = \frac{p}{3}\) and for classification \(m_{try} = \sqrt{p}\). However, when there are fewer relevant predictors (e.g., noisy data) a higher value of \(m_{try}\) tends to perform better because it makes it more likely to select those features with the strongest signal. When there are many relevant predictors, a lower \(m_{try}\) might perform better.
Start with five evenly spaced values of \(m_{try}\) across the range 2–\(p\) centered at the recommended default as illustrated below. For the Ames data, an mtry value slightly lower (21) than the default (26) improves performance.
Random forests are built on individual decision trees; consequently, most random forest implementations have one or more hyperparameters that allow us to control the depth and complexity of the individual trees. This will often include hyperparameters such as node size, max depth, max number of terminal nodes, or the required node size to allow additional splits. Node size is probably the most common hyperparameter to control tree complexity and most implementations use the default values of one for classification and five for regression as these values tend to produce good results (Dı́az-Uriarte and De Andres 2006; Goldstein, Polley, and Briggs 2011). However, Segal (2004) showed that if your data has many noisy predictors and higher \(m_{try}\) values are performing best, then performance may improve by increasing node size (i.e., decreasing tree depth and complexity). Moreover, if computation time is a concern then you can often decrease run time substantially by increasing the node size and have only marginal impacts to your error estimate as illustrated below.
When adjusting node size start with three values between 1–10 and adjust depending on impact to accuracy and run time. Increasing node size to reduce tree complexity will often have a larger impact on computation speed (right) than on your error estimate.
The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement (in other words, each bootstrap copy has the same size as the original training data); however, we can adjust both the sample size and whether to sample with or without replacement. The sample size parameter determines how many observations are drawn for the training of each tree. Decreasing the sample size leads to more diverse trees and thereby lower between-tree correlation, which can have a positive effect on the prediction accuracy. Consequently, if there are a few dominating features in your data set, reducing the sample size can also help to minimize between-tree correlation.
Also, when you have many categorical features with a varying number of levels, sampling with replacement can lead to biased variable split selection (Janitza, Binder, and Boulesteix 2016; Strobl et al. 2007). Consequently, if you have categories that are not balanced, sampling without replacement provides a less biased use of all levels across the trees in the random forest.
Assess 3–4 values of sample sizes ranging from 25%–100% and if you have unbalanced categorical features try sampling without replacement. The Ames data has several imbalanced categorical features such as neighborhood, zoning, overall quality, and more. Consequently, sampling without replacement appears to improve performance as it leads to less biased split variable selection and more uncorrelated trees.
Recall the default splitting rule during random forests tree building consists of selecting, out of all splits of the (randomly selected \(m_{try}\)) candidate variables, the split that minimizes the Gini impurity (in the case of classification) and the SSE (in case of regression). However, Strobl et al. (2007) illustrated that these default splitting rules favor the selection of features with many possible splits (e.g., continuous variables or categorical variables with many categories) over variables with fewer splits (the extreme case being binary variables, which have only one possible split). Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) implement an alternative splitting mechanism that helps to reduce this variable selection bias.2 However, ensembling conditional inference trees has yet to be proven superior with regards to predictive accuracy and they take a lot longer to train.
To increase computational efficiency, splitting rules can be randomized where only a random subset of possible splitting values is considered for a variable (Geurts, Ernst, and Wehenkel 2006). If only a single random splitting value is randomly selected then we call this procedure extremely randomized trees. Due to the added randomness of split points, this method tends to have no improvement, or often a negative impact, on predictive accuracy.
Regarding runtime, extremely randomized trees are the fastest as the cutpoints are drawn completely randomly, followed by the classical random forest, while for conditional inference forests the runtime is the largest (Probst, Wright, and Boulesteix 2019).
If you need to increase computation time significantly try completely randomized trees; however, be sure to assess predictive accuracy to traditional split rules as this approach often has a negative impact on your loss function.
As we introduce more complex algorithms with greater number of hyperparameters, we should become more strategic with our tuning strategies. One way to become more strategic is to consider how we proceed through our grid search. Up to this point, all our grid searches have been full Cartesian grid searches where we assess every combination of hyperparameters of interest. We could continue to do the same; for example, the next code block searches across 120 combinations of hyperparameter settings.
To perform a regular cartesian grid search we create the hyperparameter grid and use GridSearchCV
as we have done in the past.
The following grid search results in a search over 120 different hyperparameter combinations, which results in a grid search time of 30 minutes!
# create random forest estimator with 1,000 trees
rf_mod = RandomForestRegressor(n_estimators=1000)
# create modeling pipeline
model_pipeline = Pipeline(steps=[
("preprocessor", preprocessor),
("rf_mod", rf_mod),
])
# Create grid of hyperparameter values
hyper_grid = {
'rf_mod__max_features': [.05, .15, .25, .333, .4],
'rf_mod__min_samples_leaf': [1, 3, 5, 10],
'rf_mod__bootstrap': [True, False],
'rf_mod__max_samples': [.5, .63, .8]
}
# Tune a knn model using grid search
grid_search = GridSearchCV(model_pipeline, hyper_grid, cv=kfold, scoring=loss, n_jobs=-1)
results = grid_search.fit(X_train, y_train)
# best model score
np.abs(results.best_score_)
## 25801.956501417742
# best hyperparameter values
results.best_params_
## {'rf_mod__bootstrap': False, 'rf_mod__max_features': 0.25, 'rf_mod__max_samples': 0.5, 'rf_mod__min_samples_leaf': 1}
To perform a grid search across various random forest hyperparameters we will use rand_forest()
with the engine set to "ranger"
, which is a fast C implementation of random forests. One thing you will note is that rand_forest()
accepts three hyperparameters; however, ranger allows for additional hyperparameters. We can pass additional hyperparameters that a specific “engine” accepts by including them in set_engine()
. This is why you see tuning parameters in both rand_forest()
and set_engine()
.
The following grid search results in a search over 120 different hyperparameter combinations, which results in a grid search time of 64 minutes!
# create model recipe with all features
model_recipe <- recipe(
Sale_Price ~ .,
data = ames_train
)
# create random forest model object with tuning option
rf_mod <- rand_forest(
mode = "regression",
trees = 1000,
mtry = tune(),
min_n = tune()
) %>%
set_engine(
"ranger",
replace = tune(),
sample.fraction = tune(),
respect.unordered.factors = 'order',
seed = 123
)
# create the hyperparameter grid
hyper_grid <- expand.grid(
mtry = floor(n_features * c(.05, .15, .25, .333, .4)),
min_n = c(1, 3, 5, 10),
replace = c(TRUE, FALSE),
sample.fraction = c(.5, .63, .8)
)
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = hyper_grid)
# model results
show_best(results, metric = "rmse")
## # A tibble: 5 x 10
## mtry min_n replace sample.fraction .metric .estimator mean n std_err
## <dbl> <dbl> <lgl> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 20 1 FALSE 0.8 rmse standard 25122. 5 1055.
## 2 26 1 FALSE 0.8 rmse standard 25170. 5 1045.
## 3 20 3 FALSE 0.8 rmse standard 25207. 5 1084.
## 4 26 3 FALSE 0.8 rmse standard 25213. 5 1049.
## 5 12 1 FALSE 0.8 rmse standard 25278. 5 1127.
## # … with 1 more variable: .config <chr>
Unfortunately cartesian grid searches become very slow as our hyperparameter combinations increase. An alternative is approach is to use a random grid search, which allows you to jump from one random combination of hyperparameters. Although using a random discrete search path will likely not find the optimal model, it typically does a good job of finding a very good model.
An even more advanced approach is to add early stopping rules that allow you to stop the grid search once a certain condition is met (e.g., a certain number of models have been trained, a certain runtime has elapsed, or the accuracy has stopped improving by a certain amount).
In Python we use RandomSearchCV
to search across n_iter
randomly sampled combinations of hyperparameter values. An important thing to note is that RandomSearchCV
uses a param_distributions
parameter, which is slightly different then the hyperparameter grid supplied in GridSearchCV
. In this case, we supply a dictionary of hyperparameter distributions to be sampled from. In this example, we use scipy.stats.uniform
to uniformly sample from a distribution for the max_features
, min_samples_leaf
, and max_samples
features.
The following searches across 20 randomly sampled value combinations of our hyperparameters and achieves near-similar results as the full grid search but in a fraction of the time.
# Create grid of hyperparameter values
hyper_distributions = {
'rf_mod__max_features': uniform(.05, .35),
'rf_mod__min_samples_leaf': randint(1, 9),
'rf_mod__bootstrap': [True, False],
'rf_mod__max_samples': uniform(.5, .3)
}
# Tune a knn model using grid search
random_search = RandomizedSearchCV(
model_pipeline,
param_distributions=hyper_distributions,
n_iter=20,
cv=kfold,
scoring=loss,
n_jobs=-1,
random_state=13
)
random_search_results = random_search.fit(X_train, y_train)
# best model score
np.abs(random_search_results.best_score_)
## 25747.449268184002
# best hyperparameter values
random_search_results.best_params_
## {'rf_mod__bootstrap': False, 'rf_mod__max_features': 0.2080219288763343, 'rf_mod__max_samples': 0.6534737243255395, 'rf_mod__min_samples_leaf': 1}
To perform a random grid search we can simply sample a certain percent of observations from our hyperparameter grid. The following samples 30% of our hyperparameter combinations and performs a grid search across these sampled parameters. We see that our results are close to our previous cartesian search.
# take a random subset of our hyperparameter grid
set.seed(123)
sampled_ids <- sample(1:nrow(hyper_grid), floor(nrow(hyper_grid)*.3))
sampled_grid <- hyper_grid[sampled_ids, ]
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(rf_mod, model_recipe, resamples = kfold, grid = sampled_grid)
# model results
show_best(results, metric = "rmse")
## # A tibble: 5 x 10
## mtry min_n replace sample.fraction .metric .estimator mean n std_err
## <dbl> <dbl> <lgl> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 26 1 FALSE 0.8 rmse standard 25170. 5 1045.
## 2 20 3 FALSE 0.8 rmse standard 25207. 5 1084.
## 3 12 1 FALSE 0.8 rmse standard 25278. 5 1127.
## 4 26 5 FALSE 0.8 rmse standard 25350. 5 1075.
## 5 32 5 FALSE 0.8 rmse standard 25415. 5 1065.
## # … with 1 more variable: .config <chr>
We could advance this random search by adding early stopping. In R there are some packages that provide built in early stopping (i.e. H2O, Xgboost). However, for tidymodels we need to manually build this option. There are many different approaches we could take this:
In this example, we stop after running at least 5 models and if we haven’t seen at least 10% improvement in the RMSE.
# create empty vector to hold RMSE results
results <- c()
for (row in 1:nrow(sampled_grid)) {
# train model with sampled grid parameters
rf_mod <- rand_forest(
mode = "regression",
trees = 1000,
mtry = sampled_grid[row, "mtry"],
min_n = sampled_grid[row, "min_n"]
) %>%
set_engine(
"ranger",
replace = sampled_grid[row, "replace"],
sample.fraction = sampled_grid[row, "sample.fraction"],
respect.unordered.factors = 'order',
seed = 123
)
# train model
fit_results <- fit_resamples(rf_mod, model_recipe, kfold)
# get RMSE and add to results vector
rmse <- collect_metrics(fit_results) %>% filter(.metric == "rmse") %>% pluck("mean")
results <- c(results, rmse)
# print results along the way
cat("Model", row, "--> RMSE ==", rmse, "\n")
# Test if the current rmse improves upon the past three 5 scores
# If not then stop the process
threshold <- min(results) * 0.9
if ((row >= 5) && (rmse > threshold)) {
break
}
}
## Model 1 --> RMSE == 25981.21
## Model 2 --> RMSE == 26597.73
## Model 3 --> RMSE == 27115.96
## Model 4 --> RMSE == 28625.3
## Model 5 --> RMSE == 26663.49
And we can identify the best model’s hyperparameter settings with:
sampled_grid[which.min(results), ]
## mtry min_n replace sample.fraction
## 24 26 1 FALSE 0.5
Computing feature importance and feature effects for random forests follow the same procedure as discussed in the bagging module. However, in addition to the impurity-based measure of feature importance where we base feature importance on the average total reduction of the loss function for a given feature across all trees, random forests also typically include a permutation-based importance measure. In the permutation-based approach, for each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values for each variable (one at a time) are randomly permuted and the accuracy is again computed. The decrease in accuracy as a result of this randomly shuffling of feature values is averaged over all the trees for each predictor. The variables with the largest average decrease in accuracy are considered most important.
Once you’ve identified the optimal parameter values from the grid search, you will want to re-run your model with these hyperparameter values and crank up the number of trees, which will help create more stables values of variable importance.
The following shows how to plot the default feature importance values provided by RandomForestRegressor
, which is based averaging the decrease in impurity over all the trees in the random forest model.
For a deep dive into the different ways to compute feature importance for random forests in Python check out this article.
# create final model object
X_encoded = preprocessor.fit_transform(X_train)
final_model = RandomForestRegressor(
n_estimators=1000,
max_features=0.21,
max_samples=0.65,
min_samples_leaf=1,
bootstrap=False
)
final_model_fit = final_model.fit(X_encoded, y_train)
# extract feature importances
vi = pd.DataFrame({'feature': preprocessor.get_feature_names(),
'importance': final_model_fit.feature_importances_})
# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')
# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
+ geom_point()
+ labs(y=None))
## <ggplot: (317426902)>
And as we have done in the past we can build upon this to assess how some of our important features influence the predicted values. For example, the following looks at the third most influential feature (Gr_Liv_Area
) and produces a partial dependence plot.
X_encoded = pd.DataFrame(X_encoded, columns=preprocessor.get_feature_names())
pd_results = partial_dependence(
final_model_fit, X_encoded, "Gr_Liv_Area", kind='average',
percentiles=(0, 1))
pd_output = pd.DataFrame({'Gr_Liv_Area': pd_results['values'][0],
'yhat': pd_results['average'][0]})
(ggplot(pd_output, aes('Gr_Liv_Area', 'yhat'))
+ geom_line())
## <ggplot: (320190787)>
Once we have our (near) optimal model we can use those parameters and re-run our model. For simplicity, we can re-run our model using the ranger()
from the ranger package. This is the same engine we were using with tidymodels but since we are simply trying to understand variable importance we don’t need to worry about cross-validation.
# re-run model with impurity-based variable importance
rf_impurity <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = 2000,
mtry = 26,
min.node.size = 1,
sample.fraction = .5,
replace = FALSE,
importance = "impurity",
respect.unordered.factors = "order",
verbose = FALSE,
seed = 123
)
# re-run model with permutation-based variable importance
rf_permutation <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = 2000,
mtry = 26,
min.node.size = 1,
sample.fraction = .5,
replace = FALSE,
importance = "permutation",
respect.unordered.factors = "order",
verbose = FALSE,
seed = 123
)
The resulting VIPs are displayed below. Typically, you will not see the same variable importance order between the two options; however, you will often see similar variables at the top of the plots (and also the bottom). Consequently, in this example, we can comfortably state that there appears to be enough evidence to suggest that three variables stand out as most influential:
Overall_Qual
Gr_Liv_Area
Neighborhood
Looking at the next ~10 variables in both plots, you will also see some commonality in influential variables (e.g., Garage_Cars
, Exter_Qual
, Total_Bsmt_SF
, and Year_Built
).
p1 <- vip::vip(rf_impurity, num_features = 25, bar = FALSE)
p2 <- vip::vip(rf_permutation, num_features = 25, bar = FALSE)
gridExtra::grid.arrange(p1, p2, nrow = 1)
Random forests provide a very powerful out-of-the-box algorithm that often has great predictive accuracy. They come with all the benefits of decision trees (with the exception of surrogate splits) and bagging but greatly reduce instability and between-tree correlation. And due to the added split variable selection attribute, random forests are also faster than bagging as they have a smaller feature search space at each tree split. However, random forests will still suffer from slow computational speed as your data sets get larger but, similar to bagging, the algorithm is built upon independent steps, and most modern implementations allow for parallelization to improve training time.
Using the Boston housing data set, where the response feature is the median value of homes within a census tract (cmedv
):