In an earlier module we learned about bootstrapping as a resampling procedure that creates b new bootstrap samples by drawing observations with replacement from the original training data. This module illustrates how we can use bootstrapping to create an ensemble of predictions. Bootstrap aggregating, also called bagging, is one of the first ensemble algorithms1 machine learning practitioners learn, and it is designed to improve the stability and accuracy of regression and classification algorithms. By averaging over many models, bagging helps to reduce variance and minimize overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.

Learning objectives

By the end of this module you will know:

  • How and why bagging improves decision tree performance.
  • How to implement bagging and ensure you are optimizing bagging performance.
  • How to identify influential features and their effects on the response variable.

Prerequisites

# Helper packages
import numpy as np
import pandas as pd
from plotnine import *

# Modeling packages
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import partial_dependence
from sklearn.pipeline import Pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(tidyverse)   # for data wrangling & plotting

# Modeling packages
library(tidymodels) 
library(baguette)

# Model interpretability packages
library(vip)         # for variable importance
library(pdp)         # for variable relationships
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

Why and when bagging works

Bootstrap aggregating (bagging) prediction models is a general method for fitting multiple versions of a prediction model and then combining (or ensembling) them into an aggregated prediction (Breiman 1996). Bagging is a fairly straightforward algorithm: b bootstrap copies of the original training data are created, the regression or classification algorithm (commonly referred to as the base learner) is applied to each bootstrap sample, and, in the regression context, new predictions are made by averaging the predictions from the individual base learners. When dealing with a classification problem, the base learner predictions are combined using plurality vote or by averaging the estimated class probabilities together. This is represented in the equation below, where \(X\) is the record for which we want to generate a prediction, \(\widehat{f_{bag}}\) is the bagged prediction, and \(\widehat{f_1}\left(X\right), \widehat{f_2}\left(X\right), \dots, \widehat{f_b}\left(X\right)\) are the predictions from the individual base learners.

\[\begin{equation} \widehat{f_{bag}} = \frac{1}{b}\left[\widehat{f_1}\left(X\right) + \widehat{f_2}\left(X\right) + \cdots + \widehat{f_b}\left(X\right)\right] \end{equation}\]
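To make the procedure concrete, here is a minimal from-scratch sketch of bagging for regression. This is our own illustration rather than part of the module's code: the helper name bagged_predictions, the n_bags argument, and the use of sklearn.utils.resample are our choices, and the sketch presumes an all-numeric feature matrix.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def bagged_predictions(X, y, X_new, n_bags=100):
    """Fit one base learner per bootstrap sample and average the predictions."""
    preds = []
    for b in range(n_bags):
        # draw a bootstrap copy of the training data (sampling with replacement)
        X_boot, y_boot = resample(X, y, replace=True, random_state=b)
        # fit the base learner to the bootstrap sample
        base_learner = DecisionTreeRegressor().fit(X_boot, y_boot)
        preds.append(base_learner.predict(X_new))
    # regression: the bagged prediction is the average across the b base learners
    return np.mean(preds, axis=0)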

Because of the aggregation process, bagging effectively reduces the variance of an individual base learner (i.e., averaging reduces variance); however, bagging does not always improve upon an individual base learner. As discussed in our bias vs variance discussion, some models have larger variance than others. Bagging works especially well for unstable, high variance base learners—algorithms whose predicted output undergoes major changes in response to small changes in the training data (Dietterich 2000b, 2000a). This includes algorithms such as decision trees and KNN (when k is sufficiently small). However, for algorithms that are more stable or have high bias, bagging offers less improvement on predicted outputs since there is less variability (e.g., bagging a linear regression model will effectively just return the original predictions for large enough \(b\)).
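As a quick illustration of that last point, the sketch below (our own toy example on simulated data, not part of the module's code) bags a linear regression and a decision tree with 100 estimators and compares the bagged predictions to the corresponding single-model predictions. You will typically see the bagged linear model's predictions stay very close to the single fit, while the bagged tree differs much more noticeably from any individual tree.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(123)
X_sim = rng.uniform(0, 10, size=(200, 1))
y_sim = np.sin(X_sim).ravel() + rng.normal(scale=0.3, size=200)
X_new = np.linspace(0, 10, 50).reshape(-1, 1)

for base in [LinearRegression(), DecisionTreeRegressor()]:
    single = base.fit(X_sim, y_sim).predict(X_new)
    bagged = BaggingRegressor(base_estimator=base, n_estimators=100, random_state=123).fit(X_sim, y_sim).predict(X_new)
    # average absolute difference between bagged and single-model predictions
    print(type(base).__name__, round(np.mean(np.abs(bagged - single)), 3))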

The general idea behind bagging is referred to as the “wisdom of the crowd” effect and was popularized by Surowiecki (2005). It essentially means that the aggregation of information in large, diverse groups results in decisions that are often better than could have been made by any single member of the group. The more diverse the group members are, the more diverse their perspectives and predictions will be, which often leads to better aggregated information. Think of estimating the number of jelly beans in a jar at a carnival. While any individual guess is likely to be way off, the average of the guesses tends to be a lot closer to the true number.

This is illustrated in the plot below, which compares bagging \(b = 100\) polynomial regression models, MARS models, and CART decision trees. You can see that the low variance base learner (polynomial regression) gains very little from bagging, while the higher variance learner (decision trees) gains significantly more. Not only does bagging help minimize the high variability (instability) of single trees, but it also helps to smooth out the prediction surface.

Optimal performance is often found by bagging 50–500 trees. Data sets that have a few strong predictors typically require fewer trees, whereas data sets with lots of noise or multiple strong predictors may need more. Using too many trees will not lead to overfitting. However, it’s important to realize that since multiple models are being run, the more iterations you perform, the greater the computational and time requirements. As these demands increase, performing k-fold CV can become computationally burdensome.

A benefit of creating ensembles via bagging, which is based on resampling with replacement, is that it can provide its own internal estimate of predictive performance via the out-of-bag (OOB) sample. The OOB sample can be used to test predictive performance, and the results usually compare well with those of k-fold CV, assuming your data set is sufficiently large (say \(n \geq 1,000\)). Consequently, as your data sets become larger and your bagging iterations increase, it is common to use the OOB error estimate as a proxy for predictive performance.

Think of the OOB estimate of generalization performance as an unstructured, but free CV statistic.
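In scikit-learn you can request this estimate directly by setting oob_score=True on BaggingRegressor; the fitted object then exposes oob_score_ (reported as R², the default regressor score) and oob_prediction_, from which you can compute an OOB RMSE. A minimal sketch; for brevity it uses only the numeric Ames features rather than the full preprocessing pipeline built later in this module.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# for brevity, use only the numeric Ames features here
X_num = X_train.select_dtypes("number")

# bagged trees with out-of-bag scoring enabled
oob_bag = BaggingRegressor(
  base_estimator=DecisionTreeRegressor(),
  n_estimators=100,
  oob_score=True,
  random_state=123
  ).fit(X_num, y_train.values.ravel())

oob_bag.oob_score_  # OOB R^2 (the default regressor score)
np.sqrt(np.mean((y_train.values.ravel() - oob_bag.oob_prediction_) ** 2))  # OOB RMSE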

Implementation

Basic implementation

Recall from the decision tree module that our optimally tuned single decision tree obtained an RMSE in the mid-$30K range. In this first implementation we’ll perform the same process; however, instead of a single decision tree we’ll use bagging to create and aggregate predictions across 5 bagged trees. In both examples (R & Python) we improve performance over the individual decision tree, as our RMSE is around $30K.

In Python we use BaggingRegressor and BaggingClassifier to perform bagging, and we specify the number of bagged estimators (trees in this example) with n_estimators. However, in this example we are going to ordinal encode our quality/condition features (e.g., Overall_Qual, Garage_Qual, Kitchen_Qual) and, as usual, we need to one-hot encode our remaining nominal features.

# Ordinal encode our quality-based features 
ord_cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical", 
       "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in ord_cols]
ord_encoder = OrdinalEncoder(cols=ord_cols, mapping=category_mapping)

# one hot encode remaining nominal features
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

# combine into a pre-processing pipeline
preprocessor = ColumnTransformer(
  remainder="passthrough",
  transformers=[
   ("ord_encode", ord_encoder, ord_cols),
   ("one-hot", encoder, selector(dtype_include="object")),
   ]
  )
# create bagging estimator
dt_bag = BaggingRegressor(base_estimator=DecisionTreeRegressor(), n_estimators=5)

# create modeling pipeline
model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("dt_bag", dt_bag),
])

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# fit model with 5-fold CV
results = cross_val_score(model_pipeline, X_train, y_train, cv=kfold, scoring=loss)

np.abs(np.mean(results))
## 31247.774043823047

In R we use baguette::bag_tree, which can be applied to decision tree or MARS models.

# create model recipe with all features
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )

# create bagged CART model object and
# start with 5 bagged trees
tree_mod <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("regression")

# create resampling procedure
set.seed(13)
kfold <- vfold_cv(ames_train, v = 5)

# train model
results <- fit_resamples(tree_mod, model_recipe, kfold)

# model results
collect_metrics(results)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   29850.        5 1270.     Preprocessor1_Model1
## 2 rsq     standard       0.861     5    0.0145 Preprocessor1_Model1

Tuning

One thing to note is that, typically, the more trees the better. As we add more trees we’re averaging over more high variance decision trees. Early on we see a dramatic reduction in variance (and hence our error), but eventually the error will typically flatten out and stabilize, signaling that a suitable number of trees has been reached. Often, we need only 50–100 trees to stabilize the error (in other cases we may need 500 or more).

Consequently, we can treat the number of trees as a tuning parameter, where our focus is to make sure we are using enough trees for our error metric to converge to an optimal value (minimizing RMSE in our case). However, we rarely run the risk of including too many trees, as our error metric typically will not degrade after we add more trees than necessary.

The primary concern is computational efficiency: we want enough trees to reach an optimal resampling error, but not many more than necessary, since the extra trees only slow down computation.
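Before committing to a full cross-validated grid search, a cheap way to watch this convergence is to refit the bagged model for a few values of n_estimators with oob_score=True and track the OOB RMSE as it stabilizes. The loop below is our own sketch, not part of the module's code; it reuses the preprocessor defined above, and with very small numbers of trees some observations may never be out-of-bag, so scikit-learn may emit a warning.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# numeric feature matrix from the preprocessor defined earlier
X_enc = preprocessor.fit_transform(X_train)

# track the OOB RMSE as the number of bagged trees grows
n_trees_grid = [5, 25, 50, 100, 200]
oob_rmse = []
for n in n_trees_grid:
    bag = BaggingRegressor(
      base_estimator=DecisionTreeRegressor(),
      n_estimators=n,
      oob_score=True,
      n_jobs=-1,
      random_state=123
      ).fit(X_enc, y_train.values.ravel())
    # OOB RMSE computed from the out-of-bag predictions
    oob_rmse.append(np.sqrt(np.mean((y_train.values.ravel() - bag.oob_prediction_) ** 2)))

pd.DataFrame({'n_trees': n_trees_grid, 'oob_rmse': oob_rmse})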

In this example we use a hyperparameter grid to assess 5, 25, 50, 100, and 200 trees. In both examples we see an initial decrease but as we start using 50+ trees we only see marginal changes in our RMSE, suggesting we have likely reached enough trees to minimize the RMSE and any additional differences are likely due to the standard error that occurs during the bootstrapping process.

# create bagging estimator, leaving the number of bagged trees unspecified
# (the grid search below supplies n_estimators through the hyperparameter grid)
dt_bag = BaggingRegressor(base_estimator=DecisionTreeRegressor())

# rebuild the modeling pipeline so it wraps this estimator
model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("dt_bag", dt_bag),
])

# Create grid of hyperparameter values
hyper_grid = {'dt_bag__n_estimators': [5, 25, 50, 100, 200]}

# Tune the bagged decision tree model using grid search
grid_search = GridSearchCV(model_pipeline, hyper_grid, cv=kfold, scoring=loss, n_jobs=-1)
results = grid_search.fit(X_train, y_train)

np.abs(results.cv_results_['mean_test_score'])
## array([30728.55446654, 28664.87222157, 28367.19060724, 27822.65237361,
##        27912.88097578])
# Best model's cross validated RMSE
abs(results.best_score_)
## 27822.65237360882
# Best model's number of trees
n_trees = results.best_estimator_.get_params().get('dt_bag__n_estimators')
n_trees
## 100

# create model recipe with all features
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )

# create bagged CART model object with
# tuning option set for number of bagged trees
tree_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("regression")

# create the hyperparameter grid
hyper_grid <- expand.grid(times = c(5, 25, 50, 100, 200))

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(tree_mod, model_recipe, resamples = kfold, grid = hyper_grid)


# model results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   times .metric .estimator   mean     n std_err .config             
##   <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1   200 rmse    standard   26376.     5    900. Preprocessor1_Model5
## 2   100 rmse    standard   26526.     5    904. Preprocessor1_Model4
## 3    25 rmse    standard   26713.     5   1037. Preprocessor1_Model2
## 4    50 rmse    standard   26803.     5   1179. Preprocessor1_Model3
## 5     5 rmse    standard   30405.     5   1152. Preprocessor1_Model1

Feature interpretation

Unfortunately, due to the bagging process, models that are normally perceived as interpretable are no longer so. However, we can still make inferences about how features are influencing our model. Recall in the decision tree module that we measure feature importance based on the sum of the reduction in the loss function (e.g., SSE) attributed to each variable at each split in a given tree.

For bagged decision trees, this process is similar. For each tree, we compute the sum of the reduction of the loss function across all splits. We then aggregate this measure across all trees for each feature. The features with the largest average decrease in SSE (for regression) are considered most important.

The following code extracts the feature importance values for our best-performing bagged decision tree model. In both cases we see that Overall_Qual is the most influential feature. We can then plot the partial dependence of this feature to see how it influences the predicted values. In both examples we see that as overall quality increases, the predicted sale price increases significantly.

# create final model object
X_encoded = preprocessor.fit_transform(X_train)
dt_bag = BaggingRegressor(
  base_estimator=DecisionTreeRegressor(), 
  n_estimators=n_trees
  )
dt_bag_fit = dt_bag.fit(X_encoded, y_train)

# extract feature importances
feature_importances = [tree.feature_importances_ for tree in dt_bag_fit.estimators_ ]
avg_feature_importances = np.mean(feature_importances, axis=0)
vi = pd.DataFrame({'feature': preprocessor.get_feature_names(),
                   'importance': avg_feature_importances})

# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')

# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
 + geom_point()
 + labs(y=None))

X_encoded = pd.DataFrame(X_encoded, columns=preprocessor.get_feature_names())
pd_results = partial_dependence(
  dt_bag_fit, X_encoded, "ord_encode__Overall_Qual", kind='average',
  percentiles=(0, 1)) 
  
pd_output = pd.DataFrame({'ord_encode__Overall_Qual': pd_results['values'][0],
                          'yhat': pd_results['average'][0]})
                          
(ggplot(pd_output, aes('ord_encode__Overall_Qual', 'yhat'))
  + geom_line())

# identify best model
lowest_rmse <- results %>%
  select_best("rmse")

# finalize best model...
final_model <- bag_tree() %>%
 set_engine("rpart", times = lowest_rmse$times) %>%
 set_mode("regression")

# ...and fit
final_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(final_model) %>%
  fit(data = ames_train)

# plot top 20 influential variables
vi <- final_fit$fit$fit$fit$imp %>%
 slice_max(order_by = value, n = 20)

ggplot(vi, aes(value, reorder(term, value))) +
 geom_col() +
 ylab(NULL) +
 xlab("Feature importance")

# prediction function
pdp_pred_fun <- function(object, newdata) {
  predict(object, newdata, type = "numeric")$.pred
}

# use the pdp package to extract partial dependence predictions
# and then plot
final_fit %>%
  pdp::partial(
   pred.var = "Overall_Qual", 
   pred.fun = pdp_pred_fun,
   grid.resolution = 10, 
   train = ames_train
  ) %>%
  ggplot(aes(Overall_Qual, yhat)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar)

Final thoughts

Bagging improves the prediction accuracy for high variance (and low bias) models at the expense of interpretability and computational speed. However, using various interpretability algorithms such as VIPs and PDPs, we can still make inferences about how our bagged model leverages feature information. Also, since bagging consists of independent processes, the algorithm is easily parallelizable.
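In scikit-learn this parallelism is exposed through the n_jobs argument, so the individual trees can be fit (and used for prediction) across all available cores. A one-line sketch; the object name dt_bag_parallel is just for illustration:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# fit the bagged trees in parallel, one worker per available core
dt_bag_parallel = BaggingRegressor(
  base_estimator=DecisionTreeRegressor(),
  n_estimators=100,
  n_jobs=-1,
  random_state=123
  )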

However, when bagging trees, a problem still exists. Although the model building steps are independent, the trees in bagging are not completely independent of each other, since all of the original features are considered at every split of every tree. Rather, trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree) due to underlying strong relationships in the data.

For example, if we create six decision trees with different bootstrapped samples of the Boston housing data (Harrison Jr and Rubinfeld 1978), we see a similar structure at the top of the trees. Although there are 15 predictor variables to split on, all six trees have both lstat and rm driving the first few splits.

We use the Boston housing data in this example because it has fewer features and shorter names than the Ames housing data. Consequently, it is easier to compare multiple trees side-by-side; however, the same tree correlation problem exists in the Ames bagged model.

Figure: Six decision trees based on different bootstrap samples.
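You can get a rough sense of this correlation without drawing full trees: fit a few trees to different bootstrap samples and check which feature drives the root split. The sketch below is our own illustration; it assumes a numeric copy of the Boston data saved at a hypothetical path data/boston.csv with cmedv as the response, so adjust the path (and encode any factor columns) for your copy of the data.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# hypothetical file location; point this at your own copy of the Boston housing data
boston = pd.read_csv("data/boston.csv")
X_b = boston.drop("cmedv", axis=1)
y_b = boston["cmedv"]

# fit six shallow trees to six different bootstrap samples and report each root split
for b in range(6):
    X_boot, y_boot = resample(X_b, y_b, random_state=b)
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_boot, y_boot)
    print(f"bootstrap sample {b}: root split on {X_b.columns[tree.tree_.feature[0]]}")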

Exercises

Using the Boston housing data set, where the response feature is the median value of homes within a census tract (cmedv):

  1. Apply a bagged decision tree model with all features.
  2. How many trees are required before the loss function stabilizes?
  3. Adjust different tuning parameters and assess how performance changes.
  4. How does the model performance compare to the decision tree model applied in the previous module’s exercise?
  5. Which 10 features are considered most influential? Are these the same features that have been influential in previous models?
  6. Create partial dependence plots for the top two most influential features. Explain the relationship between the feature and the predicted values.
  7. Now perform exercises 1–6 on the Attrition data set, which is a classification problem rather than a regression problem.


References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40.
Dietterich, Thomas G. 2000a. “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.” Machine Learning 40 (2): 139–57.
———. 2000b. “Ensemble Methods in Machine Learning.” In International Workshop on Multiple Classifier Systems, 1–15. Springer.
Harrison Jr, David, and Daniel L Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102.
Surowiecki, James. 2005. The Wisdom of Crowds. Anchor.

  1. Also commonly referred to as a meta-algorithm.