Tree-based models are a class of nonparametric algorithms that partition the feature space into a number of smaller (non-overlapping) regions with similar response values using a set of splitting rules. Predictions are obtained by fitting a simpler model (e.g., a constant such as the average response value) in each region. Such divide-and-conquer methods can produce simple rules that are easy to interpret and visualize with tree diagrams. As we’ll see, decision trees offer many benefits; however, they typically lag in predictive performance compared to more complex algorithms like neural networks and MARS. Future modules will discuss powerful ensemble algorithms, such as random forests and gradient boosting machines, which are constructed by combining many decision trees in a clever way. This module will provide you with a strong foundation in decision trees.

Learning objectives

By the end of this module you will know:

  • How decision tree models partition data and how tree depth impacts performance.
  • How to train, tune, and assess decision tree models.
  • How to identify important features and visualize their influence on the response.

Prerequisites

# Helper packages
import numpy as np
import pandas as pd
from plotnine import *
from matplotlib import pyplot as plt

# Modeling packages
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import partial_dependence
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(tidyverse)   # for data wrangling & plotting

# Modeling packages
library(tidymodels) 

# Model interpretability packages
library(vip)         # for variable importance
library(pdp)         # for variable relationships
library(rpart.plot)  # for plotting rpart decision trees
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

Structure

There are many methodologies for constructing decision trees, but the most well-known is the classification and regression tree (CART) algorithm proposed in Breiman (1984).1 A basic decision tree partitions the training data into homogeneous subgroups (i.e., groups with similar response values) and then fits a simple constant in each subgroup (e.g., the mean of the within-group response values for regression). The subgroups (also called nodes) are formed recursively using binary partitions formed by asking simple yes-or-no questions about each feature (e.g., is age < 18?). This is done a number of times until a suitable stopping criterion is satisfied (e.g., a maximum tree depth is reached). After all the partitioning has been done, the model predicts the output based on (1) the average response value of all observations that fall in that subgroup (regression problem), or (2) the class that has majority representation (classification problem). For classification, predicted probabilities can be obtained using the proportion of each class within the subgroup.

What results is an inverted tree-like structure such as that in the below figure. In essence, our tree is a set of rules that allows us to make predictions by asking simple yes-or-no questions about each feature. For example, if the customer is loyal, has household income greater than $150,000, and is shopping in a store, the exemplar tree diagram below would predict that the customer will redeem a coupon.

Figure: Exemplar decision tree predicting whether or not a customer will redeem a coupon (yes or no) based on the customer's loyalty, household income, last month's spend, coupon placement, and shopping mode.

We refer to the first subgroup at the top of the tree as the root node (this node contains all of the training data). The final subgroups at the bottom of the tree are called the terminal nodes or leaves. Every subgroup in between is referred to as an internal node. The connections between nodes are called branches.

Figure: Terminology of a decision tree.
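
To make the prediction rule described above concrete, here is a minimal sketch (building on the objects created in the Prerequisites section; the variable names below are illustrative) showing that a regression tree's prediction in each terminal node is simply the mean response of the training observations that land in that node.

# fit a shallow tree on two Ames features purely for illustration
leaf_demo = DecisionTreeRegressor(max_depth=2)
leaf_demo.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train["Sale_Price"])

# apply() returns the terminal node (leaf) id of each training observation
leaf_ids = leaf_demo.apply(X_train[["Gr_Liv_Area", "Year_Built"]])
leaf_means = y_train["Sale_Price"].groupby(leaf_ids).mean()

# the within-leaf means coincide with the tree's unique predicted values
print(np.sort(leaf_means.to_numpy()))
print(np.unique(leaf_demo.predict(X_train[["Gr_Liv_Area", "Year_Built"]])))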

Partitioning

As illustrated above, CART uses binary recursive partitioning (it’s recursive because each split, or rule, depends on the splits above it). The objective at each node is to find the “best” feature (\(x_i\)) with which to partition the remaining data into one of two regions (\(R_1\) and \(R_2\)) such that the overall error between the actual response (\(y_i\)) and the predicted constant in each region (\(c_1\) and \(c_2\)) is minimized. For regression problems, the objective function to minimize is the total SSE as defined in the following equation:

\[\begin{equation} SSE = \sum_{i \in R_1}\left(y_i - c_1\right)^2 + \sum_{i \in R_2}\left(y_i - c_2\right)^2 \end{equation}\]

For classification problems, the partitioning is usually made to maximize the reduction in cross-entropy or the Gini index (see model evaluation).2

In both regression and classification trees, the objective of partitioning is to minimize dissimilarity within the terminal nodes. We refer you to Therneau, Atkinson, and others (1997) for a more thorough discussion of binary recursive partitioning.

Having found the best feature/split combination, the data are partitioned into two regions and the splitting process is repeated on each of the two regions (hence the name binary recursive partitioning). This process is continued until a suitable stopping criterion is reached (e.g., a maximum depth is reached or the tree becomes “too complex”).
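
To make the search for the “best” split concrete, the following is a minimal, illustrative sketch (the best_split_sse() helper is hypothetical, not part of any library) that exhaustively evaluates candidate split points on a single feature and keeps the one minimizing the SSE objective defined above.

def best_split_sse(x, y):
    """Exhaustive search for the split point on one feature that minimizes SSE."""
    best_cut, best_sse = None, np.inf
    xs = np.sort(np.unique(x))
    for cut in (xs[:-1] + xs[1:]) / 2:   # candidate splits: midpoints between adjacent values
        left, right = y[x <= cut], y[x > cut]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut, best_sse

# toy data with a jump at x = 5; the best single split should land near 5
rng = np.random.default_rng(123)
x = rng.uniform(0, 10, 500)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.25, 500)
best_split_sse(x, y)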

It’s important to note that a single feature can be used multiple times in a tree. For example, say we have data generated from a simple \(\sin\) function with Gaussian noise: \(Y_i \stackrel{iid}{\sim} N\left(\sin\left(X_i\right), \sigma^2\right)\), for \(i = 1, 2, \dots, 500\). A regression tree built with a single split (often referred to as a decision stump) leads to the split occurring at \(x = 3.1\).

Figure: Decision tree illustrating the single split on feature x (left). The resulting decision boundary illustrates the predicted value when x < 3.1 (0.64), and when x > 3.1 (-0.67) (right).

If we build a deeper tree, we’ll continue to split on the same feature (\(x\)) as illustrated below. This is because \(x\) is the only feature available to split on, so the tree will continue finding the optimal splits along this feature’s values until a pre-determined stopping criterion is reached.

Figure: Decision tree with depth = 3, resulting in 7 decision splits along values of feature x and 8 prediction regions (left). The resulting decision boundary (right).

However, even when many features are available, a single feature may still dominate if it continues to provide the best split after each successive partition. For example, a decision tree applied to the iris data set (R. A. Fisher 1936) where the species of the flower (setosa, versicolor, and virginica) is predicted based on two features (sepal width and sepal length) results in an optimal decision tree with two splits on each feature. Also, note how the decision boundary in a classification problem results in rectangular regions enclosing the observations. The predicted value is the response class with the greatest proportion within the enclosed region.

Figure: Decision tree for the iris classification problem (left). The decision boundary results in rectangular regions that enclose the observations. The class with the highest proportion in each region is the predicted value (right).
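
Below is a minimal sketch of the iris example using scikit-learn (the tree depth is chosen arbitrarily for illustration, so it will not exactly reproduce the figure). The predict_proba() output shows the within-region class proportions that determine each region's predicted class.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# load iris and keep only the two sepal features discussed above
iris = load_iris(as_frame=True)
X_iris = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y_iris = iris.target

# fit a shallow classification tree
iris_tree = DecisionTreeClassifier(max_depth=3, random_state=123)
iris_tree.fit(X_iris, y_iris)

# predicted class = class with the highest proportion in the region;
# predict_proba() returns those within-region proportions
print(iris_tree.predict(X_iris.head()))
print(iris_tree.predict_proba(X_iris.head()))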

How deep?

This leads to an important question: how deep (i.e., complex) should we make the tree? If we grow an overly complex tree as in the below figure, we tend to overfit to our training data resulting in poor generalization performance.

Figure: Overfit decision tree with 56 splits.

Consequently, there is a balance to be achieved in the depth and complexity of the tree to optimize predictive performance on future unseen data. To find this balance, we have two primary approaches: (1) early stopping and (2) pruning.

Early stopping

Early stopping explicitly restricts the growth of the tree. There are several ways we can restrict tree growth but two of the most common approaches are to restrict the tree depth to a certain level or to restrict the minimum number of observations allowed in any terminal node. When limiting tree depth we stop splitting after a certain depth (e.g., only grow a tree that has a depth of 5 levels). The shallower the tree the less variance we have in our predictions; however, at some point we can start to inject too much bias as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in our data.

When restricting the minimum terminal node size (e.g., requiring that leaf nodes contain at least 10 observations for a prediction), we are deciding not to split intermediate nodes that contain too few data points. At one extreme, a minimum terminal node size of one allows a single observation to be captured in a leaf node and used as a prediction (in this case, we’re interpolating the training data). This results in high variance and poor generalizability. On the other hand, large values restrict further splits and therefore reduce variance.

These two approaches can be implemented independently of one another; however, they do have interaction effects as illustrated below.

Figure: Illustration of how early stopping affects the decision boundary of a regression decision tree. The columns illustrate how tree depth impacts the decision boundary and the rows illustrate how the minimum number of observations in the terminal node influences the decision boundary.
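
In scikit-learn, these two early-stopping controls map to the max_depth and min_samples_leaf arguments (min_samples_split plays a similar role for intermediate nodes). The sketch below uses two Ames features from the Prerequisites section, with values chosen purely for illustration.

shallow_tree = DecisionTreeRegressor(
    max_depth=5,          # stop splitting once the tree is 5 levels deep
    min_samples_leaf=10   # every terminal node must contain at least 10 observations
)
shallow_tree.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)
print(shallow_tree.get_depth(), shallow_tree.get_n_leaves())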

Pruning

An alternative to explicitly specifying the depth of a decision tree is to grow a very large, complex tree and then prune it back to find an optimal subtree. We find the optimal subtree by using a cost complexity parameter (\(\alpha\)) that penalizes our objective function for the number of terminal nodes of the tree (\(T\)) as in the following equation.

\[\begin{equation} \texttt{minimize} \left\{ SSE + \alpha \vert T \vert \right\} \end{equation}\]

For a given value of \(\alpha\) we find the smallest pruned tree that has the lowest penalized error. You may recognize the close association to the lasso penalty discussed in the regularized regression module. As with those regularization methods, smaller penalties tend to produce more complex models, which result in larger trees, whereas larger penalties result in much smaller trees. Consequently, as a tree grows larger, additional splits are retained only if the reduction in the SSE is greater than the cost complexity penalty. Typically, we evaluate multiple models across a spectrum of \(\alpha\) and use CV to identify the optimal value and, therefore, the optimal subtree that generalizes best to unseen data.

Figure: To prune a tree, we grow an overly complex tree (left) and then use a cost complexity parameter to identify the optimal subtree (right).
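
In scikit-learn, cost complexity pruning is exposed through the ccp_alpha argument and the cost_complexity_pruning_path() method. The sketch below (using two Ames features purely for illustration) extracts the sequence of candidate \(\alpha\) values and shows that larger penalties produce smaller trees; in practice we would tune ccp_alpha with cross validation, as shown later in this module.

# compute the pruning path (candidate alphas) for a fully grown tree
X_sub = X_train[["Gr_Liv_Area", "Year_Built"]]
path = DecisionTreeRegressor(random_state=123).cost_complexity_pruning_path(X_sub, y_train)

# each alpha corresponds to one candidate subtree; larger alpha -> fewer leaves
for alpha in path.ccp_alphas[::200]:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=123)
    pruned.fit(X_sub, y_train)
    print(round(alpha), pruned.get_n_leaves())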

Basic implementation

Simple model

To illustrate some of the concepts we’ve mentioned we’ll start by implementing models using just the Gr_Liv_Area and Year_Built features in our Ames housing data.

In Python we use the DecisionTreeRegressor and DecisionTreeClassifier classes from the sklearn.tree module. In this example we’ll fit a shallow tree that is only 3 levels deep.

# Step 1: create model object
dt_mod = DecisionTreeRegressor(max_depth=3)

# Step 2: fit/train model
dt_fit = dt_mod.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)

After 3 levels of decision splits, our model has 8 prediction leaf nodes.

dt_fit.get_n_leaves()
## 8

We can use the plot_tree() function to visualize our tree. This is only useful if we have a relatively small tree to visualize; however, most trees we will build will be far too large to attempt to visualize. In this case, we see that the root node (first node) splits our data based on X[1] (Year_Built). For those observations where the home is built after 1985 we follow the right half of the decision tree and for those where the home is built in or prior to 1985 we follow the left half of the decision tree.

plot_tree(dt_fit)
## [Text(271.25, 336.875, 'X[1] <= 1985.5\nmse = 6539446743.778\nsamples = 2051\nvalue = 182208.494'), Text(135.625, 240.625, 'X[0] <= 1800.5\nmse = 2248041088.563\nsamples = 1217\nvalue = 142158.823'), Text(67.8125, 144.375, 'X[0] <= 1225.0\nmse = 1189087362.552\nsamples = 1033\nvalue = 131862.157'), Text(33.90625, 48.125, 'mse = 794890572.988\nsamples = 570\nvalue = 118160.8'), Text(101.71875, 48.125, 'mse = 1158751047.885\nsamples = 463\nvalue = 148729.918'), Text(203.4375, 144.375, 'X[0] <= 2663.5\nmse = 4256298331.879\nsamples = 184\nvalue = 199965.652'), Text(169.53125, 48.125, 'mse = 3121739613.457\nsamples = 169\nvalue = 190768.521'), Text(237.34375, 48.125, 'mse = 5348662488.889\nsamples = 15\nvalue = 303586.667'), Text(406.875, 240.625, 'X[0] <= 1963.0\nmse = 7045589290.731\nsamples = 834\nvalue = 240650.279'), Text(339.0625, 144.375, 'X[0] <= 1517.0\nmse = 2972224827.522\nsamples = 640\nvalue = 213591.936'), Text(305.15625, 48.125, 'mse = 1303712716.309\nsamples = 313\nvalue = 186093.431'), Text(372.96875, 48.125, 'mse = 3152703070.673\nsamples = 327\nvalue = 239913.135'), Text(474.6875, 144.375, 'X[0] <= 2445.5\nmse = 10099963814.591\nsamples = 194\nvalue = 329914.918'), Text(440.78125, 48.125, 'mse = 5520113737.718\nsamples = 119\nvalue = 299000.748'), Text(508.59375, 48.125, 'mse = 13444354503.467\nsamples = 75\nvalue = 378965.4')]
plt.show()
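
If we want more readable output, we can pass the column names to plot_tree() (an optional tweak, not required by the text) so the splits are labeled with feature names rather than X[0] and X[1].

plot_tree(dt_fit, feature_names=["Gr_Liv_Area", "Year_Built"], filled=True)
plt.show()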

However, to understand how our model is performing we want to perform cross validation. In this example we will not limit the depth of our tree. We see that this single decision tree is not performing spectacularly with the RMSE across our 5 folds equalling about $50K.

# create DT model object
dt_mod = DecisionTreeRegressor()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# fit model with 5-fold CV
results = cross_val_score(dt_mod, X_train[["Gr_Liv_Area", "Year_Built"]], 
                          y_train, cv=kfold, scoring=loss)


np.round(np.abs(results))
## array([49067., 53235., 51495., 50527., 53605.])

In R we use the decision_tree() model and we’ll use the rpart package as our model engine. In this example we will not set a specific depth of our tree; rather, rpart automatically builds a fully deep tree and then prunes it to attempt to find an optimal tree depth.

# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = "regression") %>% set_engine("rpart")

# Step 2: create model recipe
model_recipe <- recipe(
    Sale_Price ~ Gr_Liv_Area + Year_Built, 
    data = ames_train
  )
  
# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = ames_train)

# Step 4: results
dt_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 2049 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2049 1.321981e+13 180922.6  
##    2) Year_Built< 1985.5 1228 2.698241e+12 141467.4  
##      4) Gr_Liv_Area< 1486 840 7.763942e+11 125692.6  
##        8) Year_Built< 1952.5 317 2.687202e+11 107761.5 *
##        9) Year_Built>=1952.5 523 3.439725e+11 136561.0 *
##      5) Gr_Liv_Area>=1486 388 1.260276e+12 175619.2  
##       10) Gr_Liv_Area< 2663.5 372 8.913502e+11 170223.5 *
##       11) Gr_Liv_Area>=2663.5 16 1.062939e+11 301068.8 *
##    3) Year_Built>=1985.5 821 5.750622e+12 239937.1  
##      6) Gr_Liv_Area< 1963 622 1.813069e+12 211699.5  
##       12) Gr_Liv_Area< 1501.5 285 3.483774e+11 182098.8 *
##       13) Gr_Liv_Area>=1501.5 337 1.003788e+12 236732.8  
##         26) Year_Built< 2004.5 198 3.906393e+11 217241.1 *
##         27) Year_Built>=2004.5 139 4.307663e+11 264498.0 *
##      7) Gr_Liv_Area>=1963 199 1.891416e+12 328197.2  
##       14) Gr_Liv_Area< 2390.5 107 5.903157e+11 290924.0  
##         28) Year_Built< 2004.5 69 1.168804e+11 253975.7 *
##         29) Year_Built>=2004.5 38 2.081946e+11 358014.5 *
##       15) Gr_Liv_Area>=2390.5 92 9.795556e+11 371547.5 *

If we plot our tree we see that the root node is the same as the Python example. For those observations where the home is built after 1985 we follow the right half of the decision tree and for those where the home is built in or prior to 1985 we follow the left half of the decision tree.

rpart.plot(dt_fit$fit$fit$fit)

However, to understand how our model is performing we want to perform cross validation. We see that this single decision tree is not performing spectacularly well with the average RMSE across our 5 folds equalling just under $50K.

# create resampling procedure
set.seed(13)
kfold <- vfold_cv(ames_train, v = 5)

# train model
results <- fit_resamples(dt_mod, model_recipe, kfold)

# model results
collect_metrics(results)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   47178.        5 1193.     Preprocessor1_Model1
## 2 rsq     standard       0.655     5    0.0129 Preprocessor1_Model1

Full model

Next, let’s fit a full model that includes all Ames housing features.

In Python we must convert our categorical features to numeric values. For this example we’ll simply dummy encode them using Pandas get_dummies(). We see some improvement in our model performance as our average cross validated RMSE is now in the low $40K range.

# create new feature set with encoded features
X_train_encoded = pd.get_dummies(X_train)

# fit model with 5-fold CV
results = cross_val_score(dt_mod, X_train_encoded, y_train, cv=kfold, scoring=loss)

np.abs(np.mean(results))
## 42758.270879934236
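
As an aside, the OneHotEncoder, ColumnTransformer, and make_column_selector imports from the Prerequisites could be used instead of get_dummies(). The hedged sketch below wraps the encoding and the tree in a pipeline so the encoding is learned within each CV fold; results should be similar to the get_dummies() approach above.

from sklearn.pipeline import Pipeline

# one-hot encode the string columns, pass numeric columns through untouched
preprocessor = ColumnTransformer(
  transformers=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"), selector(dtype_include=object))
    ],
  remainder="passthrough"
  )

# chain preprocessing and the decision tree, then cross validate the pipeline
dt_pipeline = Pipeline([("prep", preprocessor), ("tree", DecisionTreeRegressor())])
pipeline_results = cross_val_score(dt_pipeline, X_train, y_train, cv=kfold, scoring=loss)
np.abs(np.mean(pipeline_results))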

In R, we do not need to one-hot encode our features as rpart will naturally handle categorical features. By including all features we see some improvement in our model performance as our average cross validated RMSE is now in the low $40K range.

# create model recipe with all features
full_model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )

# train model
results <- fit_resamples(dt_mod, full_model_recipe, kfold)

# model results
collect_metrics(results)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   40510.        5 1694.     Preprocessor1_Model1
## 2 rsq     standard       0.745     5    0.0148 Preprocessor1_Model1

Tuning

As previously mentioned, the tree depth is the primary factor that impacts performance. We can control tree depth via a few different parameters:

  1. Max depth: we can explicitly state the maximum depth a tree can be grown.
  2. Minimum observations for a split: the minimum number of samples required to split an internal node. This limits a tree from continuing to grow as the number of observations in a given node becomes smaller.
  3. Cost complexity parameter: acts as a regularization mechanism by penalizing the objective function.

There is not one best approach to use, and often different combinations of these parameter settings improve model performance. The following will demonstrate a small grid search across 3 different values for each of these parameters (\(3^3 = 27\) total setting combinations).

Note that each language’s implementation has additional default parameters that differ, so the combination of values that optimizes the Python model will differ from the combination that optimizes the R model. In both examples we see that the optimal model decreases our average CV RMSE into the mid-to-upper $30K range.

It is common to run additional grid searches after the first. These follow-up searches use the results of the first grid to identify parameter values that perform well and then explore narrower ranges around those values.

# create model object
dt_mod = DecisionTreeRegressor()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# Create grid of hyperparameter values
hyper_grid = {
  'ccp_alpha': [1e-1, 1e-5, 1e-10],
  'max_depth': [1, 8, 15],
  'min_samples_split': [2, 21, 40]
  }
grid_search = GridSearchCV(dt_mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)

# Optimal penalty parameter in grid search
results.best_estimator_
## DecisionTreeRegressor(ccp_alpha=0.1, max_depth=15, min_samples_split=40)
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 37854.51
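
Following the note above about follow-up grid searches, here is a hedged sketch of a second, narrower grid centered on the values the first search favored (the specific ranges below are illustrative assumptions, not results from the text).

# zoom in around ccp_alpha ~ 0.1, max_depth ~ 15, min_samples_split ~ 40
hyper_grid_2 = {
  'ccp_alpha': [1e-2, 1e-1, 1e0],
  'max_depth': [10, 15, 20],
  'min_samples_split': [30, 40, 50]
  }

grid_search_2 = GridSearchCV(dt_mod, hyper_grid_2, cv=kfold, scoring=loss)
results_2 = grid_search_2.fit(X_train_encoded, y_train)

# inspect the refined winner and its cross validated RMSE
results_2.best_estimator_
round(abs(results_2.best_score_), 2)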

# create model object with tuning options
dt_mod <- decision_tree(
  mode = "regression",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
  ) %>% 
  set_engine("rpart")

# create the hyperparameter grid
hyper_grid <- grid_regular(
 cost_complexity(), 
 tree_depth(),
 min_n()
 )

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(dt_mod, full_model_recipe, resamples = kfold, grid = hyper_grid)

# get best results
show_best(results, metric = "rmse", n = 10)
## # A tibble: 10 x 9
##    cost_complexity tree_depth min_n .metric .estimator   mean     n std_err
##              <dbl>      <int> <int> <chr>   <chr>       <dbl> <int>   <dbl>
##  1    0.0000000001          8    21 rmse    standard   34883.     5    971.
##  2    0.00000316            8    21 rmse    standard   34883.     5    971.
##  3    0.0000000001         15    21 rmse    standard   34986.     5    962.
##  4    0.00000316           15    21 rmse    standard   34986.     5    962.
##  5    0.0000000001         15    40 rmse    standard   36018.     5   1028.
##  6    0.00000316           15    40 rmse    standard   36018.     5   1028.
##  7    0.0000000001          8    40 rmse    standard   36150.     5   1011.
##  8    0.00000316            8    40 rmse    standard   36150.     5   1011.
##  9    0.00000316            8     2 rmse    standard   37161.     5   1271.
## 10    0.0000000001          8     2 rmse    standard   37173.     5   1287.
## # … with 1 more variable: .config <chr>

Feature interpretation

We can use the same approach as in previous modules to find the most influential features in our decision tree models. In both R and Python, similar features emerge as the leading influencers. The partial dependence plots also show that the modeled relationships no longer assume any type of linear form.

# create final model object
best_mod = results.best_estimator_
best_mod_fit = best_mod.fit(X_train_encoded, y_train)

# extract feature importances
vi = pd.DataFrame({'feature': X_train_encoded.columns,
                   'importance': best_mod_fit.feature_importances_})

# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')

# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
 + geom_point()
 + labs(y=None))

pd_results = partial_dependence(
  best_mod_fit, X_train_encoded, "Garage_Cars", kind='average',
  percentiles=(0, 1)) 
  
pd_output = pd.DataFrame({'Garage_Cars': pd_results['values'][0],
                          'yhat': pd_results['average'][0]})
                          
(ggplot(pd_output, aes('Garage_Cars', 'yhat'))
  + geom_line())

best_model <- select_best(results, 'rmse')

final_model <- decision_tree(
  mode = "regression",
  cost_complexity = best_model$cost_complexity,
  tree_depth = best_model$tree_depth,
  min_n = best_model$min_n
  ) %>% 
  set_engine("rpart")

final_fit <- workflow() %>%
  add_recipe(full_model_recipe) %>%
  add_model(final_model) %>%
  fit(data = ames_train)

final_fit %>%
  pull_workflow_fit() %>%
  vip(20)

# prediction function
pdp_pred_fun <- function(object, newdata) {
  mean(predict(object, newdata, type = "numeric")$.pred)
}

# use the pdp package to extract partial dependence predictions
# and then plot
final_fit %>%
  pdp::partial(
   pred.var = "Overall_Qual", 
   pred.fun = pdp_pred_fun,
   grid.resolution = 10, 
   train = ames_train
  ) %>%
  ggplot(aes(yhat, Overall_Qual)) +
  geom_col() +
  scale_x_continuous(labels = scales::dollar)

Final thoughts

Decision trees have a number of advantages. Trees require very little pre-processing. This is not to say feature engineering may not improve upon a decision tree, but rather, that there are no pre-processing requirements. Monotonic transformations (e.g., \(\log\), \(\exp\), and \(\sqrt{}\)) are not required to meet algorithm assumptions as in many parametric models; instead, they only shift the location of the optimal split points. Outliers typically do not bias the results as much since the binary partitioning simply looks for a single location to make a split within the distribution of each feature.

Decision trees can easily handle categorical features without preprocessing. For unordered categorical features with more than two levels, the classes are ordered based on the outcome (for regression problems, the mean of the response is used and for classification problems, the proportion of the positive outcome class is used). For more details see Friedman, Hastie, and Tibshirani (2001), Breiman and Ihaka (1984), Ripley (2007), W. D. Fisher (1958), and Loh and Vanichsetakul (1988).
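
The ordering idea is easy to see with a quick, illustrative calculation (this is only a sketch of the principle, not rpart's internal code): for a regression problem, each level of an unordered categorical feature is ranked by its mean response before split points are searched.

# rank Neighborhood levels by mean Sale_Price (illustration of the ordering idea)
train.groupby("Neighborhood")["Sale_Price"].mean().sort_values().head()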

Missing values often cause problems with statistical models and analyses. Most procedures deal with them by refusing to deal with them—incomplete observations are tossed out. However, most decision tree implementations can easily handle missing values in the features and do not require imputation. This is handled in various ways but most commonly by creating a new “missing” class for categorical variables or using surrogate splits (see Therneau, Atkinson, and others (1997) for details).

However, individual decision trees generally do not achieve state-of-the-art predictive accuracy. In this module, we saw that the best pruned decision tree, although it performed better than linear regression, still had a comparatively poor cross-validated RMSE (in the mid-to-upper $30K range) relative to some of the other models we’ve built. This is driven by the fact that decision trees are composed of simple yes-or-no rules that create rigid, non-smooth decision boundaries. Furthermore, we saw that deep trees tend to have high variance (and low bias) while shallow trees tend to be overly biased (but low variance). In the modules that follow, we’ll see how we can combine multiple trees together into very powerful prediction models called ensembles.

Exercises

Using the Boston housing data set, where the response feature is the median value of homes within a census tract (cmedv):

  1. Apply a decision tree model with all features.
  2. How many internal splitting nodes optimize model performance?
  3. Can you identify the predicted values and SSE in the terminal nodes?
  4. Identify the first feature split node and explain how it is splitting this feature.
  5. Which 10 features are considered most influential? Are these the same features that have been influential in previous models?
  6. Now perform 1-5 on the Attrition data set, which is a classification problem rather than a regression problem.


References

Breiman, Leo. 1984. Classification and Regression Trees. Routledge.
Breiman, Leo, and Ross Ihaka. 1984. Nonlinear Discriminant Analysis via Scaling and ACE. Department of Statistics, University of California.
Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.
Fisher, Walter D. 1958. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association 53 (284): 789–98.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.
Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3): 651–74.
Kass, Gordon V. 1980. “An Exploratory Technique for Investigating Large Quantities of Categorical Data.” Applied Statistics, 119–27.
Loh, Wei-Yin, and Nunta Vanichsetakul. 1988. “Tree-Structured Classification via Generalized Discriminant Analysis.” Journal of the American Statistical Association 83 (403): 715–25.
Quinlan, J. Ross. 1986. “Induction of Decision Trees.” Machine Learning 1 (1): 81–106.
Quinlan, J. Ross, and others. 1996. “Bagging, Boosting, and C4.5.” In AAAI/IAAI, Vol. 1, 725–30.
Ripley, Brian D. 2007. Pattern Recognition and Neural Networks. Cambridge University Press.
Therneau, Terry M, Elizabeth J Atkinson, and others. 1997. “An Introduction to Recursive Partitioning Using the RPART Routines.” Mayo Foundation.

  1. Other decision tree algorithms include the Iterative Dichotomiser 3 (J. Ross Quinlan 1986), C4.5 (J. Ross Quinlan and others 1996), Chi-square automatic interaction detection (Kass 1980), Conditional inference trees (Hothorn, Hornik, and Zeileis 2006), and more.↩︎

  2. Gini index and cross-entropy are the two most commonly applied loss functions used for decision trees. Classification error is rarely used to determine partitions because it is less sensitive to poorly performing splits (Friedman, Hastie, and Tibshirani 2001).↩︎