Much like EDA, the ML process is very iterative and heuristic-based. With minimal knowledge of the problem or data at hand, it is difficult to know which ML method will perform best. This is known as the no free lunch theorem for ML (Wolpert 1996). Consequently, it is common for many ML approaches to be applied, evaluated, and modified before a final, optimal model can be determined. Performing this process correctly provides great confidence in our outcomes. If not, the results will be useless and, potentially, damaging.1

Approaching ML modeling correctly means approaching it strategically by spending our data wisely on learning and validation procedures, properly pre-processing the feature and target variables, minimizing data leakage, tuning hyperparameters, and assessing model performance. Many books and courses portray the modeling process as a short sprint. A better analogy would be a marathon where many iterations of these steps are repeated before eventually finding the final optimal model. This process is illustrated in the figure below.

General predictive machine learning process.

Learning objectives

Before introducing specific algorithms, this module and the next introduce concepts that are fundamental to the ML modeling process and that you will see briskly covered in future modules. By the end of this module you will be able to:

  1. Split your data into training and test sets.
  2. Instantiate, train, fit, and evaluate a basic model in both R and Python.
  3. Apply k-fold resampling and hyperparameter tuning procedures to improve the robustness and performance of a model.
  4. Put these steps together for an end-to-end ML process.

Prerequisites

This section leverages the following packages. We will demonstrate concepts on the Ames housing and employee attrition data sets.

# Helper packages
import numpy as np
import pandas as pd
import math
from plotnine import ggplot, aes, geom_density, geom_line, geom_point, ggtitle

# Modeling process
from sklearn.model_selection import train_test_split, KFold, RepeatedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error, roc_auc_score
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# Job attrition data
attrition = pd.read_csv("data/attrition.csv")

# Helper packages
library(tidyverse)

# Modeling process
library(tidymodels)
# Ames housing data
ames <- read_csv("data/ames.csv")

# Job attrition data
attrition <- read_csv("data/attrition.csv")

Data splitting

A major goal of the machine learning process is to find an algorithm \(f\left(X\right)\) that most accurately predicts future values (\(\hat{Y}\)) based on a set of features (\(X\)). In other words, we want an algorithm that not only fits well to our past data, but more importantly, one that predicts a future outcome accurately. This is called the generalizability of our algorithm. How we “spend” our data will help us understand how well our algorithm generalizes to unseen data.

To provide an accurate understanding of the generalizability of our final optimal model, we can split our data into training and test data sets:

  • Training set: these data are used to develop feature sets, train our algorithms, tune hyperparameters, compare models, and all of the other activities required to choose a final model (e.g., the model we want to put into production).
  • Test set: having chosen a final model, these data are used to estimate an unbiased assessment of the model’s performance, which we refer to as the generalization error.
Splitting data into training and test sets.

It is critical that the test set not be used prior to selecting your final model. Assessing results on the test set prior to final model selection biases the model selection process since the testing data will have become part of the model development process.

Given a fixed amount of data, typical recommendations for splitting your data into training-test splits include 60% (training)–40% (testing), 70%–30%, or 80%–20%. Generally speaking, these are appropriate guidelines to follow; however, it is good to keep the following points in mind:

  • Spending too much in training (e.g., \(>80\%\)) won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting).
  • Sometimes too much spent in testing (\(>40\%\)) won’t allow us to get a good assessment of model parameters.

Other factors should also influence the allocation proportions. For example, very large training sets (e.g., \(n > 100\texttt{K}\)) often result in only marginal gains compared to smaller sample sizes. Consequently, you may use a smaller training sample to increase computation speed (e.g., models built on larger training sets often take longer to score new data sets in production). In contrast, as \(p \geq n\) (where \(p\) represents the number of features), larger sample sizes are often required to identify consistent signals in the features.

The two most common ways of splitting data are simple random sampling and stratified sampling.

Simple random sampling

The simplest way to split the data into training and test sets is to take a simple random sample. This does not control for any data attributes, such as the distribution of your response variable (\(Y\)).

Sampling is a random process so setting the random number generator with a common seed allows for reproducible results. Throughout this course we’ll often use the seed 123 for reproducibility but the number itself has no special meaning.

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# dimensions of training data
train.shape
## (2051, 81)

With sufficient sample size, this sampling approach will typically result in a similar distribution of \(Y\) (e.g., Sale_Price in the ames data) between your training and test sets, as illustrated below.

train['id'] = 'train'
test['id'] = 'test'

(ggplot(pd.concat([train, test]), aes('Sale_Price', color='id'))
 + geom_density()
 + ggtitle("Random sampling with Python"))

# create train/test split
set.seed(123)  # for reproducibility
split  <- initial_split(ames, prop = 0.7)
train  <- training(split)
test   <- testing(split)

# dimensions of training data
dim(train)
## [1] 2051   81

With sufficient sample size, this sampling approach will typically result in a similar distribution of \(Y\) (e.g., Sale_Price in the ames data) between your training and test sets, as illustrated below.

train %>% 
  mutate(id = 'train') %>% 
  bind_rows(test %>% mutate(id = 'test')) %>%
  ggplot(aes(Sale_Price, color = id)) +
  geom_density() +
  ggtitle("Random sampling with R")

Stratified sampling

If we want to explicitly control the sampling so that our training and test sets have similar \(Y\) distributions, we can use stratified sampling. This is more common with classification problems where the response variable may be severely imbalanced (e.g., 90% of observations with response “Yes” and 10% with response “No”). However, we can also apply stratified sampling to regression problems for data sets that have a small sample size and where the response variable deviates strongly from normality. With a continuous response variable, stratified sampling will segment \(Y\) into quantiles and randomly sample from each.
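
The quantile idea in the last sentence can be sketched directly in Python: bin the continuous response with pd.qcut() and pass the bins to the stratify argument of train_test_split(). This is only a hedged sketch; the four-bin choice below is an arbitrary illustration, not a recommendation.

# bin Sale_Price into quartiles purely for stratification purposes
price_bins = pd.qcut(ames["Sale_Price"], q=4, labels=False)

# split while preserving the Sale_Price quartile proportions
train_q, test_q = train_test_split(ames, train_size=0.7, random_state=123,
                                   stratify=price_bins)

# quartile proportions are now nearly identical in both sets
price_bins.loc[train_q.index].value_counts(normalize=True)
price_bins.loc[test_q.index].value_counts(normalize=True)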

To perform stratified sampling in Python we simply apply the stratify argument in train_test_split().

y = attrition["Attrition"]
train_strat, test_strat = train_test_split(attrition, train_size=0.3, random_state=123, 
                                           stratify=y)

The following illustrates that in our original employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.

# response distribution for raw data
attrition["Attrition"].value_counts(normalize=True)
## No     0.838776
## Yes    0.161224
## Name: Attrition, dtype: float64

# response distribution for training data
train_strat["Attrition"].value_counts(normalize=True)
## No     0.839002
## Yes    0.160998
## Name: Attrition, dtype: float64

# response distribution for test data
test_strat["Attrition"].value_counts(normalize=True)
## No     0.838678
## Yes    0.161322
## Name: Attrition, dtype: float64

To perform stratified sampling in R we simply apply the strata argument in initial_split.

set.seed(123)
split_strat <- initial_split(attrition, prop = 0.7, strata = "Attrition")
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)

The following illustrates that in our original employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.

# original response distribution
table(attrition$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8387755 0.1612245
# response distribution for training data
table(train_strat$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8394942 0.1605058
# response distribution for test data
table(test_strat$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8371041 0.1628959

Class imbalances

Imbalanced data can have a significant impact on model predictions and performance (Kuhn and Johnson 2013). Most often this involves classification problems where one class has a very small proportion of observations (e.g., defaults - 5% versus nondefaults - 95%). Several sampling methods have been developed to help remedy class imbalance and most of them can be categorized as either up-sampling or down-sampling.

Down-sampling balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modeling. Furthermore, the reduced sample size reduces the computation burden imposed by further steps in the ML process.

In contrast, up-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated using repetition or bootstrapping.

There is no absolute advantage of one sampling method over another; which one to apply depends on the use case and the data set itself. A combination of over- and under-sampling is often successful, and a common approach is the Synthetic Minority Over-Sampling Technique, or SMOTE (Chawla et al. 2002).

Both R and Python provide many options for dealing with heavily imbalanced classes.
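
As a minimal, hedged sketch of these remedies (separate from the modeling pipeline used in this module), the snippet below assumes the third-party imbalanced-learn package is installed and shows random down-sampling and SMOTE up-sampling applied to the attrition response; the single feature is chosen purely for illustration.

# a minimal sketch, assuming the imbalanced-learn package is available
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X = attrition[["DistanceFromHome"]]  # illustrative feature choice
y = attrition["Attrition"]

# down-sample the majority class ("No") to match the minority class ("Yes")
X_down, y_down = RandomUnderSampler(random_state=123).fit_resample(X, y)
pd.Series(y_down).value_counts(normalize=True)

# up-sample the minority class with synthetic SMOTE observations
X_up, y_up = SMOTE(random_state=123).fit_resample(X, y)
pd.Series(y_up).value_counts(normalize=True)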

Creating models

Throughout this course we will apply many different models so you should become quite comfortable with the process. The process of fitting a model is relatively simple and has a similar feel between the two languages.

In Python, we are required to separate our features from our label into discrete data sets. For our first model we will simply use two features from our training data - total square feet of the home (Gr_Liv_Area) and year built (Year_Built) to predict the sale price.

Scikit-learn has many modules for different model types. One such module is sklearn.neighbors, which contains various methods for unsupervised and supervised neighbors-based learning models. In our example, we are going to apply a K-nearest neighbor regression model since Sale_Price is a continuous response. We’ll use KNeighborsRegressor to do so and, in this example, we’ll simply fit our model with 10 neighbors.

We will discuss K-nearest neighbor (KNN) models in detail in a later module, but for now just consider that we are trying to predict the price of a home based on the average price of the 10 other homes that seem most similar to it.

First we create the model object (knn) and then fit the model to our training data.

# separate features from labels
X_train = train[["Gr_Liv_Area", "Year_Built"]]
y_train = train["Sale_Price"]

# fit a KNN regression model with 10 neighbors
knn = KNeighborsRegressor(n_neighbors=10)
m1 = knn.fit(X_train, y_train)

m1
## KNeighborsRegressor(n_neighbors=10)

Now that we have fit our model, if we want to see our predictions we can simply apply predict() and feed it the data set we want to make predictions on:

m1.predict(X_train)
## array([203900. , 129440. , 147725. , ..., 342814.3, 183370. , 142476.9])

In R, we can either separate our features from our label into discrete data sets or we can use what is called a “functional form” (y ~ x) to specify our model. For our first model we will simply use two features from our training data - total square feet of the home (Gr_Liv_Area) and year built (Year_Built) to predict the sale price.

The parsnip package provides a common interface to train many different models supplied by other packages. To create and fit a model with parsnip we follow four steps:

  1. Create a model type. There are many different types of models, as we will see throughout this course, but in this example we want to create a KNN model object with nearest_neighbor fit to 10 neighbors.
  2. In R, there are many packages that provide different algorithms so we want to tell parsnip which package to use as the “engine” to apply KNN. In this example we will use the kknn package.
  3. Next we set the mode (i.e. regression or classification).
  4. And finally fit our model object to our training data.

We will discuss K-nearest neighbor (KNN) models in detail in a later module, but for now just consider that we are trying to predict the price of a home based on the average price of the 10 other homes that seem most similar to it.

First we create the model object (knn) and then fit the model to our training data.

knn <- nearest_neighbor(neighbors = 10) %>%
  set_engine("kknn") %>%
  set_mode("regression")

m1 <- fit(knn, formula = Sale_Price ~ Gr_Liv_Area + Year_Built, data = train)

m1
## parsnip model object
## 
## Fit time:  49ms 
## 
## Call:
## kknn::train.kknn(formula = Sale_Price ~ Gr_Liv_Area + Year_Built,     data = data, ks = min_rows(10, data, 5))
## 
## Type of response variable: continuous
## minimal mean absolute error: 26868.77
## Minimal mean squared error: 1620084976
## Best kernel: optimal
## Best k: 10

Now that we have fit our model, if we want to see our predictions we can simply apply predict() and feed it the data set we want to make predictions on:

m1 %>% predict(new_data = train)
## # A tibble: 2,051 x 1
##      .pred
##      <dbl>
##  1 199844.
##  2 191032 
##  3 181602 
##  4 279967.
##  5 108980.
##  6 178500.
##  7 176050 
##  8 119794 
##  9 115414 
## 10 123374.
## # … with 2,041 more rows

Evaluating models

It is important to understand how our model is performing. With ML models, measuring performance means understanding the predictive accuracy – the difference between a predicted value and the actual value. We measure predictive accuracy with loss functions.

There are many loss functions to choose from when assessing the performance of a predictive model, each providing a unique understanding of the predictive accuracy and differing between regression and classification models. Furthermore, the way a loss function is computed will tend to emphasize certain types of errors over others and can lead to drastic differences in how we interpret the “optimal model.” It’s important to consider the problem context when identifying the preferred performance metric to use. And when comparing multiple models, we need to compare them across the same metric.

Regression models

  • MSE: Mean squared error is the average of the squared error (\(MSE = \frac{1}{n} \sum^n_{i=1}(y_i - \hat y_i)^2\))2. The squared component results in larger errors having larger penalties. This (along with RMSE) is the most common error metric to use. Objective: minimize

  • RMSE: Root mean squared error. This simply takes the square root of the MSE metric (\(RMSE = \sqrt{\frac{1}{n} \sum^n_{i=1}(y_i - \hat y_i)^2}\)) so that your error is in the same units as your response variable. If your response variable units are dollars, the units of MSE are dollars-squared, but the RMSE will be in dollars. Objective: minimize

  • Deviance: Short for mean residual deviance. In essence, it provides a degree to which a model explains the variation in a set of data when using maximum likelihood estimation. Essentially this compares a saturated model (i.e. fully featured model) to an unsaturated model (i.e. intercept only or average). If the response variable distribution is Gaussian, then it will be approximately equal to MSE. When not, it usually gives a more useful estimate of error. Deviance is often used with classification models.3 Objective: minimize

  • MAE: Mean absolute error. Similar to MSE but rather than squaring, it just takes the mean absolute difference between the actual and predicted values (\(MAE = \frac{1}{n} \sum^n_{i=1}(\vert y_i - \hat y_i \vert)\)). This results in less emphasis on larger errors than MSE. Objective: minimize

  • RMSLE: Root mean squared logarithmic error. Similar to RMSE but it performs a log() on the actual and predicted values prior to computing the difference (\(RMSLE = \sqrt{\frac{1}{n} \sum^n_{i=1}(log(y_i + 1) - log(\hat y_i + 1))^2}\)). When your response variable has a wide range of values, large response values with large errors can dominate the MSE/RMSE metric. RMSLE minimizes this impact so that small response values with large errors can have just as meaningful of an impact as large response values with large errors. Objective: minimize

  • \(R^2\): This is a popular metric that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Unfortunately, it has several limitations. For example, two models built from two different data sets could have the exact same RMSE but if one has less variability in the response variable then it would have a lower \(R^2\) than the other. You should not place too much emphasis on this metric. Objective: maximize

Most models we assess in this course will report most, if not all, of these metrics. We will emphasize MSE and RMSE but it’s important to realize that certain situations warrant emphasis on some metrics more than others.

You will see differences between the R and Python loss scores in our code snippets. This is because there are some differences in the default settings between the languages’ functions.

The following illustrates how to compute the MSE and RMSE for our training set.

pred = m1.predict(X_train)

# compute MSE
mse = mean_squared_error(y_train, pred)
mse
## 1848235388.3486347
# compute RMSE
math.sqrt(mse)
## 42991.10824750433
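
The other regression metrics listed above can be computed in much the same way. The following is a small sketch (not a recommendation of any particular metric) that reuses the pred object from the previous chunk and assumes the extra sklearn.metrics functions are imported.

# additional regression metrics for the same predictions
from sklearn.metrics import mean_absolute_error, mean_squared_log_error, r2_score

mean_absolute_error(y_train, pred)                 # MAE
math.sqrt(mean_squared_log_error(y_train, pred))   # RMSLE
r2_score(y_train, pred)                            # R squared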

In R, the simple summary of our model object will provide the MSE.

m1
## parsnip model object
## 
## Fit time:  49ms 
## 
## Call:
## kknn::train.kknn(formula = Sale_Price ~ Gr_Liv_Area + Year_Built,     data = data, ks = min_rows(10, data, 5))
## 
## Type of response variable: continuous
## minimal mean absolute error: 26868.77
## Minimal mean squared error: 1620084976
## Best kernel: optimal
## Best k: 10

And we can compute the RMSE with the following:

pred <- m1 %>%  predict(new_data = train)

rmse_vec(truth = train$Sale_Price, estimate = pred$.pred)
## [1] 32913.04

Classification models

  • Misclassification: This is the overall error. For example, say you are predicting 3 classes (high, medium, low) and each class has 25, 30, 35 observations respectively (90 observations total). If you misclassify 3 observations of class high, 6 of class medium, and 4 of class low, then you misclassified 13 out of 90 observations resulting in a 14% misclassification rate. Objective: minimize

  • Mean per class error: This is the average error rate for each class. For the above example, this would be the mean of \(\frac{3}{25}, \frac{6}{30}, \frac{4}{35}\), which is 14.5%. If your classes are balanced this will be identical to misclassification. Objective: minimize

  • MSE: Mean squared error. Computes the distance from 1.0 to the probability suggested. So, say we have three classes, A, B, and C, and your model predicts a probability of 0.91 for A, 0.07 for B, and 0.02 for C. If the correct answer was A the \(MSE = 0.09^2 = 0.0081\), if it is B \(MSE = 0.93^2 = 0.8649\), if it is C \(MSE = 0.98^2 = 0.9604\). The squared component results in large differences in probabilities for the true class having larger penalties. Objective: minimize

  • Cross-entropy (aka Log Loss or Deviance): Similar to MSE but it incorporates a log of the predicted probability multiplied by the true class. Consequently, this metric disproportionately punishes predictions where we predict a small probability for the true class, which is another way of saying having high confidence in the wrong answer is really bad. Objective: minimize

  • Gini index: Mainly used with tree-based methods and commonly referred to as a measure of purity where a small value indicates that a node contains predominantly observations from a single class. Objective: minimize

When applying classification models, we often use a confusion matrix to evaluate certain performance measures. A confusion matrix is simply a matrix that compares actual categorical levels (or events) to the predicted categorical levels. When we predict the right level, we refer to this as a true positive. However, if we predict a level or event that did not happen, this is called a false positive (i.e. we predicted a customer would redeem a coupon and they did not). Alternatively, when we do not predict a level or event and it does happen, this is called a false negative (i.e. a customer that we did not predict to redeem a coupon does).

We can extract different levels of performance for binary classifiers. For example, given a confusion matrix with 100 true positives (TP), 50 true negatives (TN), 10 false positives (FP), and 5 false negatives (FN), we can assess the following (a short worked sketch of the arithmetic follows this list):

  • Accuracy: Overall, how often is the classifier correct? Opposite of misclassification above. Example: \(\frac{TP + TN}{total} = \frac{100+50}{165} = 0.91\). Objective: maximize

  • Precision: How accurately does the classifier predict events? This metric is concerned with maximizing the true positives to false positive ratio. In other words, for the number of predictions that we made, how many were correct? Example: \(\frac{TP}{TP + FP} = \frac{100}{100+10} = 0.91\). Objective: maximize

  • Sensitivity (aka recall): How accurately does the classifier classify actual events? This metric is concerned with maximizing the true positives to false negatives ratio. In other words, for the events that occurred, how many did we predict? Example: \(\frac{TP}{TP + FN} = \frac{100}{100+5} = 0.95\). Objective: maximize

  • Specificity: How accurately does the classifier classify actual non-events? Example: \(\frac{TN}{TN + FP} = \frac{50}{50+10} = 0.83\). Objective: maximize

  • AUC: Area under the curve. A good binary classifier will have high precision and sensitivity. This means the classifier does well when it predicts an event will and will not occur, which minimizes false positives and false negatives. To capture this balance, we often use a ROC curve that plots the false positive rate along the x-axis and the true positive rate along the y-axis. A line that is diagonal from the lower left corner to the upper right corner represents a random guess. The higher the line is in the upper left-hand corner, the better. AUC computes the area under this curve. Objective: maximize
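
To make the arithmetic above concrete, here is a small worked sketch that computes each measure directly from the example confusion-matrix counts used in the bullets (TP = 100, TN = 50, FP = 10, FN = 5).

# worked example using the confusion-matrix counts from the bullets above
TP, TN, FP, FN = 100, 50, 10, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 0.91
precision   = TP / (TP + FP)                   # 0.91
sensitivity = TP / (TP + FN)                   # 0.95 (aka recall)
specificity = TN / (TN + FP)                   # 0.83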

The following are examples of computing the AUC for classification models developed on the Attrition data in both R and Python. Do not be too concerned with understanding all the nuances. The main thing to note is that in both cases we follow a similar procedure of fitting our model, computing predicted values, and then comparing the predicted values to the actual values.

# convert response to binary ints
train_strat["Attrition"].replace(('Yes', 'No'), (1, 0), inplace=True)

# separate features from labels
X_train_strat = train_strat[["DistanceFromHome"]]
y_train_strat = np.array(train_strat["Attrition"])

# fit a KNN classification model with 10 neighbors
knn2 = KNeighborsClassifier(n_neighbors=10)
m2 = knn2.fit(X_train_strat, y_train_strat)

# make predictions
pred = m2.predict_proba(X_train_strat)

# compute AUC
roc_auc_score(y_train_strat, pred[:, 1])
## 0.6159497525694708

# convert response variable to a factor
train_strat$Attrition <- as.factor(train_strat$Attrition)

# fit model to attrition data
m2 <- nearest_neighbor(neighbors = 10) %>%
  set_engine("kknn") %>%
  set_mode("classification") %>%
  fit(Attrition ~ DistanceFromHome, data = train_strat)

# make predictions
pred <- m2 %>%  predict(new_data = train_strat, type = "prob")

# compute AUC
roc_auc_vec(truth = train_strat$Attrition, estimate = pred$.pred_Yes, event_level = "second")
## [1] 0.5322834

Resampling methods

In the data splitting section we split our data into training and testing sets. Furthermore, we were very explicit about the fact that we do not use the test set to assess model performance during the training phase. So how do we assess the generalization performance of the model?

One option is to assess an error metric based on the training data, as we demonstrated in the last section. Unfortunately, this leads to biased results as some models can perform very well on the training data but not generalize well to a new data set.

A second method is to use a validation approach, which involves splitting the training set further to create two parts: a training set and a validation set (or holdout set). We can then train our model(s) on the new training set and estimate the performance on the validation set. Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets (Molinaro, Simon, and Pfeiffer 2005; Hawkins, Basak, and Mills 2003). As the size of your data set reduces, this concern increases.

Although we stick to our definitions of test, validation, and holdout sets, these terms are sometimes used interchangeably in other literature and software. What’s important to remember is to always put a portion of the data under lock and key until a final model has been selected (we refer to this as the test data, but others refer to it as the holdout set).
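
As a hedged sketch of the single validation set approach (reusing X_train and y_train from the Ames model above; the 80/20 split and k = 10 are arbitrary choices), we could carve a validation set out of the training data and score against it:

# split the training data further into a smaller training set and a validation set
X_sub, X_valid, y_sub, y_valid = train_test_split(X_train, y_train,
                                                  train_size=0.8, random_state=123)

# fit on the reduced training set and estimate error on the validation set
knn_val = KNeighborsRegressor(n_neighbors=10).fit(X_sub, y_sub)
val_pred = knn_val.predict(X_valid)
math.sqrt(mean_squared_error(y_valid, val_pred))  # validation RMSE

As noted above, this single estimate can be quite variable, which motivates the resampling methods that follow.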

Resampling methods provide an alternative approach by allowing us to repeatedly fit a model of interest to parts of the training data and test its performance on other parts. Two commonly used resampling methods are k-fold cross validation and bootstrapping; here we focus on k-fold cross validation.

k-fold cross validation

k-fold cross-validation (aka k-fold CV) is a resampling method that randomly divides the training data into k groups (aka folds) of approximately equal size. The model is fit on \(k-1\) folds and then the remaining fold is used to compute model performance. This procedure is repeated k times; each time, a different fold is treated as the validation set. This process results in k estimates of the generalization error (say \(\epsilon_1, \epsilon_2, \dots, \epsilon_k\)). Thus, the k-fold CV estimate is computed by averaging the k test errors, providing us with an approximation of the error we might expect on unseen data.

Consequently, with k-fold CV, every observation in the training data will be held out one time to be included in the test set as illustrated in the figure below. In practice, one typically uses \(k = 5\) or \(k = 10\). There is no formal rule as to the size of k; however, as k gets larger, the difference between the estimated performance and the true performance to be seen on the test set will decrease. On the other hand, using too large a k can introduce computational burdens. Moreover, Molinaro, Simon, and Pfeiffer (2005) found that \(k=10\) performed similarly to leave-one-out cross validation (LOOCV), which is the most extreme approach (i.e., setting \(k = n\)).

10-fold cross validation on 32 observations. Each observation is used once for validation and nine times for training.

Although using \(k \geq 10\) helps to minimize the variability in the estimated performance, k-fold CV still tends to have higher variability than bootstrapping. Kim (2009) showed that repeating k-fold CV can help to increase the precision of the estimated generalization error. Consequently, for smaller data sets (say \(n < 10,000\)), 10-fold CV repeated 5 or 10 times will improve the accuracy of your estimated performance and also provide an estimate of its variability.

In Python we use KFold or RepeatedKFold to create k-fold objects and then cross_val_score to train our model across all k folds and return the loss score for each fold.

The unified scoring API in scikit-learn always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. Consequently, you can just interpret the RMSE values below as the \(RMSE \times -1\).

# define loss function
loss = 'neg_root_mean_squared_error'

# create 10 fold CV object
kfold = KFold(n_splits=10, random_state=123, shuffle=True)

# fit model with 10-fold CV
results = cross_val_score(m1, X_train, y_train, cv=kfold, scoring=loss)
results
## array([-45136.20437337, -38183.69975476, -54169.69935913, -45007.8848804 ,
##        -43120.93684721, -51383.34925249, -44315.09008126, -44256.86347971,
##        -46078.66311132, -56890.28665348])
# summary stats for all 10 folds
results.min()
## -56890.28665348469
results.max()
## -38183.699754758076
results.mean()
## -46854.26777931303
results.std()
## 5329.377726381466
# 10 fold cross validation repeated 5 times (total of 50 folds)
rfk = RepeatedKFold(n_splits=10, n_repeats=5, random_state=123)
results = cross_val_score(m1, X_train, y_train, cv=rfk, scoring=loss)
results
## array([-45136.20437337, -38183.69975476, -54169.69935913, -45007.8848804 ,
##        -43120.93684721, -51383.34925249, -44315.09008126, -44256.86347971,
##        -46078.66311132, -56890.28665348, -40017.91401001, -45828.46114923,
##        -47228.29689018, -46574.38484149, -52644.55477255, -42413.80172089,
##        -50421.09806576, -56747.68115255, -37440.54107935, -51244.11410457,
##        -47842.19983968, -41151.85985034, -42691.52559651, -50722.75189446,
##        -44479.58978808, -58290.46500997, -57282.16096375, -47068.2904626 ,
##        -39169.01810308, -46384.66365693, -45112.38095102, -48230.91228952,
##        -43541.52791392, -55709.18621144, -41914.9680488 , -47147.05729334,
##        -50511.91358351, -47055.73545207, -47265.80563254, -47956.27164465,
##        -54570.55728204, -43444.77925654, -50092.81342186, -39060.25044936,
##        -54300.45851559, -45321.71952644, -39325.38043486, -53356.9202388 ,
##        -49735.05010088, -45332.77656026])
# average RMSE across all 50 folds
results.mean()
## -47263.450311050954

# create 10 fold CV object
kfold <- vfold_cv(train, v = 10)
results <- fit_resamples(knn, Sale_Price ~ Gr_Liv_Area + Year_Built, kfold)

# see RMSE for all folds
collect_metrics(results, summarize = FALSE) %>% filter(.metric == "rmse")
## # A tibble: 10 x 5
##    id     .metric .estimator .estimate .config             
##    <chr>  <chr>   <chr>          <dbl> <chr>               
##  1 Fold01 rmse    standard      37637. Preprocessor1_Model1
##  2 Fold02 rmse    standard      51074. Preprocessor1_Model1
##  3 Fold03 rmse    standard      43234. Preprocessor1_Model1
##  4 Fold04 rmse    standard      32552. Preprocessor1_Model1
##  5 Fold05 rmse    standard      38617. Preprocessor1_Model1
##  6 Fold06 rmse    standard      37019. Preprocessor1_Model1
##  7 Fold07 rmse    standard      39953. Preprocessor1_Model1
##  8 Fold08 rmse    standard      37197. Preprocessor1_Model1
##  9 Fold09 rmse    standard      43373. Preprocessor1_Model1
## 10 Fold10 rmse    standard      40979. Preprocessor1_Model1
# average RMSE
collect_metrics(results, summarize = TRUE)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   40164.       10 1581.     Preprocessor1_Model1
## 2 rsq     standard       0.749    10    0.0111 Preprocessor1_Model1
# 10 fold cross validation repeated 5 times (total of 50 folds)
rfk <- vfold_cv(train, v = 10, repeats = 5)
results <- fit_resamples(knn, Sale_Price ~ Gr_Liv_Area + Year_Built, rfk)

# see RMSE for all folds
collect_metrics(results, summarize = FALSE) %>% filter(.metric == "rmse")
## # A tibble: 50 x 6
##    id      id2    .metric .estimator .estimate .config             
##    <chr>   <chr>  <chr>   <chr>          <dbl> <chr>               
##  1 Repeat1 Fold01 rmse    standard      41125. Preprocessor1_Model1
##  2 Repeat1 Fold02 rmse    standard      39116. Preprocessor1_Model1
##  3 Repeat1 Fold03 rmse    standard      43469. Preprocessor1_Model1
##  4 Repeat1 Fold04 rmse    standard      43534. Preprocessor1_Model1
##  5 Repeat1 Fold05 rmse    standard      42918. Preprocessor1_Model1
##  6 Repeat1 Fold06 rmse    standard      41332. Preprocessor1_Model1
##  7 Repeat1 Fold07 rmse    standard      41355. Preprocessor1_Model1
##  8 Repeat1 Fold08 rmse    standard      35364. Preprocessor1_Model1
##  9 Repeat1 Fold09 rmse    standard      34076. Preprocessor1_Model1
## 10 Repeat1 Fold10 rmse    standard      41396. Preprocessor1_Model1
## # … with 40 more rows
# average RMSE
collect_metrics(results, summarize = TRUE)
## # A tibble: 2 x 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   40377.       50 531.      Preprocessor1_Model1
## 2 rsq     standard       0.749    50   0.00514 Preprocessor1_Model1

Bias-variance trade-off

Prediction errors can be decomposed into two important subcomponents: error due to “bias” and error due to “variance.” There is often a tradeoff between a model’s ability to minimize bias and variance. Understanding how different sources of error lead to bias and variance helps us improve the data fitting process resulting in more accurate models.

Bias

Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. It measures how far off in general a model’s predictions are from the correct value, which provides a sense of how well a model can conform to the underlying structure of the data. The figure below illustrates an example where the polynomial model does not capture the underlying structure well. Linear models are classical examples of high bias models as they are less flexible and rarely capture non-linear, non-monotonic relationships.

We also need to think of bias-variance in relation to resampling. Models with high bias are rarely affected by the noise introduced by resampling. If a model has high bias, it will have consistency in its resampling performance as illustrated below:

Figure: A biased polynomial model fit to a single data set does not capture the underlying non-linear, non-monotonic data structure (left). Models fit to 25 bootstrapped replicates of the data are undeterred by the noise and generate similar, yet still biased, predictions (right).

Variance

On the other hand, error due to variance is defined as the variability of a model prediction for a given data point. Many models (e.g., k-nearest neighbor, decision trees, gradient boosting machines) are very adaptable and offer extreme flexibility in the patterns that they can fit to. However, these models offer their own problems as they run the risk of overfitting to the training data. Although you may achieve very good performance on your training data, the model will not automatically generalize well to unseen data.

Figure: A high variance k-nearest neighbor model fit to a single data set captures the underlying non-linear, non-monotonic data structure well but also overfits to individual data points (left). Models fit to 25 bootstrapped replicates of the data are deterred by the noise and generate highly variable predictions (right).

Since high variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk. Moreover, many algorithms that are capable of achieving high generalization performance have lots of hyperparameters that control the level of model complexity (i.e., the tradeoff between bias and variance).
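
As a hedged illustration of this point (reusing X_train, y_train, the kfold object, and the loss string from the resampling section; the two k values are arbitrary extremes), we can compare how much the cross-validated RMSE varies across folds for a very flexible versus a very rigid KNN model:

# a sketch comparing resampled error variability for a flexible (k = 2)
# versus a rigid (k = 150) KNN model
flexible = KNeighborsRegressor(n_neighbors=2)
rigid = KNeighborsRegressor(n_neighbors=150)

flex_scores = cross_val_score(flexible, X_train, y_train, cv=kfold, scoring=loss)
rigid_scores = cross_val_score(rigid, X_train, y_train, cv=kfold, scoring=loss)

# the high variance model typically shows more spread across the folds
flex_scores.std(), rigid_scores.std()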

Hyperparameter tuning

Hyperparameters (aka tuning parameters) are the “knobs to twiddle”4 to control the complexity of machine learning algorithms and, therefore, the bias-variance trade-off. Not all algorithms have hyperparameters (e.g., ordinary least squares5); however, most have at least one or more.

The proper setting of these hyperparameters is often dependent on the data and problem at hand and cannot always be estimated by the training data alone. Consequently, we need a method of identifying the optimal setting. For example, in the high variance example in the previous section, we illustrated a high variance k-nearest neighbor model. k-nearest neighbor models have a single hyperparameter (k) that determines the predicted value to be made based on the k nearest observations in the training data to the one being predicted. If k is small (e.g., \(k=3\)), the model will make a prediction for a given observation based on the average of the response values for the 3 observations in the training data most similar to the observation being predicted. This often results in highly variable predicted values because we are basing the prediction (in this case, an average) on a very small subset of the training data. As k gets bigger, we base our predictions on an average of a larger subset of the training data, which naturally reduces the variance in our predicted values (remember this for later, averaging often helps to reduce variance!). The figure below illustrates this point. Smaller k values (e.g., 2, 5, or 10) lead to high variance (but lower bias) and larger values (e.g., 150) lead to high bias (but lower variance). The optimal k value might exist somewhere between 20–50, but how do we know which value of k to use?

Figure: k-nearest neighbor model with differing values for k.

One way to perform hyperparameter tuning is to fiddle with hyperparameters manually until you find a great combination of hyperparameter values that result in high predictive accuracy (as measured using k-fold CV, for instance). However, this can be very tedious work depending on the number of hyperparameters. An alternative approach is to perform a grid search. A grid search is an automated approach to searching across many combinations of hyperparameter values.

For the simple example above, a grid search would predefine a candidate set of values for k (e.g., \(k = 1, 2, \dots, j\)) and perform a resampling method (e.g., k-fold CV) to estimate which k value generalizes the best to unseen data. The plots below illustrate the results from a grid search to assess \(k = 3, 5, \dots, 150\) using repeated 10-fold CV. The error rate displayed represents the average error for each value of k across all the repeated CV folds. On average, \(k=46\) was the optimal hyperparameter value to minimize error (in this case, RMSE, which was discussed earlier) on unseen data.

Figure: Results from a grid search for a k-nearest neighbor model assessing values for k ranging from 3-150. We see high error values due to high model variance when k is small and we also see high error values due to high model bias when k is large. The optimal model is found at k = 46.

Throughout this course you’ll be exposed to different approaches to performing grid searches. In the above example, we used a full cartesian grid search, which assesses every combination of the manually defined hyperparameter values. However, as models get more complex and offer more hyperparameters, this approach can become computationally burdensome and requires you to predefine the grid of hyperparameter settings to explore. Additional approaches we’ll illustrate include random grid searches (Bergstra and Bengio 2012), which explore randomly selected hyperparameter values from a range of possible values; early stopping, which allows you to stop a grid search once improvement in the error becomes marginal; adaptive resampling via futility analysis (Kuhn 2014), which adaptively resamples candidate hyperparameter values based on approximately optimal performance; and more.

The following provides an example of a full cartesian grid search.

In Python, we perform a grid search with GridSearchCV() and supply it a model object and hyperparameter values we want to assess. You’ll also notice that we supply it with the kfold object we created previously and the loss function we want to optimize for.

# basic model object
knn = KNeighborsRegressor()

# Create grid of hyperparameter values
hyper_grid = {'n_neighbors': range(2, 26)}

# Tune a knn model using grid search
grid_search = GridSearchCV(knn, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train, y_train)

# Best model's cross validated RMSE
abs(results.best_score_)
## 46635.769325114525
# Best model's k value
results.best_estimator_.get_params().get('n_neighbors')
## 13
# Plot all RMSE results
all_rmse = pd.DataFrame({'k': range(2, 26), 
                         'RMSE': np.abs(results.cv_results_['mean_test_score'])})

(ggplot(all_rmse, aes(x='k', y='RMSE'))
 + geom_line()
 + geom_point()
 + ggtitle("Cross validated grid search results"))

In R, we perform a grid search with tune_grid() and supply it a model object and hyperparameter values we want to assess. You’ll also notice that we supply it with the kfold object we created previously and a recipe() object. In the next module we’ll discuss the concept of recipes, but for now just think of it as another way to pass our specified functional form.

# model object
knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Create grid of hyperparameter values
hyper_grid <- expand.grid(neighbors = seq(2, 25, by = 1))

# model recipe
model_form <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = train)

# Tune a knn model using grid search
results <- tune_grid(knn, model_form, resamples = kfold, grid = hyper_grid)

show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   neighbors .metric .estimator   mean     n std_err .config              
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1        19 rmse    standard   39960.    10   1720. Preprocessor1_Model18
## 2        20 rmse    standard   39961.    10   1739. Preprocessor1_Model19
## 3        17 rmse    standard   39962.    10   1671. Preprocessor1_Model16
## 4        21 rmse    standard   39968.    10   1759. Preprocessor1_Model20
## 5        18 rmse    standard   39969.    10   1695. Preprocessor1_Model17
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(neighbors, mean)) +
  geom_line() +
  geom_point() +
  labs(x = "k", y = "RMSE", title = "Cross validated grid search results")

Putting the processes together

You’ve now been exposed to many of the fundamental pieces of an ML process. The following combines these code snippets into a larger end-to-end example to show you how they all come together. Rather than using just the two features we’ve included thus far (Gr_Liv_Area and Year_Built), we’ll include all numeric features.

To include the categorical features as well we will need to do some feature engineering, which we will discuss in the next module.

There will be differences in performance between R and Python because:

  1. The train/test splits will include different observations.
  2. The k folds will contain different observations.
  3. The default KNN models have additional parameters we didn’t discuss that will differ in their default settings between languages.

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels and only use numeric features
X_train = train.select_dtypes(include='number').drop("Sale_Price", axis=1)
y_train = train["Sale_Price"]

# create KNN model object
knn = KNeighborsRegressor()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 10 fold CV object
kfold = KFold(n_splits=10, random_state=123, shuffle=True)

# Create grid of hyperparameter values
hyper_grid = {'n_neighbors': range(2, 26)}

# Tune a knn model using grid search
grid_search = GridSearchCV(knn, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train, y_train)

# Best model's cross validated RMSE
abs(results.best_score_)
## 41915.408581298376
# Best model's k value
results.best_estimator_.get_params().get('n_neighbors')
## 5
# Plot all RMSE results
all_rmse = pd.DataFrame({'k': range(2, 26), 
                         'RMSE': np.abs(results.cv_results_['mean_test_score'])})

(ggplot(all_rmse, aes(x='k', y='RMSE'))
 + geom_line()
 + geom_point()
 + ggtitle("Cross validated grid search results"))
## <ggplot: (318264143)>

# create train/test split
set.seed(123)  # for reproducibility
split  <- initial_split(ames, prop = 0.7)
train  <- training(split)
test   <- testing(split)

# select only numeric features
train <- train %>% select_if(is.numeric)

# create resampling procedure
kfold <- vfold_cv(train, v = 10)

# model object
knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Create grid of hyperparameter values
hyper_grid <- expand.grid(neighbors = seq(2, 25, by = 1))

# model recipe - Sale price is a function of all numeric features
model_form <- recipe(Sale_Price ~ ., data = train)

# Tune a knn model using grid search
results <- tune_grid(knn, model_form, resamples = kfold, grid = hyper_grid)

# best model
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   neighbors .metric .estimator   mean     n std_err .config              
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1        18 rmse    standard   34202.    10   1732. Preprocessor1_Model17
## 2        17 rmse    standard   34203.    10   1748. Preprocessor1_Model16
## 3        19 rmse    standard   34210.    10   1718. Preprocessor1_Model18
## 4        16 rmse    standard   34216.    10   1766. Preprocessor1_Model15
## 5        20 rmse    standard   34220.    10   1705. Preprocessor1_Model19
# plot results
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(neighbors, mean)) +
  geom_line() +
  geom_point() +
  labs(x = "k", y = "RMSE", title = "Cross validated grid search results")

Exercises

  1. Load the Boston housing data set and split it into a training set and test set using a 70-30% split.

    • How many observations are in the training set and test set?
    • Compare the distribution of cmedv between the training set and test set.
  2. Fit a KNN model where \(k=5\) that uses all available features to predict cmedv. How does the MSE/RMSE compare across these models?

  3. Perform a 10-fold cross-validated KNN model, repeated 5 times, that uses all available features to predict cmedv.

    • What is the average RMSE across all 50 model iterations?
    • Plot the distribution of the RMSE across all 50 model iterations.
    • Describe the results.
  4. Now perform a hyperparameter grid search where k ranges from 2–20 and apply 10-fold CV repeated 5 times.


References

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13 (Feb): 281–305.
Breiman, Leo, and others. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231.
Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57.
Hawkins, Douglas M, Subhash C Basak, and Denise Mills. 2003. “Assessing Model Fit by Cross-Validation.” Journal of Chemical Information and Computer Sciences 43 (2): 579–86.
Kim, Ji-Hyun. 2009. “Estimating Classification Error Rate: Repeated Cross-Validation, Repeated Hold-Out and Bootstrap.” Computational Statistics & Data Analysis 53 (11): 3735–45.
Kuhn, Max. 2014. “Futility Analysis in the Cross-Validation of Machine Learning Models.” arXiv Preprint arXiv:1405.6974.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Molinaro, Annette M, Richard Simon, and Ruth M Pfeiffer. 2005. “Prediction Error Estimation: A Comparison of Resampling Methods.” Bioinformatics 21 (15): 3301–7.
Wolpert, David H. 1996. “The Lack of a Priori Distinctions Between Learning Algorithms.” Neural Computation 8 (7): 1341–90.

  1. See https://www.fatml.org/resources/relevant-scholarship for many discussions regarding implications of poorly applied and interpreted ML.

  2. This deviates slightly from the usual definition of MSE in ordinary linear regression, where we divide by \(n-p\) (to adjust for bias) as opposed to \(n\).

  3. See this StackExchange thread (http://bit.ly/what-is-deviance) for a good overview of deviance for different models and in the context of regression versus classification.

  4. This phrase comes from Brad Efron’s comments in Breiman and others (2001).

  5. At least in the ordinary sense. You could think of polynomial regression as having a single hyperparameter, the degree of the polynomial.