Linear models (LMs) provide a simple, yet effective, approach to predictive modeling. Moreover, when certain assumptions required by LMs are met (e.g., constant variance), the estimated coefficients are unbiased and, of all linear unbiased estimates, have the lowest variance. However, in today’s world, data sets being analyzed typically contain a large number of features. As the number of features grow, certain assumptions typically break down and these models tend to overfit the training data, causing our out of sample error to increase. Regularization methods provide a means to constrain or regularize the estimated coefficients, which can reduce the variance and decrease out of sample error.
By the end of this module you will know how to:
# Helper packages
import numpy as np
import pandas as pd
from plotnine import *
# Modeling packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
# Minimize convergence warning messages
import warnings
warnings.filterwarnings("ignore")
# Ames housing data
ames = pd.read_csv("data/ames.csv")
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]
# Helper packages
library(tidyverse) # general data munging & visualization
# Modeling packages
library(tidymodels)
# Model interpretability packages
library(vip) # for variable importance
# Stratified sampling with the rsample package
ames <- AmesHousing::make_ames()
set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
The easiest way to understand regularized regression is to explain how and why it is applied to ordinary least squares (OLS). The objective in OLS regression is to find the hyperplane (e.g., a straight line in two dimensions) that minimizes the sum of squared errors (SSE) between the observed and predicted response values (see Figure below). This means identifying the hyperplane that minimizes the grey lines, which measure the vertical distance between the observed (red dots) and predicted (blue line) response values.
More formally, the objective function being minimized can be written as:
\[\begin{equation} \text{minimize} \left( SSE = \sum^n_{i=1} \left(y_i - \hat{y}_i\right)^2 \right) \end{equation}\]
As we discussed in the linear regression module, the OLS objective function performs quite well when our data adhere to a few key assumptions:
For classical statistical inference procedures (e.g., confidence intervals based on the classic t-statistic) to be valid, we also need to make stronger assumptions regarding normality (of the errors) and homoscedasticity (i.e., constant error variance).
Many real-life data sets, like those common to text mining and genomic studies are wide, meaning they contain a larger number of features (\(p > n\)). As p increases, we’re more likely to violate some of the OLS assumptions and alternative approaches should be considered. This was briefly illustrated in the linear regression module where the presence of multicollinearity was diminishing the interpretability of our estimated coefficients due to inflated variance. By reducing multicollinearity, we were able to increase our model’s accuracy. Of course, multicollinearity can also occur when \(n > p\).
Having a large number of features invites additional issues in using classic regression models. For one, having a large number of features makes the model much less interpretable. Additionally, when \(p > n\), there are many (in fact infinite) solutions to the OLS problem! In such cases, it is useful (and practical) to assume that a smaller subset of the features exhibit the strongest effects (something called the bet on sparsity principle (see Hastie, Tibshirani, and Wainwright 2015, 2).). For this reason, we sometimes prefer estimation techniques that incorporate feature selection. One approach to this is called hard thresholding feature selection, which includes many of the traditional linear model selection approaches like forward selection and backward elimination. These procedures, however, can be computationally inefficient, do not scale well, and treat a feature as either in or out of the model (hence the name hard thresholding). In contrast, a more modern approach, called soft thresholding, slowly pushes the effects of irrelevant features toward zero, and in some cases, will zero out entire coefficients. As will be demonstrated, this can result in more accurate models that are also easier to interpret.
With wide data (or data that exhibits multicollinearity), one alternative to OLS regression is to use regularized regression (also commonly referred to as penalized models or shrinkage methods as in Friedman, Hastie, and Tibshirani (2001) and Kuhn and Johnson (2013)) to constrain the total size of all the coefficient estimates. This constraint helps to reduce the magnitude and fluctuations of the coefficients and will reduce the variance of our model (at the expense of no longer being unbiased—a reasonable compromise).
The objective function of a regularized regression model is similar to OLS, albeit with a penalty term \(P\).
\[\begin{equation} \text{minimize} \left( SSE + P \right) \end{equation}\]
This penalty parameter constrains the size of the coefficients such that the only way the coefficients can increase is if we experience a comparable decrease in the sum of squared errors (SSE).
This concept generalizes to all GLM models (e.g., logistic and Poisson regression) and even some survival models. So far, we have been discussing OLS and the sum of squared errors loss function. However, different models within the GLM family have different loss functions (see Chapter 4 of Friedman, Hastie, and Tibshirani (2001)). Yet we can think of the penalty parameter all the same—it constrains the size of the coefficients such that the only way the coefficients can increase is if we experience a comparable decrease in the model’s loss function.
There are three common penalty parameters we can implement:
Ridge regression (Hoerl and Kennard 1970) controls the estimated coefficients by adding \(\lambda \sum^p_{j=1} \beta_j^2\) to the objective function.
\[\begin{equation} \text{minimize } \left( SSE + \lambda \sum^p_{j=1} \beta_j^2 \right) \end{equation}\]
The size of this penalty, referred to as \(L^2\) (or Euclidean) norm, can take on a wide range of values, which is controlled by the tuning parameter \(\lambda\). When \(\lambda = 0\) there is no effect and our objective function equals the normal OLS regression objective function of simply minimizing SSE. However, as \(\lambda \rightarrow \infty\), the penalty becomes large and forces the coefficients toward zero (but not all the way). This is illustrated below where exemplar coefficients have been regularized with \(\lambda\) ranging from 0 to over 8,000.
Although these coefficients were scaled and centered prior to the analysis, you will notice that some are quite large when \(\lambda\) is near zero. Furthermore, you’ll notice that feature x1
has a large negative parameter that fluctuates until \(\lambda \approx 7\) where it then continuously shrinks toward zero. This is indicative of multicollinearity and likely illustrates that constraining our coefficients with \(\lambda > 7\) may reduce the variance, and therefore the error, in our predictions.
In essence, the ridge regression model pushes many of the correlated features toward each other rather than allowing for one to be wildly positive and the other wildly negative. In addition, many of the less-important features also get pushed toward zero. This helps to provide clarity in identifying the important signals in our data.
However, ridge regression does not perform feature selection and will retain all available features in the final model. Therefore, a ridge model is good if you believe there is a need to retain all features in your model yet reduce the noise that less influential variables may create (e.g., in smaller data sets with severe multicollinearity). If greater interpretation is necessary and many of the features are redundant or irrelevant then a lasso or elastic net penalty may be preferable.
The lasso (least absolute shrinkage and selection operator) penalty (Tibshirani 1996) is an alternative to the ridge penalty that requires only a small modification. The only difference is that we swap out the \(L^2\) norm for an \(L^1\) norm: \(\lambda \sum^p_{j=1} | \beta_j|\):
\[\begin{equation} \text{minimize } \left( SSE + \lambda \sum^p_{j=1} | \beta_j | \right) \end{equation}\]
Whereas the ridge penalty pushes variables to approximately but not equal to zero, the lasso penalty will actually push coefficients all the way to zero as illustrated in below. Switching to the lasso penalty not only improves the model but it also conducts automated feature selection.
In the figure above we see that when \(\lambda < 0.01\) all 15 variables are included in the model, when \(\lambda \approx 0.5\) 9 variables are retained, and when \(log\left(\lambda\right) = 1\) only 5 variables are retained. Consequently, when a data set has many features, lasso can be used to identify and extract those features with the largest (and most consistent) signal.
A generalization of the ridge and lasso penalties, called the elastic net (Zou and Hastie 2005), combines the two penalties:
\[\begin{equation} \text{minimize } \left( SSE + \lambda_1 \sum^p_{j=1} \beta_j^2 + \lambda_2 \sum^p_{j=1} | \beta_j | \right) \end{equation}\]
Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.
We will start by applying a simple ridge model but within each language tab we discuss how to apply a simple lasso and elastic net as well. In this example we simplify and focus on only four features: Gr_Liv_Area
, Year_Built
, Garage_Cars
, and Garage_Area
. In this example we will apply a \(\lambda\) penalty parameter equal to 1.
Since regularized methods apply a penalty to the coefficients, we need to ensure our coefficients are on a common scale. If not, then predictors with naturally larger values (e.g., total square footage) will be penalized more than predictors with naturally smaller values (e.g., total number of rooms).
The below applies a ridge model but you could just as easily apply a simple lasso and elastic net model with Lasso()
and ElasticNet()
. The alpha
parameter represents the \(\lambda\) penalty parameter.
Ridge()
, Lasso()
, and ElasticNet()
all provide a normalize
parameter that you can use to normalize your features. However, the features will be normalized before regression by subtracting the mean and dividing by the l2-norm rather than the mean. If you wish to standardize, we need to use the StandardScaler
preprocessor before calling fit on an estimator with normalize=False
.
# Step 1: get features of interest
features = X_train[["Gr_Liv_Area", "Year_Built", "Garage_Cars", "Garage_Area"]]
# Step 2: standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(features)
# Step 3: create model object
ridge_mod = Ridge(alpha=1)
# Step 4: fit/train model
ridge_fit = ridge_mod.fit(X_train_scaled, y_train)
ridge_fit.coef_
## array([[39936.69788684, 24309.03872837, 7826.82312427, 14097.84660676]])
There are a few engines that allow us to apply regularized models. The most popular is glmnet. In R, the \(\lambda\) penalty parameter is represented by penalty
and mixture
controls the type of penalty (0 = ridge, 1 = lasso, and any value in between represents an elastic net).
# Step 1: create ridge model object
ridge_mod <- linear_reg(penalty = 1, mixture = 0) %>%
set_engine("glmnet")
# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
Sale_Price ~ Gr_Liv_Area + Year_Built + Garage_Cars + Garage_Area,
data = ames_train
) %>%
step_normalize(all_predictors())
# Step 3: fit model workflow
ridge_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(ridge_mod) %>%
fit(data = ames_train)
# Step 4: extract and tidy results
ridge_fit %>%
pull_workflow_fit() %>%
tidy()
## # A tibble: 5 x 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 180923. 1
## 2 Gr_Liv_Area 37441. 1
## 3 Year_Built 22587. 1
## 4 Garage_Cars 10825. 1
## 5 Garage_Area 11495. 1
There are two main tuning parameters to consider in regularized models: the strength of the regularization parameter (\(\lambda\)) and the type of penalty (ridge, lasso, elastic net).
First, we’ll asses how the regularization strength impacts the performance of our model. In the examples that follow we will use all the features in our Ames housing data; however, we’ll stick to the basic preprocessing of standarizing our numeric features and one-hot encoding our categorical features.
First, we will standardize and one-hot encode our features.
# create new feature set with encoded features
preprocessor = ColumnTransformer(
remainder="passthrough",
transformers=[
("scale", StandardScaler(), selector(dtype_include="number")),
("one-hot", OneHotEncoder(), selector(dtype_include="object"))
])
X_train_encoded = preprocessor.fit_transform(X_train)
Next, we create a standaard Ridge model object. This same procedure could be done using Lasso()
or ElasticNet()
. The last step in this code chunk is to create a tuning grid of penalty parameter values to assess. This example will look at 50 values ranging from 5.0e-06 to 5.0e+09.
# create model object
ridge_mod = Ridge()
# define loss function
loss = 'neg_root_mean_squared_error'
# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)
# Create grid of hyperparameter values
hyper_grid = {'alpha': 10**np.linspace(10, -5, 50)*0.5}
Now we search for the optimal value using 5-fold cross validation procedure. We see that our optimal regularization strength is:
grid_search = GridSearchCV(ridge_mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)
# Optimal penalty parameter in grid search
results.best_estimator_
## Ridge(alpha=6.628556827950554)
Which provides a CV RMSE of the following. With minimal feature engineering this is already the best model performance we’ve obtained compared to previous modules.
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 30085.16
In R, we’ll use penalty = tune()
when we create our model object. This allows us to create a placeholder for the parameter we want to tune. We also create a tuning grid containing 50 different lambda penalty values to assess. These values range from 1.0e-05 to 1.0+5 (input range is -10 to 5 but these values are log10 transformed).
Our final results are very similar to those found in the Python code chunks. We see that the optimal penalty parameter produces a 5-fold CV RMSE of just under $31K, which is the best model performance we’ve obtained compared to previous modules.
# create linear model object
ridge_mod <- linear_reg(penalty = tune(), mixture = 0) %>%
set_engine("glmnet")
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our model recipe with a tuning option for number of components
model_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# create a hyper parameter tuning grid for penalty
hyper_grid <- grid_regular(penalty(range = c(-10, 5)), levels = 50)
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(ridge_mod, model_recipe, resamples = folds, grid = hyper_grid)
# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 24421. rmse standard 31620. 5 3051. Preprocessor1_Model48
## 2 49417. rmse standard 31655. 5 2645. Preprocessor1_Model49
## 3 12068. rmse standard 32049. 5 3353. Preprocessor1_Model47
## 4 5964. rmse standard 32649. 5 3579. Preprocessor1_Model46
## 5 100000 rmse standard 32684. 5 2121. Preprocessor1_Model50
Often, the optimal model includes a penalty that is some combination of both \(L^1\) and \(L^2\) (elastic net), thus we want to tune both the penalty strength and the type of penalty. The following performs this grid search.
In Python we’ll use ElasticNet()
. ElasticNet()
includes an l1_ratio
parameter that can be set between 0-1. When l1_ratio=1
the results are the same as Ridge()
, when l1_ratio=0
the results are the same as Lasso()
, and values in between performs an elastic net with a specified \(L^1\) and \(L^2\) ratios (i.e. l1_ratio=0.75
uses a 75% \(L^1\) and 25% \(L^2\) penalty).
You may need to increase the tol
parameter so the algorithm converges. This is a common concern for lasso models in sci-kit learn. The default tol
erance level is 0.0001, which I increase in the below model. See here for more discussion.
# create model object
mod = ElasticNet(tol=0.01)
# Create grid of hyperparameter values
hyper_grid = {
'alpha': 10**np.linspace(10, -5, 10)*0.5,
'l1_ratio': (0, 0.25, 0.5, 0.75 , 1)
}
# 5-fold CV grid search
grid_search = GridSearchCV(mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)
# Optimal penalty parameter in grid search
results.best_estimator_
## ElasticNet(alpha=23.207944168063865, l1_ratio=1, tol=0.01)
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 29552.31
In R we use tune()
to tune the penalty strength (penalty
) and the combination (aka mixture
) of \(L^1\) and \(L^2\) penalty. We do so by creating a hyper grid that not only includes the penalty strength values but also 5 values of \(L^1\) and \(L^2\) combinations (0, 0.25, 0.50, 0.75, 1).
# create linear model object
mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our model recipe with a tuning option for number of components
model_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# create a hyper parameter tuning grid for penalty
hyper_grid <- grid_regular(
penalty(range = c(-10, 5)),
mixture(),
levels = c(mixture = 5, penalty = 50)
)
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(mod, model_recipe, resamples = folds, grid = hyper_grid)
# get best results
show_best(results)
## # A tibble: 5 x 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 24421. 0 rmse standard 31992. 5 2296. Preprocessor1_Model048
## 2 49417. 0 rmse standard 32032. 5 1969. Preprocessor1_Model049
## 3 12068. 0 rmse standard 32390. 5 2493. Preprocessor1_Model047
## 4 2947. 0.25 rmse standard 32487. 5 2534. Preprocessor1_Model095
## 5 1456. 0.5 rmse standard 32692. 5 2601. Preprocessor1_Model144
On thing we have not talked about yet is, once we have a final model how do we know which features are most influential to our model. This concept is known as variable/feature importance. For multiple linear regression models, this is most often measured by the absolute value of the t-statistic for each model parameter used; though simple, the results can be hard to interpret when the model includes interaction effects and complex transformations.
For regularized models, importance is determined by magnitude of the standardized coefficients. Recall that ridge, lasso, and elastic net models push non-influential features to zero (or near zero). Consequently, very small coefficient values represent features that are not very important while very large coefficient values (whether negative or positive) represent very important features.
In Python we need to do a manual process to extract the feature names. We than get the best CV model’s coefficients and combine into a DataFrame. For now we’ll just focus on the top 25 most influential features in our model.
# get feature names
feat_names = ColumnTransformer(
remainder="passthrough",
transformers=[
("one-hot", OneHotEncoder(), selector(dtype_include="object"))
]).fit(X_train).get_feature_names()
# create DataFrame with feature name and coefficients from best model
coef = pd.DataFrame({'feature': feat_names,
'coef': results.best_estimator_.coef_})
# indicate if coefficient is positive or negative
coef['abs_coef'] = coef['coef'].abs()
coef['impact'] = np.where(coef['coef']>0, 'positive', 'negative')
# filter for the top 25
top_25_features = coef.nlargest(25, 'abs_coef')
top_25_features.head()
## feature coef abs_coef impact
## 163 one-hot__x19_Stone -550233.433278 550233.433278 negative
## 327 Bsmt_Full_Bath -390163.485390 390163.485390 negative
## 121 one-hot__x15_Very_Good -117117.158839 117117.158839 negative
## 89 one-hot__x11_RRAn 99099.308207 99099.308207 positive
## 317 Year_Remod_Add 80523.104479 80523.104479 positive
We can then take this feature importance DataFrame and plot the coefficient magnitudes. We can see two very influential features, both of which are negative, and then another 6 strongly influential features before tailing off to much smaller magnitudes.
# plot feature importance
(ggplot(top_25_features, aes(x='abs_coef',
y='reorder(feature, abs_coef)',
color='impact',
fill='impact'))
+ geom_point()
+ labs(y=None))
## <ggplot: (320018390)>
In R we can use the vip package to simplify the process of extracting feature importance.
# identify best model
lowest_rmse <- results %>%
select_best("rmse")
# extract best model workflow
final_model <- finalize_workflow(
workflow() %>% add_recipe(model_recipe) %>% add_model(ridge_mod),
lowest_rmse)
# extract feature importance for top 25 most influential features
top_25_features <- final_model %>%
fit(ames_train) %>%
pull_workflow_fit() %>%
vi(lambda = lowest_rmse$penalty) %>%
mutate(
Importance = abs(Importance),
Variable = fct_reorder(Variable, Importance)
) %>%
top_n(25, wt = Importance)
And finally we can plot the feature importance to identify those features most influential (whether positive or negative) in predicting outcomes.
ggplot(top_25_features, aes(x = Importance, y = Variable, fill = Sign)) +
geom_col() +
scale_x_continuous(expand = c(0, 0)) +
labs(y = NULL)
Regularized regression provides many great benefits over traditional GLMs when applied to large data sets with lots of features. It provides a great option for handling the \(n > p\) problem, helps minimize the impact of multicollinearity, and can perform automated feature selection. It also has relatively few hyperparameters which makes them easy to tune, computationally efficient compared to other algorithms discussed in later modules, and memory efficient.
However, similar to GLMs, they are not robust to outliers in both the feature and target. Also, regularized regression models still assume a monotonic linear relationship (always increasing or decreasing in a linear fashion). It is also up to the analyst whether or not to include specific interaction effects.
Using the boston housing dataset:
medv
being the response variable. Perform a cross-validation procedure and tune the model across various penalty parameter values.
medv
being the response variable. Perform a cross-validation procedure and tune the model across various penalty parameter values.