Linear models (LMs) provide a simple, yet effective, approach to predictive modeling. Moreover, when certain assumptions required by LMs are met (e.g., constant variance), the estimated coefficients are unbiased and, of all linear unbiased estimates, have the lowest variance. However, in today’s world, the data sets being analyzed typically contain a large number of features. As the number of features grows, certain assumptions typically break down and these models tend to overfit the training data, causing our out-of-sample error to increase. Regularization methods provide a means to constrain (or regularize) the estimated coefficients, which can reduce the variance and decrease out-of-sample error.

Learning objectives

By the end of this module you will know how to:

  • Apply ridge, lasso, and elastic net regularized models.
  • Tune the magnitude and type of the regularization penalty to find an optimal model.
  • Extract and visualize the most influential features.

Prerequisites

# Helper packages
import numpy as np
import pandas as pd
from plotnine import *

# Modeling packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# Minimize convergence warning messages
import warnings
warnings.filterwarnings("ignore")
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(tidyverse) # general data munging & visualization

# Modeling packages
library(tidymodels)

# Model interpretability packages
library(vip)      # for variable importance
# Stratified sampling with the rsample package
ames <- AmesHousing::make_ames()
set.seed(123)  # for reproducibility
split  <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- training(split)
ames_test   <- testing(split)

Why regularize?

The easiest way to understand regularized regression is to explain how and why it is applied to ordinary least squares (OLS). The objective in OLS regression is to find the hyperplane (e.g., a straight line in two dimensions) that minimizes the sum of squared errors (SSE) between the observed and predicted response values (see Figure below). This means identifying the hyperplane that minimizes the grey lines, which measure the vertical distance between the observed (red dots) and predicted (blue line) response values.

Figure: Fitted regression line using Ordinary Least Squares.

More formally, the objective function being minimized can be written as:

\[\begin{equation} \text{minimize} \left( SSE = \sum^n_{i=1} \left(y_i - \hat{y}_i\right)^2 \right) \end{equation}\]
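To make this objective concrete, here is a minimal sketch (on simulated data, not the Ames data) showing that the OLS fit is the line with the smallest SSE; nudging the fitted slope away from the OLS estimate increases the SSE.

# simulated data for illustration only: one feature, linear signal plus noise
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(123)
x = rng.uniform(0, 10, size=(100, 1))
y = 5 + 2.5 * x[:, 0] + rng.normal(scale=2, size=100)

# OLS chooses the intercept and slope that minimize the sum of squared errors
ols = LinearRegression().fit(x, y)
sse = np.sum((y - ols.predict(x)) ** 2)

# any other line has a larger SSE; e.g., perturb the slope slightly
y_hat_alt = ols.intercept_ + (ols.coef_[0] + 0.5) * x[:, 0]
sse_alt = np.sum((y - y_hat_alt) ** 2)
sse < sse_alt
## True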

As we discussed in the linear regression module, the OLS objective function performs quite well when our data adhere to a few key assumptions:

  • Linear relationship;
  • There are more observations (n) than features (p) (\(n > p\));
  • No or little multicollinearity.

For classical statistical inference procedures (e.g., confidence intervals based on the classic t-statistic) to be valid, we also need to make stronger assumptions regarding normality (of the errors) and homoscedasticity (i.e., constant error variance).

Many real-life data sets, like those common to text mining and genomic studies, are wide, meaning they contain more features than observations (\(p > n\)). As p increases, we’re more likely to violate some of the OLS assumptions, and alternative approaches should be considered. This was briefly illustrated in the linear regression module, where the presence of multicollinearity was diminishing the interpretability of our estimated coefficients due to inflated variance. By reducing multicollinearity, we were able to increase our model’s accuracy. Of course, multicollinearity can also occur when \(n > p\).

Having a large number of features invites additional issues in using classic regression models. For one, having a large number of features makes the model much less interpretable. Additionally, when \(p > n\), there are many (in fact, infinitely many) solutions to the OLS problem! In such cases, it is useful (and practical) to assume that a smaller subset of the features exhibit the strongest effects (something called the bet on sparsity principle; see Hastie, Tibshirani, and Wainwright 2015). For this reason, we sometimes prefer estimation techniques that incorporate feature selection. One approach to this is called hard thresholding feature selection, which includes many of the traditional linear model selection approaches like forward selection and backward elimination. These procedures, however, can be computationally inefficient, do not scale well, and treat a feature as either in or out of the model (hence the name hard thresholding). In contrast, a more modern approach, called soft thresholding, slowly pushes the effects of irrelevant features toward zero, and in some cases, will zero out coefficients entirely. As will be demonstrated, this can result in more accurate models that are also easier to interpret. The sketch below contrasts the two ideas.
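As a small illustration of the difference, the following sketch applies a soft-thresholding operator (the building block behind lasso-style estimation) to a set of hypothetical coefficient values; nothing here comes from the Ames analysis. Hard thresholding keeps or drops each coefficient outright, while soft thresholding shrinks every coefficient and zeroes out the smallest ones.

import numpy as np

def soft_threshold(beta, penalty):
    # shrink each coefficient toward zero; any coefficient whose magnitude
    # falls below the penalty is set exactly to zero
    return np.sign(beta) * np.maximum(np.abs(beta) - penalty, 0)

# hypothetical OLS coefficient estimates
beta_ols = np.array([4.0, -2.5, 0.8, -0.3, 0.1])

# hard thresholding: a feature is either fully in or fully out of the model
np.where(np.abs(beta_ols) > 1.0, beta_ols, 0)
## array([ 4. , -2.5,  0. ,  0. ,  0. ])

# soft thresholding: surviving coefficients are also shrunk toward zero
soft_threshold(beta_ols, 1.0)
## array([ 3. , -1.5,  0. , -0. ,  0. ])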

With wide data (or data that exhibits multicollinearity), one alternative to OLS regression is to use regularized regression (also commonly referred to as penalized models or shrinkage methods as in Friedman, Hastie, and Tibshirani (2001) and Kuhn and Johnson (2013)) to constrain the total size of all the coefficient estimates. This constraint helps to reduce the magnitude and fluctuations of the coefficients and will reduce the variance of our model (at the expense of no longer being unbiased—a reasonable compromise).

The objective function of a regularized regression model is similar to OLS, albeit with a penalty term \(P\).

\[\begin{equation} \text{minimize} \left( SSE + P \right) \end{equation}\]

This penalty parameter constrains the size of the coefficients such that the only way the coefficients can increase is if we experience a comparable decrease in the sum of squared errors (SSE).

This concept generalizes to all GLM models (e.g., logistic and Poisson regression) and even some survival models. So far, we have been discussing OLS and the sum of squared errors loss function. However, different models within the GLM family have different loss functions (see Chapter 4 of Friedman, Hastie, and Tibshirani (2001)). Yet we can think of the penalty parameter all the same—it constrains the size of the coefficients such that the only way the coefficients can increase is if we experience a comparable decrease in the model’s loss function.
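For example, the same style of penalty can be attached to a logistic regression in scikit-learn, where the loss being constrained is the binomial deviance rather than SSE. This is a minimal sketch on simulated classification data, unrelated to the Ames analysis.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# simulated classification data for illustration only
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=123)
X = StandardScaler().fit_transform(X)

# C is the inverse of the regularization strength (smaller C = stronger penalty);
# the penalty constrains the coefficients while the loss is the logistic deviance
ridge_logit = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
lasso_logit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# the L1-style penalty typically zeroes out some coefficients entirely
(lasso_logit.coef_ == 0).sum()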

There are three common penalty parameters we can implement:

  1. Ridge;
  2. Lasso (or LASSO);
  3. Elastic net (or ENET), which is a combination of ridge and lasso.

Ridge penalty

Ridge regression (Hoerl and Kennard 1970) controls the estimated coefficients by adding \(\lambda \sum^p_{j=1} \beta_j^2\) to the objective function.

\[\begin{equation} \text{minimize } \left( SSE + \lambda \sum^p_{j=1} \beta_j^2 \right) \end{equation}\]

The size of this penalty, referred to as \(L^2\) (or Euclidean) norm, can take on a wide range of values, which is controlled by the tuning parameter \(\lambda\). When \(\lambda = 0\) there is no effect and our objective function equals the normal OLS regression objective function of simply minimizing SSE. However, as \(\lambda \rightarrow \infty\), the penalty becomes large and forces the coefficients toward zero (but not all the way). This is illustrated below where exemplar coefficients have been regularized with \(\lambda\) ranging from 0 to over 8,000.

Figure: Ridge regression coefficients for 15 exemplar predictor variables as \(\lambda\) grows from \(0 \rightarrow \infty\). As \(\lambda\) grows larger, our coefficient magnitudes are more constrained.
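A coefficient path like the one in the figure can be reproduced, at least in spirit, by refitting a ridge model over a grid of \(\lambda\) (alpha) values and tracking the coefficients. The sketch below uses simulated correlated features rather than the exemplar data behind the figure.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# simulated regression data with correlated features (illustration only)
X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       effective_rank=8, noise=10, random_state=123)
X = StandardScaler().fit_transform(X)

# refit the ridge model across a wide range of penalty values
lambdas = np.logspace(-2, 4, 50)
coef_path = pd.DataFrame(
    [Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas],
    index=lambdas
)

# the largest coefficient magnitude shrinks toward (but never exactly to)
# zero as the penalty grows
coef_path.abs().max(axis=1)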

Although these coefficients were scaled and centered prior to the analysis, you will notice that some are quite large when \(\lambda\) is near zero. Furthermore, you’ll notice that feature x1 has a large negative parameter that fluctuates until \(\lambda \approx 7\) where it then continuously shrinks toward zero. This is indicative of multicollinearity and likely illustrates that constraining our coefficients with \(\lambda > 7\) may reduce the variance, and therefore the error, in our predictions.

In essence, the ridge regression model pushes many of the correlated features toward each other rather than allowing for one to be wildly positive and the other wildly negative. In addition, many of the less-important features also get pushed toward zero. This helps to provide clarity in identifying the important signals in our data.

However, ridge regression does not perform feature selection and will retain all available features in the final model. Therefore, a ridge model is good if you believe there is a need to retain all features in your model yet reduce the noise that less influential variables may create (e.g., in smaller data sets with severe multicollinearity). If greater interpretation is necessary and many of the features are redundant or irrelevant then a lasso or elastic net penalty may be preferable.

Lasso penalty

The lasso (least absolute shrinkage and selection operator) penalty (Tibshirani 1996) is an alternative to the ridge penalty that requires only a small modification. The only difference is that we swap out the \(L^2\) norm for an \(L^1\) norm: \(\lambda \sum^p_{j=1} | \beta_j|\):

\[\begin{equation} \text{minimize } \left( SSE + \lambda \sum^p_{j=1} | \beta_j | \right) \end{equation}\]

Whereas the ridge penalty pushes variables to approximately but not equal to zero, the lasso penalty will actually push coefficients all the way to zero, as illustrated below. Switching to the lasso penalty not only improves the model but also conducts automated feature selection.

Figure: Lasso regression coefficients as \(\lambda\) grows from \(0 \rightarrow \infty\).

In the figure above we see that when \(\lambda < 0.01\) all 15 variables are included in the model, when \(\lambda \approx 0.5\) nine variables are retained, and when \(\log(\lambda) = 1\) only five variables are retained. Consequently, when a data set has many features, lasso can be used to identify and extract those features with the largest (and most consistent) signal.
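This feature-selection behavior can be verified with a short sketch that counts non-zero lasso coefficients across a range of penalty values; again, the data here are simulated rather than the exemplar data behind the figure.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# simulated data where only a handful of the 15 features carry signal
X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10, random_state=123)
X = StandardScaler().fit_transform(X)

# count how many coefficients remain non-zero as the penalty grows
for lam in [0.01, 1, 10, 100]:
    n_nonzero = np.sum(Lasso(alpha=lam).fit(X, y).coef_ != 0)
    print(f"lambda = {lam:>6}: {n_nonzero} non-zero coefficients")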

Elastic nets

A generalization of the ridge and lasso penalties, called the elastic net (Zou and Hastie 2005), combines the two penalties:

\[\begin{equation} \text{minimize } \left( SSE + \lambda_1 \sum^p_{j=1} \beta_j^2 + \lambda_2 \sum^p_{j=1} | \beta_j | \right) \end{equation}\]

Although lasso models perform feature selection, when two strongly correlated features are pushed towards zero, one may be pushed fully to zero while the other remains in the model. Furthermore, the process of one being in and one being out is not very systematic. In contrast, the ridge regression penalty is a little more effective in systematically handling correlated features together. Consequently, the advantage of the elastic net penalty is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty.

Figure: Elastic net coefficients as \(\lambda\) grows from \(0 \rightarrow \infty\).
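Note that scikit-learn parameterizes the elastic net with a single overall penalty strength (alpha) plus a mixing proportion (l1_ratio) rather than separate \(\lambda_1\) and \(\lambda_2\) values. The sketch below, again on simulated data, shows how a larger \(L^1\) share of the penalty tends to zero out more coefficients.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# simulated regression data with correlated features (illustration only)
X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       effective_rank=8, noise=10, random_state=123)
X = StandardScaler().fit_transform(X)

# l1_ratio close to 0 behaves like ridge, close to 1 behaves like lasso
for ratio in [0.1, 0.5, 0.9]:
    enet = ElasticNet(alpha=1.0, l1_ratio=ratio, tol=0.01).fit(X, y)
    print(f"l1_ratio = {ratio}: {np.sum(enet.coef_ == 0)} coefficients zeroed out")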

Implementation

We will start by applying a simple ridge model, but within each language section we discuss how to apply a simple lasso and elastic net as well. To simplify, we focus on only four features: Gr_Liv_Area, Year_Built, Garage_Cars, and Garage_Area, and we apply a \(\lambda\) penalty parameter equal to 1.

Since regularized methods apply a penalty to the coefficients, we need to ensure our coefficients are on a common scale. If not, then predictors with naturally larger values (e.g., total square footage) will be penalized more than predictors with naturally smaller values (e.g., total number of rooms).

The code below applies a ridge model, but you could just as easily apply a simple lasso or elastic net model with Lasso() and ElasticNet(). The alpha parameter represents the \(\lambda\) penalty parameter.

Ridge(), Lasso(), and ElasticNet() all provide a normalize parameter that you can use to normalize your features. However, this normalizes the features by subtracting the mean and dividing by the \(L^2\)-norm rather than the standard deviation. If you wish to standardize, use the StandardScaler preprocessor before calling fit on an estimator with normalize=False.

# Step 1: get features of interest
features = X_train[["Gr_Liv_Area", "Year_Built", "Garage_Cars", "Garage_Area"]]

# Step 2: standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(features)

# Step 3: create model object
ridge_mod = Ridge(alpha=1)

# Step 4: fit/train model
ridge_fit = ridge_mod.fit(X_train_scaled, y_train)
ridge_fit.coef_
## array([[39936.69788684, 24309.03872837,  7826.82312427, 14097.84660676]])

There are a few engines that allow us to apply regularized models. The most popular is glmnet. In R, the \(\lambda\) penalty parameter is represented by penalty and mixture controls the type of penalty (0 = ridge, 1 = lasso, and any value in between represents an elastic net).

# Step 1: create ridge model object
ridge_mod <- linear_reg(penalty = 1, mixture = 0) %>%
  set_engine("glmnet")

# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
    Sale_Price ~ Gr_Liv_Area + Year_Built + Garage_Cars + Garage_Area, 
    data = ames_train
  ) %>%
  step_normalize(all_predictors())
  
# Step 3: fit model workflow
ridge_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(ridge_mod) %>%
  fit(data = ames_train)

# Step 4: extract and tidy results
ridge_fit %>%
  pull_workflow_fit() %>%
  tidy()
## # A tibble: 5 x 3
##   term        estimate penalty
##   <chr>          <dbl>   <dbl>
## 1 (Intercept)  180923.       1
## 2 Gr_Liv_Area   37441.       1
## 3 Year_Built    22587.       1
## 4 Garage_Cars   10825.       1
## 5 Garage_Area   11495.       1

Tuning

There are two main tuning parameters to consider in regularized models: the strength of the regularization parameter (\(\lambda\)) and the type of penalty (ridge, lasso, elastic net).

Tuning regularization strength

First, we’ll assess how the regularization strength impacts the performance of our model. In the examples that follow we will use all the features in our Ames housing data; however, we’ll stick to the basic preprocessing of standardizing our numeric features and one-hot encoding our categorical features.

To start, we standardize and one-hot encode our features.

# create new feature set with encoded features
preprocessor = ColumnTransformer(
  remainder="passthrough",
  transformers=[
    ("scale", StandardScaler(), selector(dtype_include="number")),
    ("one-hot", OneHotEncoder(), selector(dtype_include="object"))
  ])

X_train_encoded = preprocessor.fit_transform(X_train)

Next, we create a standard Ridge() model object. This same procedure could be done using Lasso() or ElasticNet(). The last step in this code chunk is to create a tuning grid of penalty parameter values to assess. This example will look at 50 values ranging from 5.0e-06 to 5.0e+09.

# create model object
ridge_mod = Ridge()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# Create grid of hyperparameter values
hyper_grid = {'alpha': 10**np.linspace(10, -5, 50)*0.5}

Now we search for the optimal value using a 5-fold cross-validation procedure. We see that our optimal regularization strength is:

grid_search = GridSearchCV(ridge_mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)

# Optimal penalty parameter in grid search
results.best_estimator_
## Ridge(alpha=6.628556827950554)

This produces the following CV RMSE. With minimal feature engineering, this is already the best model performance we’ve obtained compared to previous modules.

# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 30085.16

In R, we’ll use penalty = tune() when we create our model object. This allows us to create a placeholder for the parameter we want to tune. We also create a tuning grid containing 50 different lambda penalty values to assess. These values range from 1.0e-10 to 1.0e+05 (the input range of -10 to 5 is log10 transformed).

Our final results are very similar to those found in the Python code chunks. We see that the optimal penalty parameter produces a 5-fold CV RMSE of about $31.6K, which is the best model performance we’ve obtained compared to previous modules.

# create linear model object
ridge_mod <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our model recipe with a tuning option for number of components
model_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# create a hyper parameter tuning grid for penalty
hyper_grid <- grid_regular(penalty(range = c(-10, 5)), levels = 50)

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(ridge_mod, model_recipe, resamples = folds, grid = hyper_grid)

# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   penalty .metric .estimator   mean     n std_err .config              
##     <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1  24421. rmse    standard   31620.     5   3051. Preprocessor1_Model48
## 2  49417. rmse    standard   31655.     5   2645. Preprocessor1_Model49
## 3  12068. rmse    standard   32049.     5   3353. Preprocessor1_Model47
## 4   5964. rmse    standard   32649.     5   3579. Preprocessor1_Model46
## 5 100000  rmse    standard   32684.     5   2121. Preprocessor1_Model50

Tuning regularization type & strength

Often, the optimal model includes a penalty that is some combination of both \(L^1\) and \(L^2\) (elastic net), thus we want to tune both the penalty strength and the type of penalty. The following performs this grid search.

In Python we’ll use ElasticNet(). ElasticNet() includes an l1_ratio parameter that can be set between 0 and 1. When l1_ratio=0 the penalty is equivalent to Ridge(), when l1_ratio=1 it is equivalent to Lasso(), and values in between perform an elastic net with the specified mix of \(L^1\) and \(L^2\) penalties (e.g., l1_ratio=0.75 uses a 75% \(L^1\) and 25% \(L^2\) penalty).

You may need to increase the tol parameter so the algorithm converges; this is a common concern for lasso models in scikit-learn. The default tolerance is 0.0001, which we increase in the model below.

# create model object
mod = ElasticNet(tol=0.01)

# Create grid of hyperparameter values
hyper_grid = {
  'alpha': 10**np.linspace(10, -5, 10)*0.5,
  'l1_ratio': (0, 0.25, 0.5, 0.75 , 1)
  }
# 5-fold CV grid search
grid_search = GridSearchCV(mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)

# Optimal penalty parameter in grid search
results.best_estimator_
## ElasticNet(alpha=23.207944168063865, l1_ratio=1, tol=0.01)
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 29552.31

In R we use tune() to tune both the penalty strength (penalty) and the combination (aka mixture) of the \(L^1\) and \(L^2\) penalties. We do so by creating a hyper grid that not only includes the penalty strength values but also 5 values of \(L^1\)/\(L^2\) combinations (0, 0.25, 0.50, 0.75, 1).

# create linear model object
mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our model recipe with a tuning option for number of components
model_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# create a hyper parameter tuning grid for penalty
hyper_grid <- grid_regular(
  penalty(range = c(-10, 5)), 
  mixture(), 
  levels = c(mixture = 5, penalty = 50)
  )

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(mod, model_recipe, resamples = folds, grid = hyper_grid)

# get best results
show_best(results)
## # A tibble: 5 x 8
##   penalty mixture .metric .estimator   mean     n std_err .config               
##     <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                 
## 1  24421.    0    rmse    standard   31992.     5   2296. Preprocessor1_Model048
## 2  49417.    0    rmse    standard   32032.     5   1969. Preprocessor1_Model049
## 3  12068.    0    rmse    standard   32390.     5   2493. Preprocessor1_Model047
## 4   2947.    0.25 rmse    standard   32487.     5   2534. Preprocessor1_Model095
## 5   1456.    0.5  rmse    standard   32692.     5   2601. Preprocessor1_Model144

Feature importance

One thing we have not talked about yet is how, once we have a final model, we determine which features are most influential in that model. This concept is known as variable/feature importance. For multiple linear regression models, importance is most often measured by the absolute value of the t-statistic for each model parameter; though simple, the results can be hard to interpret when the model includes interaction effects and complex transformations.

For regularized models, importance is determined by magnitude of the standardized coefficients. Recall that ridge, lasso, and elastic net models push non-influential features to zero (or near zero). Consequently, very small coefficient values represent features that are not very important while very large coefficient values (whether negative or positive) represent very important features.

In Python we need to manually extract the feature names. We then get the best CV model’s coefficients and combine them into a DataFrame. For now we’ll just focus on the top 25 most influential features in our model.

# get feature names
feat_names = ColumnTransformer(
  remainder="passthrough",
  transformers=[
      ("one-hot", OneHotEncoder(), selector(dtype_include="object"))
  ]).fit(X_train).get_feature_names()

# create DataFrame with feature name and coefficients from best model
coef = pd.DataFrame({'feature': feat_names,
                     'coef': results.best_estimator_.coef_})

# compute coefficient magnitude and flag whether the impact is positive or negative
coef['abs_coef'] = coef['coef'].abs()
coef['impact'] = np.where(coef['coef']>0, 'positive', 'negative')

# filter for the top 25
top_25_features = coef.nlargest(25, 'abs_coef')
top_25_features.head()
##                     feature           coef       abs_coef    impact
## 163      one-hot__x19_Stone -550233.433278  550233.433278  negative
## 327          Bsmt_Full_Bath -390163.485390  390163.485390  negative
## 121  one-hot__x15_Very_Good -117117.158839  117117.158839  negative
## 89        one-hot__x11_RRAn   99099.308207   99099.308207  positive
## 317          Year_Remod_Add   80523.104479   80523.104479  positive

We can then take this feature importance DataFrame and plot the coefficient magnitudes. We see two very influential features, both with negative impacts, and then another six strongly influential features before the magnitudes tail off considerably.

# plot feature importance
(ggplot(top_25_features, aes(x='abs_coef', 
                             y='reorder(feature, abs_coef)', 
                             color='impact', 
                             fill='impact'))
 + geom_point()
 + labs(y=None))
## <ggplot: (320018390)>

In R we can use the vip package to simplify the process of extracting feature importance.

# identify best model
lowest_rmse <- results %>%
  select_best("rmse")

# extract best model workflow
final_model <- finalize_workflow(
  workflow() %>% add_recipe(model_recipe) %>% add_model(ridge_mod), 
  lowest_rmse)

# extract feature importance for top 25 most influential features
top_25_features <- final_model %>%
  fit(ames_train) %>%
  pull_workflow_fit() %>% 
  vi(lambda = lowest_rmse$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  top_n(25, wt = Importance)

And finally we can plot the feature importance to identify those features most influential (whether positive or negative) in predicting outcomes.

ggplot(top_25_features, aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL)

Final thoughts

Regularized regression provides many great benefits over traditional GLMs when applied to large data sets with lots of features. It provides a great option for handling the \(p > n\) problem, helps minimize the impact of multicollinearity, and can perform automated feature selection. Regularized models also have relatively few hyperparameters, which makes them easy to tune, and they are computationally and memory efficient compared to other algorithms discussed in later modules.

However, similar to GLMs, regularized regression is not robust to outliers in either the features or the target. Also, regularized regression models still assume a monotonic linear relationship (always increasing or decreasing in a linear fashion). It is also up to the analyst whether or not to include specific interaction effects.

Exercises

Using the Boston housing dataset:

  1. Apply a ridge model with medv being the response variable. Perform a cross-validation procedure and tune the model across various penalty parameter values.
    • What is the minimum RMSE?
    • What is the penalty parameter value for the optimal model?
    • What are the coefficients for the optimal model?
    • Plot the top 10 most influential features. Do these features have positive or negative impacts on your response variable?
  2. Apply a lasso model with medv being the response variable. Perform a cross-validation procedure and tune the model across various penalty parameter values.
    • What is the minimum RMSE?
    • What is the penalty parameter value for the optimal model?
    • What are the coefficients for the optimal model?
    • Plot the top 10 most influential features. Do these features have positive or negative impacts on your response variable?
  3. Perform a grid search that assess combinations of penalty type (i.e. ridge, lasso, and elastic net combinations) with penalty parameter magnitude.
    • What is the optimal model’s RMSE?
    • What are the parameters (penalty type & magnitude) for the optimal model?
    • How does it compare to your previous models?
    • Plot the top 10 most influential features. Do these features have positive or negative impacts on your response variable?


References

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.
Hastie, T., R. Tibshirani, and M. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 267–88.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.