Linear regression, a staple of classical statistical modeling, is one of the simplest algorithms for doing supervised learning. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later modules, linear regression is still a useful and widely applied statistical learning method. Moreover, it serves as a good starting point for more advanced approaches; as we will see in later modules, many of the more sophisticated statistical learning approaches can be seen as generalizations or extensions of ordinary linear regression. Consequently, it is important to have a good understanding of linear regression before studying more complex learning methods. This module introduces linear regression with an emphasis on prediction, rather than inference. An excellent and comprehensive overview of linear regression is provided in Kutner et al. (2005). See Faraway (2016) for a discussion of linear regression in R (the book’s website also provides Python scripts).

Learning objectives

By the end of this module you will know how to:

  • Apply and interpret simple and multiple linear regression models.
  • Minimize the effects of collinearity and improve model performance with principal component regression and partial least squares regression models.

Prerequisites

# Helper packages
import pandas as pd
import numpy as np

# Modeling packages
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import compose
from sklearn import cross_decomposition
from sklearn import decomposition
from sklearn import model_selection
from sklearn import linear_model
from sklearn import pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics

# Modeling packages
library(tidymodels)
library(plsmod)

# Model interpretability packages
library(vip)      # variable importance
# Ames housing data
ames <- AmesHousing::make_ames()

# create training/test split
set.seed(123)
split  <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- training(split)
ames_test   <- testing(split)

Simple linear regression

Pearson’s correlation coefficient is often used to quantify the strength of the linear association between two continuous variables. In this section, we seek to fully characterize that linear relationship. Simple linear regression (SLR) assumes that the statistical relationship between two continuous variables (say \(X\) and \(Y\)) is (at least approximately) linear:

\[\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \text{for } i = 1, 2, \dots, n, \end{equation}\]

where \(Y_i\) represents the i-th response value, \(X_i\) represents the i-th feature value, \(\beta_0\) and \(\beta_1\) are fixed, but unknown constants (commonly referred to as coefficients or parameters) that represent the intercept and slope of the regression line, respectively, and \(\epsilon_i\) represents noise or random error. In this module, we’ll assume that the errors are normally distributed with mean zero and constant variance \(\sigma^2\), denoted \(\epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)\). Since the random errors are centered around zero (i.e., \(E\left(\epsilon\right) = 0\)), linear regression is really a problem of estimating a conditional mean:

\[\begin{equation} E\left(Y_i | X_i\right) = \beta_0 + \beta_1 X_i. \end{equation}\]

For brevity, we often drop the conditional piece and simply write \(E\left(Y\right)\) in place of \(E\left(Y | X\right)\). Consequently, the interpretation of the coefficients is in terms of the average, or mean, response. For example, the intercept \(\beta_0\) represents the average response value when \(X = 0\) (it is often not meaningful or of interest and is sometimes referred to as a bias term). The slope \(\beta_1\) represents the increase in the average response per one-unit increase in \(X\) (i.e., it is a rate of change).

Best fit line

Ideally, we want estimates of \(\beta_0\) and \(\beta_1\) that give us the “best fitting” line. But what is meant by “best fitting?” The most common approach is to use the method of least squares (LS) estimation; this form of linear regression is often referred to as ordinary least squares (OLS) regression. There are multiple ways to measure “best fitting,” but the LS criterion finds the “best fitting” line by minimizing the residual sum of squares (RSS):

\[\begin{equation} RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left[Y_i - \left(\beta_0 + \beta_1 X_i\right)\right]^2 = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2. \end{equation}\]

The LS estimates of \(\beta_0\) and \(\beta_1\) are denoted as \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), respectively. Once obtained, we can generate predicted values, say at \(X = X_{new}\), using the estimated regression equation:

\[\begin{equation} \widehat{Y}_{new} = \widehat{\beta}_0 + \widehat{\beta}_1 X_{new}, \end{equation}\]

where \(\widehat{Y}_{new} = \widehat{E\left(Y_{new} | X = X_{new}\right)}\) is the estimated mean response at \(X = X_{new}\).
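
For the SLR case, the LS estimates have closed-form solutions, obtained by setting the partial derivatives of the RSS with respect to \(\beta_0\) and \(\beta_1\) equal to zero and solving:

\[\begin{equation} \widehat{\beta}_1 = \frac{\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^n \left(X_i - \bar{X}\right)^2}, \quad \widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_1 \bar{X}, \end{equation}\]

where \(\bar{X}\) and \(\bar{Y}\) denote the sample means of the feature and response, respectively. In practice, of course, the software below computes these estimates for us.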

With the Ames housing data, suppose we wanted to model a linear relationship between the total above ground living space of a home (Gr_Liv_Area) and sale price (Sale_Price). We can fit an OLS regression model in Python and R with the following:

# create linear regression model object
lm_mod = linear_model.LinearRegression()

# fit linear model with only Gr_Liv_Area feature
lm_fit = lm_mod.fit(X_train[["Gr_Liv_Area"]], y_train)

# create linear regression model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# fit linear model with only Gr_Liv_Area feature
lm_fit <- 
  lm_mod %>% 
  fit(Sale_Price ~ Gr_Liv_Area, data = ames_train)

Visualizing the best fit line

The fitted model (lm_fit) is displayed in the left plot below where the points represent the values of Sale_Price in the training data. In the right plot, the vertical lines represent the individual errors, called residuals, associated with each observation. The OLS criterion identifies the “best fitting” line that minimizes the sum of squares of these residuals.

Figure: The least squares fit from regressing sale price on living space for the Ames housing data. Left: Fitted regression line. Right: Fitted regression line with vertical grey bars representing the residuals.

Coefficients

The estimated coefficients from our models are shown in the code chunks below. Recall that there are slight differences between R and Python because the train/test data splits differ between the two instances. However, the results are very similar, with the intercept (\(\widehat{\beta}_0\)) equaling around 15,800 and the coefficient for the Gr_Liv_Area variable (\(\widehat{\beta}_1\)) equaling approximately 110. To interpret, we estimate that the mean selling price of a home increases by about $110 for each additional one square foot of above ground living space. This simple description of the relationship between the sale price and square footage using a single number (i.e., the slope) is what makes linear regression such an intuitive and popular modeling tool.

# intercept
lm_fit.intercept_
## array([15770.15274496])
# coefficient for Gr_Liv_Area
lm_fit.coef_
## array([[110.54517166]])

lm_fit
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
## 
## Coefficients:
## (Intercept)  Gr_Liv_Area  
##     15938.2        109.7
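
As a quick sketch of using the fitted Python model for prediction, suppose we want the estimated mean sale price for a home with 2,000 square feet of above ground living space (an arbitrary illustrative value). Using the Python estimates above, this works out to \(\widehat{\beta}_0 + \widehat{\beta}_1 \times 2000 \approx \$237{,}000\):

# predict mean sale price for a hypothetical 2,000 sq ft home
new_home = pd.DataFrame({"Gr_Liv_Area": [2000]})
lm_fit.predict(new_home)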

Multiple linear regression

In practice, we often have more than one predictor. For example, with the Ames housing data, we may wish to understand if above ground square footage (Gr_Liv_Area) and the year the house was built (Year_Built) are (linearly) related to sale price (Sale_Price). We can extend the SLR model so that it can directly accommodate multiple predictors; this is referred to as the multiple linear regression (MLR) model. With two predictors, the MLR model becomes:

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon, \end{equation}\]

where \(X_1\) and \(X_2\) are features of interest. In our Ames housing example, \(X_1\) represents Gr_Liv_Area and \(X_2\) represents Year_Built.

Fitting an MLR

We can fit an MLR with Python and R as follows.

# create linear regression model object
lm_mod = linear_model.LinearRegression()

# fit linear model with only the Gr_Liv_Area and Year_Built features
lm_fit = lm_mod.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)

# create linear regression model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# fit linear model with only the Gr_Liv_Area and Year_Built features
lm_fit <- 
  lm_mod %>% 
  fit(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train)

Coefficients

The LS estimate of the regression coefficient for Gr_Liv_Area (\(\widehat{\beta}_1\)) is approximately 94 and for Year_Built (\(\widehat{\beta}_2\)) it is approximately 1,090 (the estimated intercept is around -2,100,000). In other words, every one square foot increase in above ground square footage is associated with an additional $94 in mean selling price, holding the year the house was built constant. Likewise, each additional year in a home’s build date is associated with an increase of approximately $1,090 in mean selling price, holding the above ground square footage constant.

Recall that the slight difference in coefficient values is due to the fact that we are using different training sets in R and Python.

# intercept
lm_fit.intercept_
## array([-2127545.21682598])
# coefficients for Gr_Liv_Area and Year_Built
lm_fit.coef_
## array([[  94.82771439, 1099.10620648]])

lm_fit
## parsnip model object
## 
## Fit time:  2ms 
## 
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area + Year_Built, data = data)
## 
## Coefficients:
## (Intercept)  Gr_Liv_Area   Year_Built  
##  -2.103e+06    9.383e+01    1.087e+03

Interactions

A contour plot of our most recent fitted regression surface is displayed in the left side of the figure below. Note how the fitted regression surface is flat (i.e., it does not twist or bend). This is true for all linear models that include only main effects (i.e., terms involving only a single predictor). One way to model curvature is to include interaction effects. An interaction occurs when the effect of one predictor on the response depends on the values of other predictors. In linear regression, interactions can be captured via products of features (i.e., \(X_1 \times X_2\)). A model with two main effects can also include a two-way interaction. For example, to include an interaction between \(X_1 =\) Gr_Liv_Area and \(X_2 =\) Year_Built, we introduce an additional product term:

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon. \end{equation}\]
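
One way to see how the product term captures interaction is to note that the slope on \(X_1\) now depends on the value of \(X_2\):

\[\begin{equation} \frac{\partial E\left(Y | X_1, X_2\right)}{\partial X_1} = \beta_1 + \beta_3 X_2, \end{equation}\]

so a one square foot increase in Gr_Liv_Area changes the mean sale price by \(\beta_1 + \beta_3 \times\) Year_Built, an amount that differs between older and newer homes.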

Figure: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The ‘best-fit’ plane minimizes the sum of squared errors between the actual sales price (individual dots) and the predicted sales price (plane).

Interaction effects are quite prevalent in predictive modeling. Since linear models are an example of parametric modeling, it is up to the analyst to decide if and when to include interaction effects. In later modules, we’ll discuss algorithms that can automatically detect and incorporate interaction effects (albeit in different ways). It is also important to understand a concept called the hierarchy principle—which demands that all lower-order terms corresponding to an interaction be retained in the model—when considering interaction effects in linear regression models.

In R and Python we can fit models with an interaction effect in the following manner:

# create linear regression model object
lm_mod = linear_model.LinearRegression()

# use PolynomialFeatures to create main Gr_Liv_Area and Year_Built effects and
# also an interaction effect between Gr_Liv_Area & Year_Built
effects = preprocessing.PolynomialFeatures(
  interaction_only=True,
  include_bias=False
  )
features = effects.fit_transform(X_train[["Gr_Liv_Area", "Year_Built"]])

# fit linear model with only the Gr_Liv_Area and Year_Built features and
# also include an interaction effect (Gr_Liv_Area:Year_Built)
lm_fit = lm_mod.fit(features, y_train)
# coefficients for Gr_Liv_Area, Year_Built effects and the interaction 
# effect between Gr_Liv_Area & Year_Built
lm_fit.coef_
## array([[-6.62106855e+02,  4.90386820e+02,  3.84056493e-01]])

# create linear regression model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# fit linear model with Gr_Liv_Area and Year_Built main effects and
# also include an interaction effect (Gr_Liv_Area:Year_Built)
lm_fit <- 
  lm_mod %>% 
  fit(
    Sale_Price ~ Gr_Liv_Area + Year_Built + Gr_Liv_Area:Year_Built, 
    data = ames_train
    )

lm_fit
## parsnip model object
## 
## Fit time:  2ms 
## 
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area + Year_Built + Gr_Liv_Area:Year_Built, 
##     data = data)
## 
## Coefficients:
##            (Intercept)             Gr_Liv_Area              Year_Built  
##             -8.105e+05              -7.285e+02               4.309e+02  
## Gr_Liv_Area:Year_Built  
##              4.168e-01

Many features

In general, we can include as many predictors as we want, as long as we have more rows than parameters! The general multiple linear regression model with p distinct predictors is

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon, \end{equation}\]

where \(X_i\) for \(i = 1, 2, \dots, p\) are the predictors of interest. Note that some of these may represent interactions between features (e.g., \(X_3 = X_1 \times X_2\)) or transformations1 (e.g., \(X_4 = \sqrt{X_1}\)) of the original features. Unfortunately, visualizing beyond three dimensions is not practical as our best-fit plane becomes a hyperplane. However, the motivation remains the same: the best-fit hyperplane is identified by minimizing the RSS.
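
For completeness, in matrix form, with \(\mathbf{y}\) the vector of responses and \(\mathbf{X}\) the \(n \times \left(p + 1\right)\) matrix whose first column is all ones, the LS estimates that minimize the RSS are

\[\begin{equation} \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}, \end{equation}\]

provided \(\mathbf{X}^\top\mathbf{X}\) is invertible. This invertibility requirement is exactly where strong relationships among the predictors (discussed later under collinearity) cause trouble.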

So say we want to run an MLR model with all the features in our Ames data set as main effects (i.e., no interaction terms) to predict Sale_Price. The following allows us to do so:

As discussed in the feature engineering module, Python does not automatically handle categorical features. Consequently, to use all the features in the Ames housing data we need to preprocess the categorical features to turn them into numeric representations. The following simply dummy encodes all non-numeric features and then uses this transformed feature set in an MLR model.

# create new feature set with categorical features dummy encoded
encoder = preprocessing.OneHotEncoder(drop='first')
cat_feat_only = compose.make_column_selector(dtype_include="object")
preprocessor = compose.ColumnTransformer(
  remainder="passthrough",
  transformers=[("one-hot", encoder, cat_feat_only)]
  )
X_train_encoded = preprocessor.fit_transform(X_train)

# MLR model with new dummy encoded feature set
lm_mod = linear_model.LinearRegression()
lm_fit = lm_mod.fit(X_train_encoded, y_train)

We can see the first 10 coefficients as follows:

lm_fit.coef_[0, 0:10]
## array([  7053.66861508,   2919.81256667, -12106.538252  ,  13728.62997568,
##         16717.06921586,  22098.14806837,  -6801.68767744,  -4196.18259742,
##         -6176.20220396,  -1275.59421824])
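
The raw coefficient array is difficult to read on its own. Assuming a newer version of scikit-learn in which ColumnTransformer provides get_feature_names_out(), one way to attach the transformed feature names to the coefficients is the following sketch:

# map transformed feature names to their estimated coefficients
feature_names = preprocessor.get_feature_names_out()
coefs = pd.Series(lm_fit.coef_.flatten(), index=feature_names)
coefs.head(10)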

In R, the shorthand notation to include all features is target_variable ~ . Behind the scenes, R will automatically dummy encode all categorical features. Since there are a lot of features, we can use tidy() to summarize the results in a more predictable and useful format (e.g., a data frame with standard column names). The “estimate” column represents the estimated coefficients.

# create linear regression model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# fit linear model with all features
lm_fit <- 
  lm_mod %>% 
  fit(Sale_Price ~ ., data = ames_train)

tidy(lm_fit)
## # A tibble: 300 x 5
##    term                                     estimate std.error statistic p.value
##    <chr>                                       <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                               -2.73e7 11016450.   -2.47    0.0134
##  2 MS_SubClassOne_Story_1945_and_Older        3.90e3     3575.    1.09    0.275 
##  3 MS_SubClassOne_Story_with_Finished_Att…   -5.39e3    12164.   -0.443   0.658 
##  4 MS_SubClassOne_and_Half_Story_Unfinish…   -4.41e2    13942.   -0.0316  0.975 
##  5 MS_SubClassOne_and_Half_Story_Finished…    1.04e3     7250.    0.143   0.886 
##  6 MS_SubClassTwo_Story_1946_and_Newer       -6.67e3     5510.   -1.21    0.226 
##  7 MS_SubClassTwo_Story_1945_and_Older        1.57e3     6074.    0.259   0.795 
##  8 MS_SubClassTwo_and_Half_Story_All_Ages     3.41e3    10149.    0.336   0.737 
##  9 MS_SubClassSplit_or_Multilevel            -6.67e3    11673.   -0.571   0.568 
## 10 MS_SubClassSplit_Foyer                     1.49e3     7512.    0.199   0.843 
## # … with 290 more rows

Assessing model accuracy

We’ve fit three main effects models to the Ames housing data: a single predictor, two predictors, and all possible predictors. But the question remains, which model is “best?” To answer this question we have to define what we mean by “best.” In our case, we’ll use the RMSE metric along with cross-validation to determine the “best” model.
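
As a reminder, the RMSE for a set of \(n\) predictions is simply

\[\begin{equation} RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n\left(Y_i - \widehat{Y}_i\right)^2}, \end{equation}\]

which, under cross-validation, is computed on each held-out fold and then averaged across folds.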

To compare the same model type across multiple different feature combinations we:

  1. create the different feature sets (we reuse the dummy encoded feature set from the earlier section),
  2. define our loss function and resampling procedure,
  3. initialize an empty object to hold our results, and
  4. iterate over each feature set and apply a cross validated linear regression model.
# feature sets to compare across
feature_set1 = X_train[["Gr_Liv_Area"]]
feature_set2 = X_train[["Gr_Liv_Area", "Year_Built"]]
feature_set3 = X_train_encoded
feature_sets = {'lm1': feature_set1, 'lm2': feature_set2, 'lm3': feature_set3}

# define loss function
loss = 'neg_root_mean_squared_error'

# create 10 fold CV object
kfold = model_selection.KFold(n_splits=10, random_state=8451, shuffle=True)

# object to store CV RMSE results
results = {}

for name, feat in feature_sets.items():
  # create LM model object
  lm_mod = linear_model.LinearRegression()

  # execute and score the cross validation procedure
  cv_results = model_selection.cross_val_score(
    estimator=lm_mod, 
    X=feat, 
    y=y_train, 
    cv=kfold, 
    scoring=loss
    )
  results[name] = np.absolute(cv_results.mean())

We can see that as we add additional features our CV RMSE decreases.

results
## {'lm1': 58191.00896279409, 'lm2': 48202.33209737469, 'lm3': 34562.00023075457}

To compare different model specifications in R we:

  1. initialize our regression model object,
  2. create our three different model recipes, and then
  3. combine them into a workflow object
# create linear regression model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# create three model recipes
lm1 <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames_train) %>% 
  step_dummy(all_nominal_predictors())
lm2 <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train) %>% 
  step_dummy(all_nominal_predictors())
lm3 <- recipe(Sale_Price ~ ., data = ames_train) %>% 
  step_dummy(all_nominal_predictors())

# combine model objects and recipes into a workflow object
preproc <- list(lm1, lm2, lm3)
models <- list(lm_mod)

model_set <- workflow_set(preproc, models, cross = TRUE)
model_set
## # A workflow set/tibble: 3 x 4
##   wflow_id            info             option      result    
##   <chr>               <list>           <list>      <list>    
## 1 recipe_1_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>
## 2 recipe_2_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>
## 3 recipe_3_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>

Next, we:

  1. create our k-fold CV object and
  2. iterate over our workflow object to execute and score the cross-validation procedure
# create our k-fold CV object
kfold <- vfold_cv(ames_train, v = 10)

# iterate over our workflow object to execute and score the cross 
# validation procedure
lm_models <- model_set %>%
  workflow_map("fit_resamples",
               seed = 8451,
               resamples = kfold)

lm_models
## # A workflow set/tibble: 3 x 4
##   wflow_id            info             option      result   
##   <chr>               <list>           <list>      <list>   
## 1 recipe_1_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
## 2 recipe_2_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
## 3 recipe_3_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>

We can now access the CV RMSE, which is in the “mean” column. We can see that as we add additional features our CV RMSE decreases.

collect_metrics(lm_models) %>% 
  filter(.metric == "rmse")
## # A tibble: 3 x 9
##   wflow_id     .config    preproc model  .metric .estimator   mean     n std_err
##   <chr>        <chr>      <chr>   <chr>  <chr>   <chr>       <dbl> <int>   <dbl>
## 1 recipe_1_li… Preproces… recipe  linea… rmse    standard   56812.    10   1568.
## 2 recipe_2_li… Preproces… recipe  linea… rmse    standard   47071.    10   1628.
## 3 recipe_3_li… Preproces… recipe  linea… rmse    standard   41008.    10   5954.

R’s workflow procedure may seem daunting at first but it provides a lot of flexibility once you become comfortable with the concept. Read more about workflows here: https://www.tmwr.org/workflows.html

Principal component regression

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in OLS, since it can be difficult to separate out the individual effects of collinear variables on the response. In fact, collinearity can cause predictor variables to appear statistically insignificant when they are in fact significant. This obviously leads to an inaccurate interpretation of coefficients and makes it difficult to identify influential predictors.

In the Ames data, for example, Garage_Area and Garage_Cars are two variables that have a correlation of 0.89 and both variables are strongly related to our response variable (Sale_Price). If we fit a model with all main effects where both of these variables are included, note the coefficient estimate and p-values below for Garage_Area and Garage_Cars.

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 Garage_Cars   3151.    1733.        1.82  0.0693
## 2 Garage_Area     11.9      5.69      2.10  0.0359

However, if we refit the full model without Garage_Area, the coefficient estimate for Garage_Cars increases nearly twofold and the p-value becomes much more significant.

## # A tibble: 1 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 Garage_Cars    5765.     1207.      4.78 0.00000194

This reflects the instability in the linear regression model caused by between-predictor relationships, and this instability gets propagated directly to the model predictions. Considering 16 of our 34 numeric predictors have a medium to strong correlation, the unstable coefficient estimates associated with these predictors are likely restricting the predictive accuracy of our model. How can we control for this problem? One option is to manually remove the offending predictors (one at a time) until all pairwise correlations are below some pre-determined threshold. However, when the number of predictors is large, as in our case, this becomes tedious. Moreover, multicollinearity can arise when one feature is linearly related to two or more features (which is more difficult to detect2). In these cases, manual removal of specific predictors may not be possible. Consequently, this section and the next offer two simple extensions of linear regression in which dimension reduction is applied prior to the regression step. A later module on regularized regression offers a modified approach that helps deal with the problem, and future modules provide alternative methods that are less affected by multicollinearity.
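
Footnote 2 mentions the variance inflation factor (VIF) as one way to detect these more general linear dependencies. As a minimal sketch, assuming the statsmodels package is installed, VIFs for the numeric features could be computed along these lines:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF regresses each numeric feature on all the others; values well above
# 5-10 are a common rule of thumb for problematic collinearity
num_X = X_train.select_dtypes("number")

vif = pd.DataFrame({
  "feature": num_X.columns,
  "VIF": [
    variance_inflation_factor(num_X.values, i)
    for i in range(num_X.shape[1])
  ]
}).sort_values("VIF", ascending=False)

vif.head()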

As mentioned in the feature engineering module, principal component analysis (PCA) can be used to represent correlated variables with a smaller number of uncorrelated features (called principal components), and the resulting components can be used as predictors in a linear regression model. This two-step process is known as principal component regression (PCR) (Massy 1965) and is illustrated below.

Figure: A depiction of the steps involved in performing principal component regression.

Implementation

Performing PCR with R and Python is an easy extension of our previous models. Often, we can greatly improve performance by using only a small subset of all principal components as predictors. For starters, we’ll just use 10 components. Note that our preprocessing will first normalize all numeric features and then perform PCA on these features. After the PCA we then one-hot encode our categorical features.

# create linear model object
lm_mod = linear_model.LinearRegression()

# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)

# define loss function
loss = 'neg_root_mean_squared_error'

# create our preprocessing steps which includes performing PCA 
# with 10 components
scaler = preprocessing.StandardScaler()
pca = decomposition.PCA(n_components=10)
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore")
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")

# combine all steps into a pre-processing pipeline
preprocessor = compose.ColumnTransformer(
  remainder="passthrough",
  transformers=[
  ("std_encode", scaler, num_feat_only),
  ("pca_encode", pca, num_feat_only),
  ("one-hot", encoder, cat_feat_only),
  ])

# create a pipeline object that combines model with recipe
model_pipeline = pipeline.Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("lm", lm_mod),
])

# train and fit our model
cv_results = model_selection.cross_val_score(
  estimator=model_pipeline, 
  X=X_train, 
  y=y_train, 
  cv=kfold, 
  scoring=loss
  )

# get results
np.absolute(cv_results.mean())
## 34221.48344930283
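
One detail worth flagging: ColumnTransformer applies its transformers in parallel to the columns they select, so the pipeline above keeps both the standardized numeric features and principal components computed from the raw (unscaled) numeric features as predictors. If the goal is instead to standardize the numeric features and then replace them with their first 10 components (mirroring the R recipe below), one option is to nest the two steps in a Pipeline for the numeric branch. A minimal sketch, whose results will differ from the output shown above:

# nest scaling and PCA so components are computed from standardized features
numeric_steps = pipeline.Pipeline(steps=[
  ("scale", preprocessing.StandardScaler()),
  ("pca", decomposition.PCA(n_components=10)),
])

preprocessor_seq = compose.ColumnTransformer(
  remainder="passthrough",
  transformers=[
    ("num", numeric_steps, num_feat_only),
    ("one-hot", encoder, cat_feat_only),
  ])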

# create linear model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our preprocessing steps which includes performing PCA 
# with 10 components
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes(), num_comp = 10) %>%
  step_dummy(all_nominal_predictors())

# create a workflow object that combines model with recipe
pca_wf <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(model_form)

# train and fit our model
set.seed(123)
pca_fit_rs <- pca_wf %>% 
  fit_resamples(folds)

# get results
collect_metrics(pca_fit_rs) %>%
  filter(.metric == "rmse")
## # A tibble: 1 x 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   38983.     5   2438. Preprocessor1_Model1

Tuning

In the above example, we simply assumed that 10 principal components were satisfactory. This is rarely the case, and the optimal number of principal components will change depending on the data set and the correlation among the predictors. However, we can often greatly improve performance by using only a small subset of all principal components as predictors. Consequently, you can think of the number of principal components as a tuning parameter. The following performs cross-validated PCR with \(1, 2, \dots, 26\) principal components, and the results below show the cross-validated RMSE. You can see a significant drop in prediction error from our previous linear models using just a few principal components.

To tune our Python model we simply include a for loop that iterates over 2, 4, 6, …, 26 components.

# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)

# define loss function
loss = 'neg_root_mean_squared_error'

# create our preprocessing steps
scaler = preprocessing.StandardScaler()
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore")
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")

# create object to save results
results = {}

# iterate over 2, 4, 6, ..., 26 components and train model
for n_comp in range(2, 28, 2):
  # create PCA object with n components
  pca = decomposition.PCA(n_components=n_comp)
  
  # combine all steps into a pre-processing pipeline
  preprocessor = compose.ColumnTransformer(
    remainder="passthrough",
    transformers=[
    ("std_encode", scaler, num_feat_only),
    ("pca_encode", pca, num_feat_only),
    ("one-hot", encoder, cat_feat_only),
    ])

  # create linear model object
  lm_mod = linear_model.LinearRegression()

  # create a pipeline object that combines model with recipe
  model_pipeline = pipeline.Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("lm", lm_mod),
  ])

  # train and fit our model
  cv_results = model_selection.cross_val_score(
    estimator=model_pipeline, 
    X=X_train, 
    y=y_train, 
    cv=kfold, 
    scoring=loss
    )

  # get results
  results[n_comp] = np.absolute(cv_results.mean())
pd.DataFrame.from_dict(
  results,
  orient='index',
  columns=['RMSE']
  ).rename_axis('n_components').reset_index()
##     n_components          RMSE
## 0              2  35677.816818
## 1              4  34675.980057
## 2              6  34293.406987
## 3              8  34350.497524
## 4             10  34334.501044
## 5             12  34108.333581
## 6             14  34124.599805
## 7             16  33941.978026
## 8             18  33836.234345
## 9             20  33422.600499
## 10            22  33318.811912
## 11            24  33288.292994
## 12            26  33257.730855
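
An alternative to the manual loop, sketched here assuming the PCR pipeline (model_pipeline) from the previous section is still in scope, is to let GridSearchCV tune the number of components directly; nested parameters are addressed as step__transformer__parameter:

# grid search over the PCA step nested inside the ColumnTransformer
pcr_grid = {"preprocessor__pca_encode__n_components": range(2, 28, 2)}

pcr_search = model_selection.GridSearchCV(
  model_pipeline,
  pcr_grid,
  cv=kfold,
  scoring=loss
  )
pcr_search.fit(X_train, y_train)

# best cross-validated RMSE and the component count that produced it
abs(pcr_search.best_score_)
pcr_search.best_params_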

To tune our R model, we’ll use tune() in place of the num_comp value in step_pca(). Our results show that multiple models, all with a very low number of components, perform about the same.

# create linear model object
lm_mod <- linear_reg() %>%  set_engine("lm")

# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our model recipe with a tuning option for number of components
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes(), num_comp = tune()) %>%
  step_dummy(all_nominal_predictors())

# create a hyper parameter tuning grid for number of components
hyper_grid <- expand.grid(num_comp = seq(1, 26, by = 2))

# train our model across the hyper parameter grid
set.seed(8451)
results <- tune_grid(lm_mod, model_form, resamples = folds, grid = hyper_grid)

# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   num_comp .metric .estimator   mean     n std_err .config              
##      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1        1 rmse    standard   38808.     5   3448. Preprocessor01_Model1
## 2        5 rmse    standard   39753.     5   3857. Preprocessor03_Model1
## 3        9 rmse    standard   39823.     5   3855. Preprocessor05_Model1
## 4        7 rmse    standard   39826.     5   3885. Preprocessor04_Model1
## 5        3 rmse    standard   39891.     5   3770. Preprocessor02_Model1

Partial least squares

It’s important to note that since PCR is a two-step process, the PCA step does not consider any aspects of the response when it selects the components. Consequently, the new predictors produced by the PCA step are not designed to maximize the relationship with the response. Instead, PCA simply seeks to reduce the variability present throughout the predictor space. If that variability happens to be related to the response variability, then PCR has a good chance of identifying a predictive relationship, as in our case. If, however, the variability in the predictor space is not related to the variability of the response, then PCR can have difficulty identifying a predictive relationship when one actually exists (i.e., we may actually experience a decrease in our predictive accuracy). An alternative approach to reduce the impact of multicollinearity is partial least squares (PLS).

PLS can be viewed as a supervised dimension reduction procedure (Kuhn and Johnson 2013). Similar to PCR, this technique also constructs a set of linear combinations of the inputs for regression, but unlike PCR it uses the response variable to aid the construction of the components, as illustrated below. Thus, we can think of PLS as a supervised dimension reduction procedure that finds new features that not only capture most of the information in the original features but are also related to the response.
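
Concretely, whereas the first principal component direction maximizes the variance of the linear combination \(\mathbf{X}\mathbf{w}\), the first PLS direction maximizes its squared covariance with the response (see Friedman, Hastie, and Tibshirani 2001 for details):

\[\begin{equation} \mathbf{w}_1 = \arg\max_{\left\lVert\mathbf{w}\right\rVert = 1} \text{Cov}\left(\mathbf{X}\mathbf{w}, \mathbf{y}\right)^2, \end{equation}\]

with subsequent directions constructed in the same way after removing (deflating) the information already captured.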

Figure: A diagram depicting the differences between PCR (left) and PLS (right). PCR finds principal components (PCs) that maximally summarize the features independent of the response variable and then uses those PCs as predictor variables. PLS finds components that simultaneously summarize variation of the predictors while being optimally correlated with the outcome and then uses those PCs as predictors.

We illustrate PLS with some exemplar data3. The plot below illustrates that the first two PCs when using PCR have very little relationship to the response variable; however, the first two PCs when using PLS have a much stronger association with the response.

Figure: Illustration showing that the first two PCs when using PCR have very little relationship to the response variable (top row); however, the first two PCs when using PLS have a much stronger association to the response (bottom row).

PLS computes the components by taking into consideration the correlation between the features and the response variable. See Friedman, Hastie, and Tibshirani (2001) and Geladi and Kowalski (1986) for a thorough discussion of PLS.

To perform PLS in Python we use PLSRegression() from the cross_decomposition module. Note that PLSRegression() only accepts a dense matrix so we need to set sparse=False when one-hot encoding (by default OneHotEncoder returns a sparse matrix). In this example we execute a hyperparameter grid search over 2-26 components. Our results show that PLS performs better than PCR.

# create linear model object
pls_mod = cross_decomposition.PLSRegression()

# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)

# define loss function
loss = 'neg_root_mean_squared_error'

# create our preprocessing steps to normalize and one-hot encode
scaler = preprocessing.StandardScaler()
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore", sparse=False)
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")

# combine all steps into a pre-processing pipeline
preprocessor = compose.ColumnTransformer(
  remainder="passthrough",
  transformers=[
    ("std_encode", scaler, num_feat_only),
    ("one-hot", encoder, cat_feat_only),
  ])

# create a pipeline object that combines model with recipe
model_pipeline = pipeline.Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("pls", pls_mod),
])

# Create grid of hyperparameter values
hyper_grid = {'pls__n_components': range(2, 28, 2)}

# Tune the PLS model using grid search
grid_search = model_selection.GridSearchCV(
  model_pipeline, 
  hyper_grid, 
  cv=kfold, 
  scoring=loss
  )
  
results = grid_search.fit(X_train, y_train)

# Best model's cross validated RMSE
abs(results.best_score_)
## 30834.84469515183

The optimal number of components that provided the lowest RMSE is:

# Optimal number of components
results.best_estimator_.get_params().get('pls__n_components')
## 4
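
If you want to inspect the cross-validated RMSE for every candidate number of components (analogous to the PCR table earlier), the full grid-search results are stored in the cv_results_ attribute; a quick sketch:

# summarize RMSE across all candidate component counts
pls_cv = pd.DataFrame(results.cv_results_)
pls_cv["RMSE"] = pls_cv["mean_test_score"].abs()
pls_cv[["param_pls__n_components", "RMSE"]]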

To perform PLS in R we use pls() from the plsmod package. Note that plsmod requires the mixOmics package to be installed. In this example we execute a hyperparameter grid search over 2-26 components. Our results show that PLS performs quite a bit better than PCR.

# create PLS model object with a tuning option for number of components
pls_mod <- plsmod::pls(num_comp = tune()) %>%  
  set_engine("mixOmics") %>%
  set_mode("regression")

# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our model recipe
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal_predictors())

# create a hyper parameter tuning grid for number of components
hyper_grid <- expand.grid(num_comp = seq(2, 26, by = 2))

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(pls_mod, model_form, resamples = folds, grid = hyper_grid)

# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   num_comp .metric .estimator   mean     n std_err .config              
##      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1        2 rmse    standard   33223.     5   2620. Preprocessor1_Model01
## 2        4 rmse    standard   33399.     5   2257. Preprocessor1_Model02
## 3       22 rmse    standard   33500.     5   4586. Preprocessor1_Model11
## 4       24 rmse    standard   33533.     5   4698. Preprocessor1_Model12
## 5       26 rmse    standard   33570.     5   4771. Preprocessor1_Model13

Final thoughts

Linear regression is usually the first supervised learning algorithm you will learn. The approach provides a solid fundamental understanding of the supervised learning task; however, as we’ve discussed there are several concerns that result from the assumptions required. Although extensions of linear regression that integrate dimension reduction steps into the algorithm can help address some of the problems with linear regression, more advanced supervised algorithms typically provide greater flexibility and improved accuracy. Nonetheless, understanding linear regression provides a foundation that will serve you well in learning these more advanced methods.

Exercises

Using the Boston housing data set where the response feature is the median value of homes within a census tract (cmedv):

  1. Pick a single feature and apply a simple linear regression model.
    • Interpret the feature’s coefficient
    • What is the model’s performance? How does it compare to the KNN in the last module?
  2. Pick another feature to add to the model.
    • Before applying the model why do you think this feature will help?
    • Apply a linear regression model with the two features and compare to the simple linear model.
    • Interpret the coefficients.
  3. Now apply a model that includes all the predictors.
    • How does this model compare to the previous two?
  4. Can you identify any model concerns?
  5. Apply a principal component regression model.
    • Perform a grid search over several components.
    • Identify and explain the performance of the optimal model.
  6. Now apply a partial least square model.
    • Perform a grid search over several components.
    • Identify and explain the performance of the optimal model.


References

Faraway, Julian J. 2016. Linear Models with R. Chapman & Hall/CRC.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York, NY: Springer.
Geladi, Paul, and Bruce R Kowalski. 1986. “Partial Least-Squares Regression: A Tutorial.” Analytica Chimica Acta 185: 1–17.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
———. 2018. AppliedPredictiveModeling: Functions and Data Sets for ’Applied Predictive Modeling’. https://CRAN.R-project.org/package=AppliedPredictiveModeling.
Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models. 5th ed. McGraw Hill.
Massy, William F. 1965. “Principal Components Regression in Exploratory Statistical Research.” Journal of the American Statistical Association 60 (309): 234–56.

  1. Transformations of the features serve a number of purposes (e.g., modeling nonlinear relationships or alleviating departures from common regression assumptions). See Kutner et al. (2005) for details.↩︎

  2. In such cases we can use a statistic called the variance inflation factor, which tries to capture how strongly each feature is linearly related to all the other predictors in a model.↩︎

  3. This is actually using the solubility data that is provided by the AppliedPredictiveModeling package (Kuhn and Johnson 2018).↩︎