Linear regression, a staple of classical statistical modeling, is one of the simplest algorithms for doing supervised learning. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later modules, linear regression is still a useful and widely applied statistical learning method. Moreover, it serves as a good starting point for more advanced approaches; as we will see in later modules, many of the more sophisticated statistical learning approaches can be seen as generalizations to or extensions of ordinary linear regression. Consequently, it is important to have a good understanding of linear regression before studying more complex learning methods. This module introduces linear regression with an emphasis on prediction, rather than inference. An excellent and comprehensive overview of linear regression is provided in Kutner et al. (2005). See Faraway (2016) for a discussion of linear regression in R (the book’s website also provides Python scripts).
By the end of this module you will know how to:
# Helper packages
import pandas as pd
import numpy as np
# Modeling packages
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import compose
from sklearn import cross_decomposition
from sklearn import decomposition
from sklearn import model_selection
from sklearn import linear_model
from sklearn import pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]
# Helper packages
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
# Modeling packages
library(tidymodels)
library(plsmod)
# Model interpretability packages
library(vip) # variable importance
# Ames housing data
ames <- AmesHousing::make_ames()
# create training/test split
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
Pearson’s correlation coefficient is often used to quantify the strength of the linear association between two continuous variables. In this section, we seek to fully characterize that linear relationship. Simple linear regression (SLR) assumes that the statistical relationship between two continuous variables (say \(X\) and \(Y\)) is (at least approximately) linear:
\[\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \text{for } i = 1, 2, \dots, n, \end{equation}\]
where \(Y_i\) represents the i-th response value, \(X_i\) represents the i-th feature value, \(\beta_0\) and \(\beta_1\) are fixed, but unknown constants (commonly referred to as coefficients or parameters) that represent the intercept and slope of the regression line, respectively, and \(\epsilon_i\) represents noise or random error. In this module, we’ll assume that the errors are normally distributed with mean zero and constant variance \(\sigma^2\), denoted \(\stackrel{iid}{\sim} \left(0, \sigma^2\right)\). Since the random errors are centered around zero (i.e., \(E\left(\epsilon\right) = 0\)), linear regression is really a problem of estimating a conditional mean:
\[\begin{equation} E\left(Y_i | X_i\right) = \beta_0 + \beta_1 X_i. \end{equation}\]
For brevity, we often drop the conditional piece and write \(E\left(Y | X\right) = E\left(Y\right)\). Consequently, the interpretation of the coefficients is in terms of the average, or mean response. For example, the intercept \(\beta_0\) represents the average response value when \(X = 0\) (it is often not meaningful or of interest and is sometimes referred to as a bias term). The slope \(\beta_1\) represents the increase in the average response per one-unit increase in \(X\) (i.e., it is a rate of change).
Ideally, we want estimates of \(\beta_0\) and \(\beta_1\) that give us the “best fitting” line. But what is meant by “best fitting?” The most common approach is to use the method of least squares (LS) estimation; this form of linear regression is often referred to as ordinary least squares (OLS) regression. There are multiple ways to measure “best fitting,” but the LS criterion finds the “best fitting” line by minimizing the residual sum of squares (RSS):
\[\begin{equation} RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left[Y_i - \left(\beta_0 + \beta_1 X_i\right)\right]^2 = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2. \end{equation}\]
The LS estimates of \(\beta_0\) and \(\beta_1\) are denoted as \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), respectively. Once obtained, we can generate predicted values, say at \(X = X_{new}\), using the estimated regression equation:
\[\begin{equation} \widehat{Y}_{new} = \widehat{\beta}_0 + \widehat{\beta}_1 X_{new}, \end{equation}\]
where \(\widehat{Y}_{new} = \widehat{E\left(Y_{new} | X = X_{new}\right)}\) is the estimated mean response at \(X = X_{new}\).
With the Ames housing data, suppose we wanted to model a linear relationship between the total above ground living space of a home (Gr_Liv_Area
) and sale price (Sale_Price
). We can perform an OLS regression model in Python and R with the following:
# create linear regression model object
lm_mod = linear_model.LinearRegression()
# fit linear model with only Gr_Liv_Area feature
lm_fit = lm_mod.fit(X_train[["Gr_Liv_Area"]], y_train)
# create linear regression model object
lm_mod <- linear_reg() %>% set_engine("lm")
# fit linear model with only Gr_Liv_Area feature
lm_fit <-
lm_mod %>%
fit(Sale_Price ~ Gr_Liv_Area, data = ames_train)
The fitted model (lm_fit
) is displayed in the left plot below where the points represent the values of Sale_Price
in the training data. In the right plot, the vertical lines represent the individual errors, called residuals, associated with each observation. The OLS criterion identifies the “best fitting” line that minimizes the sum of squares of these residuals.
The estimated coefficients from our models are shown in the code chunks below. Recall that there are slight differences between R & Python because the test/ & train data splits differ between the two instances. However, the results are very similar with our intercept ( \(\widehat{\beta}_0\) ) equaling around 15,800 and the coefficient for the Gr_Liv_Area
variable ( \(\widehat{\beta}_1\) ) equaling approximately 110. To interpret, we estimate that the mean selling price of a home increases by about $110 for each additional one square foot of above ground living space. This simple description of the relationship between the sale price and square footage using a single number (i.e. the slope) is what makes linear regression such an intuitive and popular modeling tool.
# intercept
lm_fit.intercept_
## array([15770.15274496])
# coefficient for Gr_Liv_Area
lm_fit.coef_
## array([[110.54517166]])
lm_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
##
## Coefficients:
## (Intercept) Gr_Liv_Area
## 15938.2 109.7
In practice, we often have more than one predictor. For example, with the Ames housing data, we may wish to understand if above ground square footage (Gr_Liv_Area
) and the year the house was built (Year_Built
) are (linearly) related to sale price (Sale_Price
). We can extend the SLR model so that it can directly accommodate multiple predictors; this is referred to as the multiple linear regression (MLR) model. With two predictors, the MLR model becomes:
\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon, \end{equation}\]
where \(X_1\) and \(X_2\) are features of interest. In our Ames housing example, \(X_1\) represents Gr_Liv_Area
and \(X_2\) represents Year_Built
.
We can fit an MLR with Python and R as follows.
# create linear regression model object
lm_mod = linear_model.LinearRegression()
# fit linear model with only Gr_Liv_Area and Year_Built feature
lm_fit = lm_mod.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)
# create linear regression model object
lm_mod <- linear_reg() %>% set_engine("lm")
# fit linear model with only Gr_Liv_Area and Year_Built feature
lm_fit <-
lm_mod %>%
fit(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train)
The LS estimates of the regression coefficients for Gr_Liv_Area
(\(\widehat{\beta}_1\)) is approximately 94 and for Year_Built
(\(\widehat{\beta}_2\)) it is approximately 1090 (the estimated intercept is around -2100000. In other words, every one square foot increase to above ground square footage is associated with an additional $94 in mean selling price when holding the year the house was built constant. Likewise, for every year newer a home is there is approximately an increase of $1,090 in selling price when holding the above ground square footage constant.
Recall the slight difference in coefficient values is due to the fact that we are using a different training set between R and Python.
# intercept
lm_fit.intercept_
## array([-2127545.21682598])
# coefficients for Gr_Liv_Area and Year_Built
lm_fit.coef_
## array([[ 94.82771439, 1099.10620648]])
lm_fit
## parsnip model object
##
## Fit time: 2ms
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area + Year_Built, data = data)
##
## Coefficients:
## (Intercept) Gr_Liv_Area Year_Built
## -2.103e+06 9.383e+01 1.087e+03
A contour plot of our most recent fitted regression surface is displayed in the left side of the figure below. Note how the fitted regression surface is flat (i.e., it does not twist or bend). This is true for all linear models that include only main effects (i.e., terms involving only a single predictor). One way to model curvature is to include interaction effects. An interaction occurs when the effect of one predictor on the response depends on the values of other predictors. In linear regression, interactions can be captured via products of features (i.e., \(X_1 \times X_2\)). A model with two main effects can also include a two-way interaction. For example, to include an interaction between \(X_1 =\) Gr_Liv_Area
and \(X_2 =\) Year_Built
, we introduce an additional product term:
\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon. \end{equation}\]
Interaction effects are quite prevalent in predictive modeling. Since linear models are an example of parametric modeling, it is up to the analyst to decide if and when to include interaction effects. In later modules, we’ll discuss algorithms that can automatically detect and incorporate interaction effects (albeit in different ways). It is also important to understand a concept called the hierarchy principle—which demands that all lower-order terms corresponding to an interaction be retained in the model—when considering interaction effects in linear regression models.
In R and Python we can fit models with an interaction effect in the following manner:
# create linear regression model object
lm_mod = linear_model.LinearRegression()
# use PolynomialFeatures to create main Gr_Liv_Area and Year_Built effects and
# also an interaction effect between Gr_Liv_Area & Year_Built
effects = preprocessing.PolynomialFeatures(
interaction_only=True,
include_bias=False
)
features = effects.fit_transform(X_train[["Gr_Liv_Area", "Year_Built"]])
# fit linear model with only Gr_Liv_Area and Year_Built feature and
# also include an interaction effect (Gr_Liv_Area:Year_Built)
lm_fit = lm_mod.fit(features, y_train)
# coefficients for Gr_Liv_Area, Year_Built effects and the interaction
# effect between Gr_Liv_Area & Year_Built
lm_fit.coef_
## array([[-6.62106855e+02, 4.90386820e+02, 3.84056493e-01]])
# create linear regression model object
lm_mod <- linear_reg() %>% set_engine("lm")
# fit linear model with Gr_Liv_Area and Year_Built main effects and
# also include an interaction effect (Gr_Liv_Area:Year_Built)
lm_fit <-
lm_mod %>%
fit(
Sale_Price ~ Gr_Liv_Area + Year_Built + Gr_Liv_Area:Year_Built,
data = ames_train
)
lm_fit
## parsnip model object
##
## Fit time: 2ms
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area + Year_Built + Gr_Liv_Area:Year_Built,
## data = data)
##
## Coefficients:
## (Intercept) Gr_Liv_Area Year_Built
## -8.105e+05 -7.285e+02 4.309e+02
## Gr_Liv_Area:Year_Built
## 4.168e-01
In general, we can include as many predictors as we want, as long as we have more rows than parameters! The general multiple linear regression model with p distinct predictors is
\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon, \end{equation}\]
where \(X_i\) for \(i = 1, 2, \dots, p\) are the predictors of interest. Note some of these may represent interactions (e.g., \(X_3 = X_1 \times X_2\)) between our transformations1 (e.g., \(X_4 = \sqrt{X_1}\)) of the original features. Unfortunately, visualizing beyond three dimensions is not practical as our best-fit plane becomes a hyperplane. However, the motivation remains the same where the best-fit hyperplane is identified by minimizing the RSS.
So say we want to run an MLR model with all the features in our Ames data set as main effects (i.e., no interaction terms) to predict Sale_Price
. The following allows us to do so:
As discussed in the feature engineering module, Python does not automatically handle categorical features. Consequently, to use all the features in the Ames housing data we need to preprocess the categorical features to turn them into numeric representations. The following simply dummy encodes all non-numeric features and then uses this transformed feature set in an MLR model.
# create new feature set with categorical features dummy encoded
encoder = preprocessing.OneHotEncoder(drop='first')
cat_feat_only = compose.make_column_selector(dtype_include="object")
preprocessor = compose.ColumnTransformer(
remainder="passthrough",
transformers=[("one-hot", encoder, cat_feat_only)]
)
X_train_encoded = preprocessor.fit_transform(X_train)
# MLR model with new dummy encoded feature set
lm_mod = linear_model.LinearRegression()
lm_fit = lm_mod.fit(X_train_encoded, y_train)
We can see the first 10 coefficients as follows:
lm_fit.coef_[0, 0:10]
## array([ 7053.66861508, 2919.81256667, -12106.538252 , 13728.62997568,
## 16717.06921586, 22098.14806837, -6801.68767744, -4196.18259742,
## -6176.20220396, -1275.59421824])
In R, the shorthand notation to include all features is target_variable ~ .
. Behind the scenes, R will automatically dummy encode all categorical features. Since there are a lot of features we can use tidy
to summarize the results in a more predictable and useful format (e.g. a data frame with standard column names). The “estimate” column represents the estimated coefficients.
# create linear regression model object
lm_mod <- linear_reg() %>% set_engine("lm")
# fit linear model with all features
lm_fit <-
lm_mod %>%
fit(Sale_Price ~ ., data = ames_train)
tidy(lm_fit)
## # A tibble: 300 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.73e7 11016450. -2.47 0.0134
## 2 MS_SubClassOne_Story_1945_and_Older 3.90e3 3575. 1.09 0.275
## 3 MS_SubClassOne_Story_with_Finished_Att… -5.39e3 12164. -0.443 0.658
## 4 MS_SubClassOne_and_Half_Story_Unfinish… -4.41e2 13942. -0.0316 0.975
## 5 MS_SubClassOne_and_Half_Story_Finished… 1.04e3 7250. 0.143 0.886
## 6 MS_SubClassTwo_Story_1946_and_Newer -6.67e3 5510. -1.21 0.226
## 7 MS_SubClassTwo_Story_1945_and_Older 1.57e3 6074. 0.259 0.795
## 8 MS_SubClassTwo_and_Half_Story_All_Ages 3.41e3 10149. 0.336 0.737
## 9 MS_SubClassSplit_or_Multilevel -6.67e3 11673. -0.571 0.568
## 10 MS_SubClassSplit_Foyer 1.49e3 7512. 0.199 0.843
## # … with 290 more rows
We’ve fit three main effects models to the Ames housing data: a single predictor, two predictors, and all possible predictors. But the question remains, which model is “best?” To answer this question we have to define what we mean by “best.” In our case, we’ll use the RMSE metric along with cross-validation to determine the “best” model.
To compare the same model type across multiple different feature combinations we:
# feature sets to compare across
feature_set1 = X_train[["Gr_Liv_Area"]]
feature_set2 = X_train[["Gr_Liv_Area", "Year_Built"]]
feature_set3 = X_train_encoded
feature_sets = {'lm1': feature_set1, 'lm2': feature_set2, 'lm3': feature_set3}
# define loss function
loss = 'neg_root_mean_squared_error'
# create 10 fold CV object
kfold = model_selection.KFold(n_splits=10, random_state=8451, shuffle=True)
# object to store CV RMSE results
results = {}
for name, feat in feature_sets.items():
# create LM model object
lm_mod = linear_model.LinearRegression()
# execute and score the cross validation procedure
cv_results = model_selection.cross_val_score(
estimator=lm_mod,
X=feat,
y=y_train,
cv=kfold,
scoring=loss
)
results[name] = np.absolute(cv_results.mean())
We can see that as we add additional features our CV RMSE decreases.
results
## {'lm1': 58191.00896279409, 'lm2': 48202.33209737469, 'lm3': 34562.00023075457}
To compare different model specifications in R we:
# create linear regression model object
lm_mod <- linear_reg() %>% set_engine("lm")
# create three model recipes
lm1 <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames_train) %>%
step_dummy(all_nominal_predictors())
lm2 <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train) %>%
step_dummy(all_nominal_predictors())
lm3 <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_dummy(all_nominal_predictors())
# combine model objects and recipes into a workflow object
preproc <- list(lm1, lm2, lm3)
models <- list(lm_mod)
model_set <- workflow_set(preproc, models, cross = TRUE)
model_set
## # A workflow set/tibble: 3 x 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 recipe_1_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>
## 2 recipe_2_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>
## 3 recipe_3_linear_reg <tibble [1 × 4]> <wrkflw__ > <list [0]>
# create our k-fold CV object
kfold <- vfold_cv(ames_train, v = 10)
# iterate over our workflow object to execute and score the cross
# validation procedure
lm_models <- model_set %>%
workflow_map("fit_resamples",
seed = 8451,
resamples = kfold)
lm_models
## # A workflow set/tibble: 3 x 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 recipe_1_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
## 2 recipe_2_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
## 3 recipe_3_linear_reg <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
We can now access the CV RMSE, which is in the “mean” column. We can see that as we add additional features our CV RMSE decreases.
collect_metrics(lm_models) %>%
filter(.metric == "rmse")
## # A tibble: 3 x 9
## wflow_id .config preproc model .metric .estimator mean n std_err
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 recipe_1_li… Preproces… recipe linea… rmse standard 56812. 10 1568.
## 2 recipe_2_li… Preproces… recipe linea… rmse standard 47071. 10 1628.
## 3 recipe_3_li… Preproces… recipe linea… rmse standard 41008. 10 5954.
R’s workflow procedure may seem daunting at first but it provides a lot of flexibility once you become comfortable with the concept. Read more about workflows here: https://www.tmwr.org/workflows.html
Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in OLS, since it can be difficult to separate out the individual effects of collinear variables on the response. In fact, collinearity can cause predictor variables to appear as statistically insignificant when in fact they are significant. This obviously leads to an inaccurate interpretation of coefficients and makes it difficult to identify influential predictors.
In the Ames data, for example, Garage_Area
and Garage_Cars
are two variables that have a correlation of 0.89 and both variables are strongly related to our response variable (Sale_Price
). If we fit a model with all main effects where both of these variables are included, note the coefficient estimate and p-values below for Garage_Area
and Garage_Cars
.
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Garage_Cars 3151. 1733. 1.82 0.0693
## 2 Garage_Area 11.9 5.69 2.10 0.0359
However, if we refit the full model without Garage_Area
, the coefficient estimate for Garage_Car
increases nearly two fold and the p-value becomes much more significant.
## # A tibble: 1 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Garage_Cars 5765. 1207. 4.78 0.00000194
This reflects the instability in the linear regression model caused by between-predictor relationships; this instability also gets propagated directly to the model predictions. Considering 16 of our 34 numeric predictors have a medium to strong correlation, the biased coefficients of these predictors are likely restricting the predictive accuracy of our model. How can we control for this problem? One option is to manually remove the offending predictors (one-at-a-time) until all pairwise correlations are below some pre-determined threshold. However, when the number of predictors is large such as in our case, this becomes tedious. Moreover, multicollinearity can arise when one feature is linearly related to two or more features (which is more difficult to detect2). In these cases, manual removal of specific predictors may not be possible. Consequently, this section and the next offer two simple extensions of linear regression where dimension reduction is applied prior to performing linear regression. A later module on regularized regression offers a modified approach that helps to deal with the problem. And future modules provide alternative methods that are less affected by multicollinearity.
As mentioned in the feature engineering module principal components analysis can be used to represent correlated variables with a smaller number of uncorrelated features (called principal components) and the resulting components can be used as predictors in a linear regression model. This two-step process is known as principal component regression (PCR) (Massy 1965) and is illustrated below.
Performing PCR with R and Python is an easy extension from our previous models. Often, we can greatly improve performance by only using a small subset of all principal components as predictors. For starters, we’ll just use 10 components. Note that our preprocessing will first normalize all numeric features and then perform PCA on these features. After the PCA we then one-hot encode our categorical features.
# create linear model object
lm_mod = linear_model.LinearRegression()
# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)
# define loss function
loss = 'neg_root_mean_squared_error'
# create our preprocessing steps which includes performing PCA
# with 10 components
scaler = preprocessing.StandardScaler()
pca = decomposition.PCA(n_components=10)
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore")
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")
# combine all steps into a pre-processing pipeline
preprocessor = compose.ColumnTransformer(
remainder="passthrough",
transformers=[
("std_encode", scaler, num_feat_only),
("pca_encode", pca, num_feat_only),
("one-hot", encoder, cat_feat_only),
])
# create a pipeline object that combines model with recipe
model_pipeline = pipeline.Pipeline(steps=[
("preprocessor", preprocessor),
("lm", lm_mod),
])
# train and fit our model
cv_results = model_selection.cross_val_score(
estimator=model_pipeline,
X=X_train,
y=y_train,
cv=kfold,
scoring=loss
)
# get results
np.absolute(cv_results.mean())
## 34221.48344930283
# create linear model object
lm_mod <- linear_reg() %>% set_engine("lm")
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our preprocessing steps which includes performing PCA
# with 10 components
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_pca(all_numeric(), -all_outcomes(), num_comp = 10) %>%
step_dummy(all_nominal_predictors())
# create a workflow object that combines model with recipe
pca_wf <- workflow() %>%
add_model(lm_mod) %>%
add_recipe(model_form)
# train and fit our model
set.seed(123)
pca_fit_rs <- pca_wf %>%
fit_resamples(folds)
# get results
collect_metrics(pca_fit_rs) %>%
filter(.metric == "rmse")
## # A tibble: 1 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 38983. 5 2438. Preprocessor1_Model1
In the above example, we simply assumed 10 principal components was satisfactory. This is rarely the case and the optimal number of principal components to use will change depending on the data set and correlation amongst the predictors. However, often, we can greatly improve performance by only using a small subset of all principal components as predictors. Consequently, you can think of the number of principal components as a tuning parameter. The following performs cross-validated PCR with \(1, 2, \dots, 26\) principal components, and the plot below illustrates the cross-validated RMSE. You can see a significant drop in prediction error from our previous linear models using just a few principal components.
To tune our Python model we simply include a for
loop that iterates over 2, 4, 6, …, 26 components.
# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)
# define loss function
loss = 'neg_root_mean_squared_error'
# create our preprocessing steps
scaler = preprocessing.StandardScaler()
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore")
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")
# create object to save results
results = {}
# iterate over over 2, 4, 6, ..., 26 components and train model
for n_comp in range(2, 28, 2):
# create PCA object with n components
pca = decomposition.PCA(n_components=n_comp)
# combine all steps into a pre-processing pipeline
preprocessor = compose.ColumnTransformer(
remainder="passthrough",
transformers=[
("std_encode", scaler, num_feat_only),
("pca_encode", pca, num_feat_only),
("one-hot", encoder, cat_feat_only),
])
# create linear model object
lm_mod = linear_model.LinearRegression()
# create a pipeline object that combines model with recipe
model_pipeline = pipeline.Pipeline(steps=[
("preprocessor", preprocessor),
("lm", lm_mod),
])
# train and fit our model
cv_results = model_selection.cross_val_score(
estimator=model_pipeline,
X=X_train,
y=y_train,
cv=kfold,
scoring=loss
)
# get results
results[n_comp] = np.absolute(cv_results.mean())
pd.DataFrame.from_dict(
results,
orient='index',
columns=['RMSE']
).rename_axis('n_components').reset_index()
## n_components RMSE
## 0 2 35677.816818
## 1 4 34675.980057
## 2 6 34293.406987
## 3 8 34350.497524
## 4 10 34334.501044
## 5 12 34108.333581
## 6 14 34124.599805
## 7 16 33941.978026
## 8 18 33836.234345
## 9 20 33422.600499
## 10 22 33318.811912
## 11 24 33288.292994
## 12 26 33257.730855
To tune our R model, we’ll use tune()
in place of the num_comp
parameter in step_pca()
. Our results show that multiple models, all with very low number of components, perform relatively the same.
# create linear model object
lm_mod <- linear_reg() %>% set_engine("lm")
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our model recipe with a tuning option for number of components
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_pca(all_numeric(), -all_outcomes(), num_comp = tune()) %>%
step_dummy(all_nominal_predictors())
# create a hyper parameter tuning grid for number of components
hyper_grid <- expand.grid(num_comp = seq(1, 26, by = 2))
# train our model across the hyper parameter grid
set.seed(8451)
results <- tune_grid(lm_mod, model_form, resamples = folds, grid = hyper_grid)
# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 38808. 5 3448. Preprocessor01_Model1
## 2 5 rmse standard 39753. 5 3857. Preprocessor03_Model1
## 3 9 rmse standard 39823. 5 3855. Preprocessor05_Model1
## 4 7 rmse standard 39826. 5 3885. Preprocessor04_Model1
## 5 3 rmse standard 39891. 5 3770. Preprocessor02_Model1
It’s important to note that since PCR is a two step process, the PCA step does not consider any aspects of the response when it selects the components. Consequently, the new predictors produced by the PCA step are not designed to maximize the relationship with the response. Instead, it simply seeks to reduce the variability present throughout the predictor space. If that variability happens to be related to the response variability, then PCR has a good chance to identify a predictive relationship, as in our case. If, however, the variability in the predictor space is not related to the variability of the response, then PCR can have difficulty identifying a predictive relationship when one might actually exists (i.e., we may actually experience a decrease in our predictive accuracy). An alternative approach to reduce the impact of multicollinearity is partial least squares (PLS).
PLS can be viewed as a supervised dimension reduction procedure (Kuhn and Johnson 2013). Similar to PCR, this technique also constructs a set of linear combinations of the inputs for regression, but unlike PCR it uses the response variable to aid the construction of the principal components as illustrated below. Thus, we can think of PLS as a supervised dimension reduction procedure that finds new features that not only captures most of the information in the original features, but also are related to the response.
We illustrate PLS with some exemplar data3. The below plot illustrates that the first two PCs when using PCR have very little relationship to the response variable; however, the first two PCs when using PLS have a much stronger association to the response.
PLS will compute the principals by taking into consideration the correlation between the features and the response variable. See Friedman, Hastie, and Tibshirani (2001) and Geladi and Kowalski (1986) for a thorough discussion of PLS.
To perform PLS in Python we use PLSRegression()
from the cross_decomposition
module. Note that PLSRegression()
only accepts a dense matrix so we need to set sparse=False
when one-hot encoding (by default OneHotEncoder
returns a sparse matrix). In this example we execute a hyperparameter grid search over 2-26 components. Our results show that PLS performs better than PCR.
# create linear model object
pls_mod = cross_decomposition.PLSRegression()
# create k-fold cross validation object
kfold = model_selection.KFold(n_splits=5, random_state=8451, shuffle=True)
# define loss function
loss = 'neg_root_mean_squared_error'
# create our preprocessing steps to normalize and one-hot encode
scaler = preprocessing.StandardScaler()
encoder = preprocessing.OneHotEncoder(handle_unknown="ignore", sparse=False)
num_feat_only = compose.make_column_selector(dtype_include="number")
cat_feat_only = compose.make_column_selector(dtype_include="object")
# combine all steps into a pre-processing pipeline
preprocessor = compose.ColumnTransformer(
remainder="passthrough",
transformers=[
("std_encode", scaler, num_feat_only),
("one-hot", encoder, cat_feat_only),
])
# create a pipeline object that combines model with recipe
model_pipeline = pipeline.Pipeline(steps=[
("preprocessor", preprocessor),
("pls", pls_mod),
])
# Create grid of hyperparameter values
hyper_grid = {'pls__n_components': range(2, 28, 2)}
# Tune a knn model using grid search
grid_search = model_selection.GridSearchCV(
model_pipeline,
hyper_grid,
cv=kfold,
scoring=loss
)
results = grid_search.fit(X_train, y_train)
# Best model's cross validated RMSE
abs(results.best_score_)
## 30834.84469515183
The optimal number of components that provided the lowest RMSE is:
# Optimal number of components
results.best_estimator_.get_params().get('pls__n_components')
## 4
To perform PLS in R we use pls()
from the plsmod
package. Note that plsmod
requires the mixOmics
package to be installed. In this example we execute a hyperparameter grid search over 2-26 components. Our results show that PLS performs quite a bit better than PCR.
# create PLS model object with a tuning option for number of components
pls_mod <- plsmod::pls(num_comp = tune()) %>%
set_engine("mixOmics") %>%
set_mode("regression")
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our model recipe
model_form <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal_predictors())
# create a hyper parameter tuning grid for number of components
hyper_grid <- expand.grid(num_comp = seq(2, 26, by = 2))
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(pls_mod, model_form, resamples = folds, grid = hyper_grid)
# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 33223. 5 2620. Preprocessor1_Model01
## 2 4 rmse standard 33399. 5 2257. Preprocessor1_Model02
## 3 22 rmse standard 33500. 5 4586. Preprocessor1_Model11
## 4 24 rmse standard 33533. 5 4698. Preprocessor1_Model12
## 5 26 rmse standard 33570. 5 4771. Preprocessor1_Model13
Linear regression is usually the first supervised learning algorithm you will learn. The approach provides a solid fundamental understanding of the supervised learning task; however, as we’ve discussed there are several concerns that result from the assumptions required. Although extensions of linear regression that integrate dimension reduction steps into the algorithm can help address some of the problems with linear regression, more advanced supervised algorithms typically provide greater flexibility and improved accuracy. Nonetheless, understanding linear regression provides a foundation that will serve you well in learning these more advanced methods.
Using the Boston housing data set where the response feature is the median value of homes within a census tract (cmedv
):
Transformations of the features serve a number of purposes (e.g., modeling nonlinear relationships or alleviating departures from common regression assumptions). See Kutner et al. (2005) for details.↩︎
In such cases we can use a statistic called the variance inflation factor which tries to capture how strongly each feature is linearly related to all the others predictors in a model.↩︎
This is actually using the solubility data that is provided by the AppliedPredictiveModeling package (Kuhn and Johnson 2018).↩︎