The previous modules discussed algorithms that are intrinsically linear. Many of these models can be adapted to nonlinear patterns in the data by manually adding nonlinear model terms (e.g., squared terms, interaction effects, and other transformations of the original features); however, to do so you the analyst must know the specific nature of the nonlinearities and interactions a priori. Alternatively, there are numerous algorithms that are inherently nonlinear. When using these models, the exact form of the nonlinearity does not need to be known explicitly or specified prior to model training. Rather, these algorithms will search for, and discover, nonlinearities and interactions in the data that help maximize predictive accuracy.
This module discusses multivariate adaptive regression splines (MARS) (Friedman 1991), an algorithm that automatically creates a piecewise linear model which provides an intuitive stepping block into nonlinearity after grasping the concept of multiple linear regression. Future modules will focus on other nonlinear algorithms.
By the end of this module you will know how:
# Helper packages
import numpy as np
import pandas as pd
from plotnine import *
# Modeling packages
from pyearth import Earth
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.inspection import partial_dependence
# Ames housing data
ames = pd.read_csv("data/ames.csv")
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]
# Helper packages
library(tidyverse) # for data wrangling & plotting
# Modeling packages
library(tidymodels) # for fitting MARS models
# Model interpretability packages
library(vip) # for variable importance
library(pdp) # for variable relationships
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test <- rsample::testing(split)
In the previous modules, we focused on linear models (where the analyst has to explicitly specify any nonlinear relationships and interaction effects). We illustrated some of the advantages of linear models such as their ease and speed of computation and also the intuitive nature of interpreting their coefficients. However, linear models make a strong assumption about linearity, and this assumption is often a poor one, which can affect predictive accuracy.
We can extend linear models to capture any non-linear relationship. Typically, this is done by explicitly including polynomial terms (e.g., \(x_i^2\)) or step functions. Polynomial regression is a form of regression in which the relationship between \(X\) and \(Y\) is modeled as a \(d\)th degree polynomial in \(X\). For example, the following equation represents a polynomial regression function where \(Y\) is modeled as a \(d\)-th degree polynomial in \(X\). Generally speaking, it is unusual to use \(d\) greater than 3 or 4 as the larger \(d\) becomes, the easier the function fit becomes overly flexible and oddly shaped…especially near the boundaries of the range of \(X\) values. Increasing \(d\) also tends to increase the presence of multicollinearity.
\[\begin{equation} y_i = \beta_0 + \beta_1 x_i + \beta_2 x^2_i + \beta_3 x^3_i \dots + \beta_d x^d_i + \epsilon_i, \end{equation}\]
An alternative to polynomials is to use step functions. Whereas polynomial functions impose a global non-linear relationship, step functions break the range of \(X\) into bins, and fit a simple constant (e.g., the mean response) in each. This amounts to converting a continuous feature into an ordered categorical variable such that our linear regression function is converted to the following equation
\[\begin{equation} y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \beta_3 C_3(x_i) \dots + \beta_d C_d(x_i) + \epsilon_i, \end{equation}\]
where \(C_1(x_i)\) represents \(x_i\) values ranging from \(c_1 \leq x_i < c_2\), \(C_2\left(x_i\right)\) represents \(x_i\) values ranging from \(c_2 \leq x_i < c_3\), \(\dots\), \(C_d\left(x_i\right)\) represents \(x_i\) values ranging from \(c_{d-1} \leq x_i < c_d\). The figure below contrasts linear, polynomial, and step function fits for non-linear, non-monotonic simulated data.
Although useful, the typical implementation of polynomial regression and step functions require the user to explicitly identify and incorporate which variables should have what specific degree of interaction or at what points of a variable \(X\) should cut points be made for the step functions. Considering many data sets today can easily contain 50, 100, or more features, this would require an enormous and unnecessary time commitment from an analyst to determine these explicit non-linear settings.
Multivariate adaptive regression splines (MARS) provide a convenient approach to capture the nonlinear relationships in the data by assessing cutpoints (knots) similar to step functions. The procedure assesses each data point for each predictor as a knot and creates a linear regression model with the candidate feature(s). For example, consider our non-linear, non-monotonic data above where \(Y = f\left(X\right)\). The MARS procedure will first look for the single point across the range of X
values where two different linear relationships between Y
and X
achieve the smallest error (e.g., smallest SSE). What results is known as a hinge function \(h\left(x-a\right)\), where \(a\) is the cutpoint value. For a single knot (Figure (A)), our hinge function is \(h\left(\text{x}-1.183606\right)\) such that our two linear models for Y
are
\[\begin{equation} \text{y} = \begin{cases} \beta_0 + \beta_1(1.183606 - \text{x}) & \text{x} < 1.183606, \\ \beta_0 + \beta_1(\text{x} - 1.183606) & \text{x} > 1.183606 \end{cases} \end{equation}\]
Once the first knot has been found, the search continues for a second knot which is found at \(x = 4.898114\) (plot B in figure below). This results in three linear models for y
:
\[\begin{equation} \text{y} = \begin{cases} \beta_0 + \beta_1(1.183606 - \text{x}) & \text{x} < 1.183606, \\ \beta_0 + \beta_1(\text{x} - 1.183606) & \text{x} > 1.183606 \quad \& \quad \text{x} < 4.898114, \\ \beta_0 + \beta_1(4.898114 - \text{x}) & \text{x} > 4.898114 \end{cases} \end{equation}\]
This procedure continues until many knots are found, producing a (potentially) highly non-linear prediction equation. Although including many knots may allow us to fit a really good relationship with our training data, it may not generalize very well to new, unseen data. Consequently, once the full set of knots has been identified, we can sequentially remove knots that do not contribute significantly to predictive accuracy. This process is known as “pruning” and we can use cross-validation, as we have with the previous models, to find the optimal number of knots.
We’ll start by fitting a basic model using Gr_Liv_Area
and Year_Built
to predict our Sales_Price
. In both R and Python, the modeling algorithm will assess all potential knots across all supplied features and then will prune to the optimal number of knots based on an expected change in \(R^2\) (for the training data) of less than 0.001. This calculation is performed by the Generalized cross-validation (GCV) procedure, which is a computational shortcut for linear models that produces an approximate leave-one-out cross-validation error metric (Golub, Heath, and Wahba 1979).
Note that MARS models do not require normalization or standardization of numeric features.
In both languages, the GCV is reported along with other metrics (\(R^2\), residual sum of squares (R) and MSE (Python)). We can use this info to compute the RMSE, which is in the low to mid $40K range. Keep in mind that this RMSE is not a cross-validated RMSE, just the RMSe on the training data (we’ll do some cross-validation shortly).
You’ll notice that in both the R and Python examples we see a knot in Gr_Liv_Area
at about 3500 (3395 in R and 3500 in Python). This suggests that as homes exceed 3500 square feet there is a change in the linear relationship between Gr_Liv_Area
and Sale_Price
compared to homes that are less than 3500 square feet. In both cases we see multiple knots across the range of Gr_Liv_Area
and Year_Built
. in fact, our model has taken two features (Gr_Liv_Area
and Year_Built
) and split them into 8-10 features (8 in R and 10 in Python) by creating knots along these feature values.
The most important tuning parameter in a MARS model is the number of knots to use for a given feature. However, this is automatically baked into the algorithm so we get this tuning for free!
The term “MARS” is trademarked and licensed exclusively to Salford Systems: https://www.salford-systems.com. We can use MARS as an abbreviation; however, it cannot be used for competing software solutions. This is why the R and Python packages use the name earth.
In Python we use the Earth()
model provided by the pyearth library. pyearth is one of many packages that are considered scikit-learn compatible “plug-in” packages. You can read more about these compatible plug-in packages here.
# Step 1: create model object
earth_mod = Earth()
# Step 2: fit/train model
earth_fit = earth_mod.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)
# Step 3: results
print(earth_fit.summary())
## Earth Model
## ------------------------------------------
## Basis Function Pruned Coefficient
## ------------------------------------------
## (Intercept) No 392053
## h(Gr_Liv_Area-3500) No -379.207
## h(3500-Gr_Liv_Area) No -85.3651
## h(Year_Built-2009) No 142719
## h(2009-Year_Built) No -1136.79
## h(Gr_Liv_Area-1905) No 46.5807
## h(1905-Gr_Liv_Area) Yes None
## h(Gr_Liv_Area-3140) No 194.026
## h(3140-Gr_Liv_Area) Yes None
## h(Gr_Liv_Area-2798) Yes None
## h(2798-Gr_Liv_Area) Yes None
## ------------------------------------------
## MSE: 2114325265.1482, GCV: 2147703342.5369, RSQ: 0.6767, GRSQ: 0.6719
# RMSE
np.sqrt(earth_fit.mse_)
## 45981.7927570051
In R, we use the parsnip::mars()
function which provides a default setting of using the earth
package, which is the most popular package for MARS functionality in R.
# Step 1: create ridge model object
mars_mod <- mars(mode = "regression")
# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
Sale_Price ~ Gr_Liv_Area + Year_Built,
data = ames_train
)
# Step 3: fit model workflow
mars_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(mars_mod) %>%
fit(data = ames_train)
# Step 4: results
mars_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 9 of 10 terms, and 2 of 2 predictors
## Termination condition: RSq changed by less than 0.001 at 10 terms
## Importance: Gr_Liv_Area, Year_Built
## Number of terms at each degree of interaction: 1 8 (additive model)
## GCV 1802525892 RSS 3.632344e+12 GRSq 0.7208907 RSq 0.7252347
# RMSE
sqrt(mars_fit$fit$fit$fit$rss / nrow(ames_train))
## [1] 42103.92
# coefficients
mars_fit$fit$fit$fit$coefficients
## ..y
## (Intercept) 295628.31484
## h(Gr_Liv_Area-3395) -772.99460
## h(3395-Gr_Liv_Area) -61.66360
## h(2002-Year_Built) -685.97770
## h(Gr_Liv_Area-2169) 27.57111
## h(Gr_Liv_Area-3194) 516.04236
## h(Gr_Liv_Area-1456) 41.77324
## h(Year_Built-1977) 910.94873
## h(Year_Built-2004) 12519.67884
Next, lets go ahead and fit a full model to include all Ames housing features. As you’ll see in the results we experience significant improvement in the RMSE (low to mid $20K for both R and Python). However, recall that this is not a cross-validated RMSE, but rather the RMSE for the training data in which the model was fit. Consequently, this is likely an overfit model but we’ll perform a cross-validation approach in the next section to get a more accurate generalized RMSE.
MARS models are no different then other algorithms where we need to convert categorical features to numeric values. This can be handled differently in R and Python but in the examples that follow the categorical features are one-hot encoded. This leads to over 300 features in our dataset; however, when you look at the results you will see that only a fraction of these features are used in the trained model (Python uses 17 and R uses 29). However, there will actually be more coefficients then the features used because of the knots. For example of the 17 features used by Python, 27 coefficients are developed since many of these features are used multiple times (i.e. h(Gr_Liv_Area-3112)
, h(3112-Gr_Liv_Area)
).
We call this process of feature elimination and knot refinement as “pruning the knots.”
In Python we must convert our categorical features to numeric values. For this example we’ll simply dummy encode them using Pandas get_dummies()
.
# create new feature set with encoded features
X_train_encoded = pd.get_dummies(X_train)
# create model object
earth_mod = Earth()
# fit/train model
earth_fit_full = earth_mod.fit(X_train_encoded, y_train)
# results
print(earth_fit_full.summary())
## Earth Model
## --------------------------------------------------
## Basis Function Pruned Coefficient
## --------------------------------------------------
## (Intercept) No -177719
## h(Gr_Liv_Area-3112) No 59.5685
## h(3112-Gr_Liv_Area) No -60.3777
## h(Total_Bsmt_SF-2271) Yes None
## h(2271-Total_Bsmt_SF) No -1028.23
## Bsmt_Qual_Excellent No 18871.5
## Year_Built No 413.296
## h(Garage_Area-736) No 58.9552
## h(736-Garage_Area) No -23.0269
## h(Bsmt_Unf_SF-1065) No -21.4074
## h(1065-Bsmt_Unf_SF) No 18.7046
## h(Total_Bsmt_SF-2140) Yes None
## h(2140-Total_Bsmt_SF) No 775.194
## h(Total_Bsmt_SF-1922) Yes None
## h(1922-Total_Bsmt_SF) Yes None
## Kitchen_AbvGr No -33360.4
## Kitchen_Qual_Excellent No 24444.4
## Kitchen_Qual_Good No 11374.4
## h(Latitude-42.0474) Yes None
## h(42.0474-Latitude) No -171340
## h(Lot_Area-3701) No 0.599888
## h(3701-Lot_Area) No -10.3207
## Neighborhood_Crawford No 31814.9
## Overall_Qual_Very_Excellent No 112026
## Overall_Qual_Excellent No 66842.3
## Overall_Qual_Very_Good No 27634.3
## Condition_2_PosN No -141910
## Sale_Condition_Abnorml No -17915.4
## h(Year_Remod_Add-2009) No 29262.5
## h(2009-Year_Remod_Add) No -256.866
## h(Total_Bsmt_SF-1374) No -188.541
## h(1374-Total_Bsmt_SF) No 222.954
## --------------------------------------------------
## MSE: 597867287.3568, GCV: 638285663.6011, RSQ: 0.9086, GRSQ: 0.9025
# RMSE
np.sqrt(earth_fit_full.mse_)
## 24451.324858928252
In R, the MARS modeling engine “earth” will automatically one-hot encode all categorical features.
# Step 1: create ridge model object
mars_mod <- mars(mode = "regression")
# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
Sale_Price ~ .,
data = ames_train
)
# Step 3: fit model workflow
mars_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(mars_mod) %>%
fit(data = ames_train)
# Step 4: results
mars_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 41 of 45 terms, and 29 of 307 predictors
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, ...
## Number of terms at each degree of interaction: 1 40 (additive model)
## GCV 511589986 RSS 967008439675 GRSq 0.9207836 RSq 0.9268516
# RMSE
sqrt(mars_fit$fit$fit$fit$rss / nrow(ames_train))
## [1] 21724.22
In addition to pruning the number of knots, MARS models allow us to also assess potential interactions between different hinge functions. The following example illustrates by allowing second degree interactions. You can see that now our model includes interaction terms between a maximum of two hinge functions (e.g. in the Python results you see h(Bsmt_Unf_SF-1065)*h(3112-Gr_Liv_Area)
which represents an interaction effect for those houses with more than 1,065 unfinished basement square footage but less than 3,112 above ground living space).
# create model object
earth_mod = Earth(max_degree=2)
# fit/train model
earth_fit_int = earth_mod.fit(X_train_encoded, y_train)
# results
print(earth_fit_int.summary())
## Earth Model
## -------------------------------------------------------------------
## Basis Function Pruned Coefficient
## -------------------------------------------------------------------
## (Intercept) No -2.29143e+06
## h(Gr_Liv_Area-3112) No 180.773
## h(3112-Gr_Liv_Area) No -79.6032
## h(Total_Bsmt_SF-2271) No -1485.52
## h(2271-Total_Bsmt_SF) No 1323.11
## Bsmt_Qual_Excellent No 26694.9
## Year_Built No 1116.61
## h(Total_Bsmt_SF-1930)*Year_Built No 0.747973
## h(1930-Total_Bsmt_SF)*Year_Built No -0.692553
## h(Bsmt_Unf_SF-1065)*h(3112-Gr_Liv_Area) No -0.0273563
## h(1065-Bsmt_Unf_SF)*h(3112-Gr_Liv_Area) No 0.0106462
## Land_Contour_Bnk*h(Gr_Liv_Area-3112) No -429.731
## h(Garage_Area-736)*Year_Built No 0.0922918
## h(736-Garage_Area)*Year_Built No -0.0440671
## Kitchen_AbvGr*Year_Built No -16.9025
## Exter_Qual_Excellent No 43826.8
## h(Lot_Area-3701)*Year_Built No 0.0153689
## h(3701-Lot_Area)*Year_Built Yes None
## h(Year_Remod_Add-2009)*Year_Built No 16.9883
## h(2009-Year_Remod_Add)*Year_Built No -0.19484
## Neighborhood_Crawford*h(2271-Total_Bsmt_SF) No 27.8665
## h(Garage_Area-736)*h(3112-Gr_Liv_Area) No -0.0902558
## h(736-Garage_Area)*h(3112-Gr_Liv_Area) No 0.0324932
## Functional_Typ*Year_Built No 10.1294
## h(Latitude-42.0503)*Year_Built Yes None
## h(42.0503-Latitude)*Year_Built No -108.044
## h(Lot_Area-3701) No -29.5457
## h(3701-Lot_Area) No -7.58563
## -------------------------------------------------------------------
## MSE: 597434761.6950, GCV: 636220318.1155, RSQ: 0.9086, GRSQ: 0.9028
# create ridge model object
mars_mod <- mars(mode = "regression", prod_degree = 2)
# create model & preprocessing recipe
model_recipe <- recipe(
Sale_Price ~ .,
data = ames_train
)
# fit model workflow
mars_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(mars_mod) %>%
fit(data = ames_train)
# coefficients
mars_fit$fit$fit$fit$coefficients
## ..y
## (Intercept) 3.473694e+05
## h(Gr_Liv_Area-3194) 2.302940e+02
## h(3194-Gr_Liv_Area) -7.005261e+01
## h(Year_Built-2002) 5.242461e+03
## h(2002-Year_Built) -7.360079e+02
## h(Total_Bsmt_SF-2223) 1.181093e+02
## h(2223-Total_Bsmt_SF) -4.901140e+01
## h(Year_Built-2002)*h(Gr_Liv_Area-2398) 9.035037e+00
## h(Year_Built-2002)*h(2398-Gr_Liv_Area) -3.382639e+00
## h(Bsmt_Unf_SF-625)*h(3194-Gr_Liv_Area) -1.145129e-02
## h(625-Bsmt_Unf_SF)*h(3194-Gr_Liv_Area) 9.905591e-03
## h(2002-Year_Built)*h(Total_Bsmt_SF-912) -7.674355e-01
## h(2002-Year_Built)*h(912-Total_Bsmt_SF) 2.277350e-01
## Mas_Vnr_TypeStone*h(Gr_Liv_Area-3194) -5.876451e+02
## h(Bedroom_AbvGr-4) -9.649772e+03
## h(4-Bedroom_AbvGr) 5.208836e+03
## Overall_QualVery_Excellent 1.095179e+05
## Overall_QualExcellent 7.165697e+04
## Overall_QualVery_Good 3.794092e+04
## h(Lot_Area-6720) 3.325029e+00
## h(6720-Lot_Area) -4.188039e+00
## h(2002-Year_Built)*h(Year_Remod_Add-1971) 7.085099e+00
## NeighborhoodCrawford*h(2002-Year_Built) 3.213659e+02
## h(Lot_Area-6720)*h(Garage_Cars-3) -1.943527e+00
## h(Lot_Area-6720)*h(3-Garage_Cars) -1.237131e+00
## NeighborhoodVeenker*Overall_QualExcellent -2.295300e+05
## Overall_QualGood 1.940091e+04
## Overall_QualAbove_Average*h(2223-Total_Bsmt_SF) 6.438535e+00
## h(2002-Year_Built)*Sale_ConditionNormal 2.183528e+02
## h(Lot_Area-6720)*h(144-Screen_Porch) -1.051737e-02
## FunctionalTyp 1.481969e+04
## h(Kitchen_AbvGr-1) -1.802740e+04
## h(First_Flr_SF-2444) -7.656120e+01
## h(2444-First_Flr_SF) -7.045054e+00
## NeighborhoodStone_Brook*h(Year_Built-2002) 1.033983e+04
## Overall_CondGood*h(2002-Year_Built) 1.598754e+02
## h(Longitude--93.6563) -9.367281e+05
## h(-93.6563-Longitude) -9.165180e+05
## h(3194-Gr_Liv_Area)*h(Longitude--93.6559) 4.432167e+02
## h(3194-Gr_Liv_Area)*h(-93.6559-Longitude) 3.803470e+02
## FunctionalTyp*h(Garage_Area-542) 3.500866e+01
## NeighborhoodGreen_Hills*FunctionalTyp 9.112140e+04
## h(4-Bedroom_AbvGr)*h(Fireplaces-1) 5.820943e+03
## h(4-Bedroom_AbvGr)*h(1-Fireplaces) -2.482522e+03
## h(Bsmt_Unf_SF-1237)*h(4-Bedroom_AbvGr) -3.737860e+01
There are two important tuning parameters associated with MARS models: the maximum degree of interactions and the number of terms retained in the final model. We need to perform a grid search to identify the optimal combination of these hyperparameters that minimize prediction error (the above pruning process was based only on an approximation of CV model performance on the training data rather than an exact k-fold CV process). As in previous modules, we’ll perform a CV grid search to identify the optimal hyperparameter mix. Below, we set up a grid that assesses different combinations of interaction complexity and the number of terms to retain in the final model.
Rarely is there any benefit in assessing greater than 3-rd degree interactions. Also, realize that tuning MARS models can become computationally intense. Its often beneficial to start with 10 evenly spaced values for the maximum terms and then you can always zoom in to a region once you find an approximate optimal solution.
In the examples that follow we see that the optimal models in both R and Python include 2 degree interactions and both are optimized by allowing around 30 terms.
In Python, our hyperparameters are max_terms
(maximum terms retained in the model) and max_degrees
(maximum degree of interaction allowed).
# create model object
earth_mod = Earth()
# define loss function
loss = 'neg_root_mean_squared_error'
# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)
# Create grid of hyperparameter values
hyper_grid = {'max_terms': range(1, 51, 3),
'max_degree': range(1, 3)}
grid_search = GridSearchCV(earth_mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)
# Optimal penalty parameter in grid search
results.best_estimator_
## Earth(max_degree=2, max_terms=31)
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 32358.54
results_df = pd.DataFrame(results.cv_results_.get("params"))
results_df['RMSE'] = np.abs(results.cv_results_.get("mean_test_score"))
results_df['max_degree'] = results_df['max_degree'].astype(str)
(ggplot(results_df, aes('max_terms', 'RMSE', color='max_degree'))
+ geom_line()
+ geom_point()
)
In R, our hyperparameters are num_terms
(number of terms retained in the model) and prod_degrees
(maximum degree of interaction allowed). Note that in R we can use num_terms()
and prod_degree()
from the dials package to create a tuning grid. By default prod_degree()
will use values of 1 and 2 (no interactions and 2 degree interactions).
# create ridge model object
mars_mod <- mars(mode = "regression", num_terms = tune(), prod_degree = tune()) %>%
set_engine("earth")
# create model & preprocessing recipe
model_recipe <- recipe(
Sale_Price ~ .,
data = ames_train
)
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)
# create our model recipe
model_recipe <- recipe(Sale_Price ~ ., data = ames_train)
# create a hyper parameter tuning grid
hyper_grid <- grid_regular(
num_terms(range = c(1, 100)),
prod_degree(),
levels = 50
)
# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(mars_mod, model_recipe, resamples = folds, grid = hyper_grid)
# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 37 2 rmse standard 26227. 5 3299. Preprocessor1_M…
## 2 35 2 rmse standard 26542. 5 3229. Preprocessor1_M…
## 3 45 2 rmse standard 26570. 5 3388. Preprocessor1_M…
## 4 47 2 rmse standard 26570. 5 3388. Preprocessor1_M…
## 5 49 2 rmse standard 26570. 5 3388. Preprocessor1_M…
autoplot(results)
MARS models use a backwards elimination feature selection routine that looks at reductions in the GCV estimate of error as each predictor is added to the model. This total reduction can be used as the variable importance measure ("gcv"
). Since MARS will automatically include and exclude terms during the pruning process, it essentially performs automated feature selection. If a predictor was never used in any of the MARS basis functions in the final model (after pruning), it has an importance value of zero. Alternatively, you can also monitor the change in the residual sums of squares (RSS) as terms are added ("rss"
); however, you will often see very little difference between these methods. The following examples extract the feature importance values based on RSS.
In both the R and Python examples we see very similar features as our most influential features (i.e. Gr_Liv_Area
, Year_Built
, Total_Bsmt_SF
). We can then use partial dependence plots (PDPs) to plot the change in the average predicted value (\(\hat{y}\)) as specified feature(s) vary over their marginal distribution. This essentially illustrates how our model predicts an output based on changes in a given feature. The examples that follow plot the PDP of Gr_Liv_Area
.
In Python, after we identify the model that performs best in our cross-validation procedure, we can re-train our model with those parameters and use feature_importance_type='rss'
to have our model record feature importance values. We can then extract this info and plot the most influential features.
best_mod = Earth(max_degree=2, max_terms=31, feature_importance_type='rss')
best_mod_fit = best_mod.fit(X_train_encoded, y_train)
vi = pd.DataFrame({'feature': best_mod_fit.xlabels_,
'importance': best_mod_fit.feature_importances_})
# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')
# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
+ geom_point()
+ labs(y=None))
We can then plot how our most influential features influence the predicted values. Here, you will notice that as Gr_Liv_Area
exceeds 3,000 square feet it has a stronger linear impact to our predicted Sales_Price
. Note that partial_dependence()
will automatically trim off the extreme values in our training data. If you supply percentiles=(0, 1)
as below it will use all values in the training data.
pd_results = partial_dependence(
best_mod_fit, X_train_encoded, "Gr_Liv_Area", kind='average',
percentiles=(0, 1))
pd_output = pd.DataFrame({'Gr_Liv_Area': pd_results['values'][0],
'yhat': pd_results['average'][0]})
(ggplot(pd_output, aes('Gr_Liv_Area', 'yhat'))
+ geom_line())
In R we can use the vip
package to extract our top 20 most important features based on the RSS measure.
# identify best model
lowest_rmse <- results %>%
select_best("rmse")
# extract best model and fit
final_model <- finalize_workflow(
workflow() %>% add_recipe(model_recipe) %>% add_model(mars_mod),
lowest_rmse) %>%
fit(ames_train)
# plot top 20 influential variables
final_model %>%
pull_workflow_fit() %>%
vip(20, type = "rss")
We can then use the pdp
package to extract the partial dependence values in order to plot. Note that for tidyverse models we need to create the custom prediction function to pass into pdp::parital()
.
# prediction function
pdp_pred_fun <- function(object, newdata) {
mean(predict(object, newdata, type = "numeric")$.pred)
}
# use the pdp package to extract partial dependence predictions
# and then plot
final_model %>%
pdp::partial(
pred.var = "Gr_Liv_Area",
pred.fun = pdp_pred_fun,
grid.resolution = 10,
train = ames_train
) %>%
ggplot(aes(Gr_Liv_Area, yhat)) +
geom_line() +
scale_y_continuous(labels = scales::dollar)
There are several advantages to MARS. First, MARS naturally handles mixed types of predictors (quantitative and qualitative). MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups.1 Each group then generates a pair of piecewise indicator functions for the two categories. MARS also requires minimal feature engineering (e.g., feature scaling) and performs automated feature selection. For example, since MARS scans each predictor to identify a split that improves predictive accuracy, non-informative features will not be chosen. Furthermore, highly correlated predictors do not impede predictive accuracy as much as they do with OLS models.
However, one disadvantage to MARS models is that they’re typically slower to train. Since the algorithm scans each value of each predictor for potential cutpoints, computational performance can suffer as both \(n\) and \(p\) increase. Also, although correlated predictors do not necessarily impede model performance, they can make model interpretation difficult. When two features are nearly perfectly correlated, the algorithm will essentially select the first one it happens to come across when scanning the features. Then, since it randomly selected one, the correlated feature will likely not be included as it adds no additional explanatory power.
Using the hitters
dataset:
This is very similar to CART-like decision trees which you’ll be exposed to in a later module.↩︎