The previous modules discussed algorithms that are intrinsically linear. Many of these models can be adapted to nonlinear patterns in the data by manually adding nonlinear model terms (e.g., squared terms, interaction effects, and other transformations of the original features); however, to do so, the analyst must know the specific nature of the nonlinearities and interactions a priori. Alternatively, there are numerous algorithms that are inherently nonlinear. When using these models, the exact form of the nonlinearity does not need to be known explicitly or specified prior to model training. Rather, these algorithms search for, and discover, nonlinearities and interactions in the data that help maximize predictive accuracy.

This module discusses multivariate adaptive regression splines (MARS) (Friedman 1991), an algorithm that automatically creates a piecewise linear model and thus provides an intuitive stepping stone into nonlinearity after grasping the concept of multiple linear regression. Future modules will focus on other nonlinear algorithms.

Learning objectives

By the end of this module you will know:

  • How MARS models incorporate non-linear relationships through hinge functions and their cutpoints (aka “knots”).
  • How to train and tune MARS models using R and/or Python.
  • How to identify and visualize the most influential features in MARS models.

Prerequisites

# Helper packages
import numpy as np
import pandas as pd
from plotnine import *

# Modeling packages
from pyearth import Earth
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.inspection import partial_dependence
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(tidyverse)   # for data wrangling & plotting

# Modeling packages
library(tidymodels)  # for fitting MARS models

# Model interpretability packages
library(vip)         # for variable importance
library(pdp)         # for variable relationships
# Stratified sampling with the rsample package
set.seed(123)
ames <- AmesHousing::make_ames()
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

The basic idea

In the previous modules, we focused on linear models (where the analyst has to explicitly specify any nonlinear relationships and interaction effects). We illustrated some of the advantages of linear models such as their ease and speed of computation and also the intuitive nature of interpreting their coefficients. However, linear models make a strong assumption about linearity, and this assumption is often a poor one, which can affect predictive accuracy.

We can extend linear models to capture non-linear relationships. Typically, this is done by explicitly including polynomial terms (e.g., \(x_i^2\)) or step functions. Polynomial regression is a form of regression in which the relationship between \(X\) and \(Y\) is modeled as a \(d\)th-degree polynomial in \(X\), as in the equation below. Generally speaking, it is unusual to use \(d\) greater than 3 or 4 because, as \(d\) increases, the fitted function becomes overly flexible and oddly shaped, especially near the boundaries of the range of \(X\) values. Increasing \(d\) also tends to increase the presence of multicollinearity.

\[\begin{equation} y_i = \beta_0 + \beta_1 x_i + \beta_2 x^2_i + \beta_3 x^3_i \dots + \beta_d x^d_i + \epsilon_i, \end{equation}\]
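To make this concrete, here is a minimal sketch (illustrative only; it uses simulated non-linear, non-monotonic data rather than the Ames data) that fits a degree-3 polynomial with scikit-learn's PolynomialFeatures combined with ordinary least squares:

# simulate non-linear, non-monotonic data (hypothetical example)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(123)
x = np.sort(rng.uniform(0, 2 * np.pi, 150))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# fit y ~ b0 + b1*x + b2*x^2 + b3*x^3
poly_fit = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_fit.fit(x.reshape(-1, 1), y)
y_hat_poly = poly_fit.predict(x.reshape(-1, 1))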

An alternative to polynomials is to use step functions. Whereas polynomial functions impose a global non-linear relationship, step functions break the range of \(X\) into bins, and fit a simple constant (e.g., the mean response) in each. This amounts to converting a continuous feature into an ordered categorical variable such that our linear regression function is converted to the following equation

\[\begin{equation} y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \beta_3 C_3(x_i) \dots + \beta_d C_d(x_i) + \epsilon_i, \end{equation}\]

where \(C_1(x_i)\) represents \(x_i\) values ranging from \(c_1 \leq x_i < c_2\), \(C_2\left(x_i\right)\) represents \(x_i\) values ranging from \(c_2 \leq x_i < c_3\), \(\dots\), \(C_d\left(x_i\right)\) represents \(x_i\) values ranging from \(c_{d-1} \leq x_i < c_d\). The figure below contrasts linear, polynomial, and step function fits for non-linear, non-monotonic simulated data.
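For comparison, a step-function fit can be sketched by binning \(X\), dummy encoding the bins, and regressing \(Y\) on the bin indicators (again with simulated data, purely for illustration):

# simulate the same style of non-linear, non-monotonic data (hypothetical example)
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(123)
x = np.sort(rng.uniform(0, 2 * np.pi, 150))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# cut x into six bins and dummy encode the categories C_1(x), ..., C_6(x)
bin_dummies = pd.get_dummies(pd.cut(x, bins=6))
step_fit = LinearRegression().fit(bin_dummies.to_numpy(), y)
y_hat_step = step_fit.predict(bin_dummies.to_numpy())   # piecewise-constant predictions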

Figure: Blue line represents predicted (`y`) values as a function of `x` for alternative approaches to modeling explicit nonlinear regression patterns. (A) Traditional linear regression approach does not capture any nonlinearity unless the predictor or response is transformed (i.e. log transformation). (B) Degree-2 polynomial, (C) Degree-3 polynomial, (D) Step function cutting `x` into six categorical levels.


Although useful, the typical implementation of polynomial regression and step functions require the user to explicitly identify and incorporate which variables should have what specific degree of interaction or at what points of a variable \(X\) should cut points be made for the step functions. Considering many data sets today can easily contain 50, 100, or more features, this would require an enormous and unnecessary time commitment from an analyst to determine these explicit non-linear settings.

Multivariate adaptive regression splines

Multivariate adaptive regression splines (MARS) provide a convenient approach to capture the nonlinear relationships in the data by assessing cutpoints (knots) similar to step functions. The procedure assesses each data point for each predictor as a knot and creates a linear regression model with the candidate feature(s). For example, consider our non-linear, non-monotonic data above where \(Y = f\left(X\right)\). The MARS procedure will first look for the single point across the range of X values where two different linear relationships between Y and X achieve the smallest error (e.g., smallest SSE). What results is known as a hinge function \(h\left(x-a\right)\), where \(a\) is the cutpoint value. For a single knot (Figure (A)), our hinge function is \(h\left(\text{x}-1.183606\right)\) such that our two linear models for Y are

\[\begin{equation} \text{y} = \begin{cases} \beta_0 + \beta_1(1.183606 - \text{x}) & \text{x} < 1.183606, \\ \beta_0 + \beta_1(\text{x} - 1.183606) & \text{x} > 1.183606 \end{cases} \end{equation}\]

Once the first knot has been found, the search continues for a second knot which is found at \(x = 4.898114\) (plot B in figure below). This results in three linear models for y:

\[\begin{equation} \text{y} = \begin{cases} \beta_0 + \beta_1(1.183606 - \text{x}) & \text{x} < 1.183606, \\ \beta_0 + \beta_1(\text{x} - 1.183606) & \text{x} > 1.183606 \quad \& \quad \text{x} < 4.898114, \\ \beta_0 + \beta_1(4.898114 - \text{x}) & \text{x} > 4.898114 \end{cases} \end{equation}\]
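The hinge function itself is simply \(h(x - a) = \max(0, x - a)\). A minimal sketch of a single-knot fit, constructing the two mirrored hinge features by hand and estimating their coefficients with ordinary least squares (using simulated data and the text's first knot value, purely for illustration), looks like this:

# hinge (basis) function: zero below the knot, linear above it
import numpy as np
from sklearn.linear_model import LinearRegression

def hinge(x, a):
    return np.maximum(0, x - a)

rng = np.random.default_rng(123)
x = np.sort(rng.uniform(0, 2 * np.pi, 150))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# two mirrored hinge features at the knot a = 1.183606: h(x - a) and h(a - x)
knot = 1.183606
X_basis = np.column_stack([hinge(x, knot), hinge(knot, x)])
hinge_fit = LinearRegression().fit(X_basis, y)   # piecewise linear in x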

Figure: Examples of fitted regression splines of one (A), two (B), three (C), and four (D) knots.


This procedure continues until many knots are found, producing a (potentially) highly non-linear prediction equation. Although including many knots may allow us to fit a really good relationship with our training data, it may not generalize very well to new, unseen data. Consequently, once the full set of knots has been identified, we can sequentially remove knots that do not contribute significantly to predictive accuracy. This process is known as “pruning” and we can use cross-validation, as we have with the previous models, to find the optimal number of knots.

Fitting a MARS model

Fitting a basic model

We’ll start by fitting a basic model using Gr_Liv_Area and Year_Built to predict Sale_Price. In both R and Python, the modeling algorithm will assess all potential knots across all supplied features and then prune to the optimal number of knots based on an expected change in \(R^2\) (for the training data) of less than 0.001. This calculation is performed by the generalized cross-validation (GCV) procedure, a computational shortcut for linear models that produces an approximate leave-one-out cross-validation error metric (Golub, Heath, and Wahba 1979).
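For intuition, GCV is commonly written as the training MSE inflated by an effective number of parameters. The rough sketch below illustrates that idea; the exact bookkeeping of terms, knots, and the penalty is handled internally by earth/pyearth, so treat the C(M) form and default penalty here as assumptions rather than the packages' exact implementation.

import numpy as np

def gcv(y, y_hat, n_terms, n_knots, penalty=3):
    """Approximate GCV: MSE / (1 - C(M)/n)^2, with C(M) an assumed
    effective number of parameters (terms plus a per-knot penalty)."""
    n = len(y)
    mse = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    effective_params = n_terms + penalty * n_knots
    return mse / (1 - effective_params / n) ** 2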

Note that MARS models do not require normalization or standardization of numeric features.

In both languages, the GCV is reported along with other metrics (\(R^2\), plus the residual sum of squares (RSS) in R and the MSE in Python). We can use this info to compute the RMSE, which is in the low to mid $40K range. Keep in mind that this is not a cross-validated RMSE, just the RMSE on the training data (we’ll do some cross-validation shortly).

You’ll notice that in both the R and Python examples we see a knot in Gr_Liv_Area at about 3500 (3395 in R and 3500 in Python). This suggests that as homes exceed 3500 square feet there is a change in the linear relationship between Gr_Liv_Area and Sale_Price compared to homes that are less than 3500 square feet. In both cases we see multiple knots across the range of Gr_Liv_Area and Year_Built. In fact, our model has taken two features (Gr_Liv_Area and Year_Built) and split them into 8-10 basis functions (8 in R and 10 in Python) by creating knots along these feature values.

The most important tuning parameter in a MARS model is the number of knots to use for a given feature. However, this is automatically baked into the algorithm so we get this tuning for free!

The term “MARS” is trademarked and licensed exclusively to Salford Systems: https://www.salford-systems.com. We can use MARS as an abbreviation; however, it cannot be used for competing software solutions. This is why the R and Python packages use the name earth.

In Python we use the Earth() model provided by the pyearth library. pyearth is one of many packages that are considered scikit-learn compatible “plug-in” packages; you can read more about these compatible packages in the scikit-learn documentation.

# Step 1: create model object
earth_mod = Earth()

# Step 2: fit/train model
earth_fit = earth_mod.fit(X_train[["Gr_Liv_Area", "Year_Built"]], y_train)

# Step 3: results
print(earth_fit.summary())
## Earth Model
## ------------------------------------------
## Basis Function       Pruned  Coefficient  
## ------------------------------------------
## (Intercept)          No      392053       
## h(Gr_Liv_Area-3500)  No      -379.207     
## h(3500-Gr_Liv_Area)  No      -85.3651     
## h(Year_Built-2009)   No      142719       
## h(2009-Year_Built)   No      -1136.79     
## h(Gr_Liv_Area-1905)  No      46.5807      
## h(1905-Gr_Liv_Area)  Yes     None         
## h(Gr_Liv_Area-3140)  No      194.026      
## h(3140-Gr_Liv_Area)  Yes     None         
## h(Gr_Liv_Area-2798)  Yes     None         
## h(2798-Gr_Liv_Area)  Yes     None         
## ------------------------------------------
## MSE: 2114325265.1482, GCV: 2147703342.5369, RSQ: 0.6767, GRSQ: 0.6719
# RMSE
np.sqrt(earth_fit.mse_)
## 45981.7927570051

In R, we use the parsnip::mars() function, which by default uses the earth package, the most popular package for MARS functionality in R.

# Step 1: create MARS model object
mars_mod <- mars(mode = "regression")

# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
    Sale_Price ~ Gr_Liv_Area + Year_Built, 
    data = ames_train
  )
  
# Step 3: fit model workflow
mars_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(mars_mod) %>%
  fit(data = ames_train)

# Step 4: results
mars_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 9 of 10 terms, and 2 of 2 predictors
## Termination condition: RSq changed by less than 0.001 at 10 terms
## Importance: Gr_Liv_Area, Year_Built
## Number of terms at each degree of interaction: 1 8 (additive model)
## GCV 1802525892    RSS 3.632344e+12    GRSq 0.7208907    RSq 0.7252347
# RMSE
sqrt(mars_fit$fit$fit$fit$rss / nrow(ames_train))
## [1] 42103.92
# coefficients
mars_fit$fit$fit$fit$coefficients
##                              ..y
## (Intercept)         295628.31484
## h(Gr_Liv_Area-3395)   -772.99460
## h(3395-Gr_Liv_Area)    -61.66360
## h(2002-Year_Built)    -685.97770
## h(Gr_Liv_Area-2169)     27.57111
## h(Gr_Liv_Area-3194)    516.04236
## h(Gr_Liv_Area-1456)     41.77324
## h(Year_Built-1977)     910.94873
## h(Year_Built-2004)   12519.67884

Fitting a full model

Next, lets go ahead and fit a full model to include all Ames housing features. As you’ll see in the results we experience significant improvement in the RMSE (low to mid $20K for both R and Python). However, recall that this is not a cross-validated RMSE, but rather the RMSE for the training data in which the model was fit. Consequently, this is likely an overfit model but we’ll perform a cross-validation approach in the next section to get a more accurate generalized RMSE.

MARS models are no different than other algorithms in that we need to convert categorical features to numeric values. This can be handled differently in R and Python, but in the examples that follow the categorical features are one-hot encoded. This leads to over 300 features in our dataset; however, when you look at the results you will see that only a fraction of these features are used in the trained model (Python uses 17 and R uses 29). However, there will actually be more coefficients than features used because of the knots. For example, of the 17 features used in Python, 27 coefficients are estimated since many of these features are used multiple times (e.g., h(Gr_Liv_Area-3112), h(3112-Gr_Liv_Area)).

We refer to this process of feature elimination and knot refinement as “pruning the knots.”

In Python we must convert our categorical features to numeric values. For this example we’ll simply dummy encode them using Pandas get_dummies().

# create new feature set with encoded features
X_train_encoded = pd.get_dummies(X_train)
# create model object
earth_mod = Earth()

# fit/train model
earth_fit_full = earth_mod.fit(X_train_encoded, y_train)

# results
print(earth_fit_full.summary())
## Earth Model
## --------------------------------------------------
## Basis Function               Pruned  Coefficient  
## --------------------------------------------------
## (Intercept)                  No      -177719      
## h(Gr_Liv_Area-3112)          No      59.5685      
## h(3112-Gr_Liv_Area)          No      -60.3777     
## h(Total_Bsmt_SF-2271)        Yes     None         
## h(2271-Total_Bsmt_SF)        No      -1028.23     
## Bsmt_Qual_Excellent          No      18871.5      
## Year_Built                   No      413.296      
## h(Garage_Area-736)           No      58.9552      
## h(736-Garage_Area)           No      -23.0269     
## h(Bsmt_Unf_SF-1065)          No      -21.4074     
## h(1065-Bsmt_Unf_SF)          No      18.7046      
## h(Total_Bsmt_SF-2140)        Yes     None         
## h(2140-Total_Bsmt_SF)        No      775.194      
## h(Total_Bsmt_SF-1922)        Yes     None         
## h(1922-Total_Bsmt_SF)        Yes     None         
## Kitchen_AbvGr                No      -33360.4     
## Kitchen_Qual_Excellent       No      24444.4      
## Kitchen_Qual_Good            No      11374.4      
## h(Latitude-42.0474)          Yes     None         
## h(42.0474-Latitude)          No      -171340      
## h(Lot_Area-3701)             No      0.599888     
## h(3701-Lot_Area)             No      -10.3207     
## Neighborhood_Crawford        No      31814.9      
## Overall_Qual_Very_Excellent  No      112026       
## Overall_Qual_Excellent       No      66842.3      
## Overall_Qual_Very_Good       No      27634.3      
## Condition_2_PosN             No      -141910      
## Sale_Condition_Abnorml       No      -17915.4     
## h(Year_Remod_Add-2009)       No      29262.5      
## h(2009-Year_Remod_Add)       No      -256.866     
## h(Total_Bsmt_SF-1374)        No      -188.541     
## h(1374-Total_Bsmt_SF)        No      222.954      
## --------------------------------------------------
## MSE: 597867287.3568, GCV: 638285663.6011, RSQ: 0.9086, GRSQ: 0.9025
# RMSE
np.sqrt(earth_fit_full.mse_)
## 24451.324858928252

In R, the MARS modeling engine “earth” will automatically one-hot encode all categorical features.

# Step 1: create MARS model object
mars_mod <- mars(mode = "regression")

# Step 2: create model & preprocessing recipe
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )
  
# Step 3: fit model workflow
mars_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(mars_mod) %>%
  fit(data = ames_train)

# Step 4: results
mars_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 41 of 45 terms, and 29 of 307 predictors
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, ...
## Number of terms at each degree of interaction: 1 40 (additive model)
## GCV 511589986    RSS 967008439675    GRSq 0.9207836    RSq 0.9268516
# RMSE
sqrt(mars_fit$fit$fit$fit$rss / nrow(ames_train))
## [1] 21724.22

Fitting a full model with interactions

In addition to pruning the number of knots, MARS models also allow us to assess potential interactions between different hinge functions. The following example illustrates this by allowing second-degree interactions. You can see that now our model includes interaction terms between a maximum of two hinge functions (e.g., in the Python results you see h(Bsmt_Unf_SF-1065)*h(3112-Gr_Liv_Area), which represents an interaction effect for those houses with more than 1,065 square feet of unfinished basement space but less than 3,112 square feet of above-ground living space).

# create model object
earth_mod = Earth(max_degree=2)

# fit/train model
earth_fit_int = earth_mod.fit(X_train_encoded, y_train)

# results
print(earth_fit_int.summary())
## Earth Model
## -------------------------------------------------------------------
## Basis Function                               Pruned  Coefficient   
## -------------------------------------------------------------------
## (Intercept)                                  No      -2.29143e+06  
## h(Gr_Liv_Area-3112)                          No      180.773       
## h(3112-Gr_Liv_Area)                          No      -79.6032      
## h(Total_Bsmt_SF-2271)                        No      -1485.52      
## h(2271-Total_Bsmt_SF)                        No      1323.11       
## Bsmt_Qual_Excellent                          No      26694.9       
## Year_Built                                   No      1116.61       
## h(Total_Bsmt_SF-1930)*Year_Built             No      0.747973      
## h(1930-Total_Bsmt_SF)*Year_Built             No      -0.692553     
## h(Bsmt_Unf_SF-1065)*h(3112-Gr_Liv_Area)      No      -0.0273563    
## h(1065-Bsmt_Unf_SF)*h(3112-Gr_Liv_Area)      No      0.0106462     
## Land_Contour_Bnk*h(Gr_Liv_Area-3112)         No      -429.731      
## h(Garage_Area-736)*Year_Built                No      0.0922918     
## h(736-Garage_Area)*Year_Built                No      -0.0440671    
## Kitchen_AbvGr*Year_Built                     No      -16.9025      
## Exter_Qual_Excellent                         No      43826.8       
## h(Lot_Area-3701)*Year_Built                  No      0.0153689     
## h(3701-Lot_Area)*Year_Built                  Yes     None          
## h(Year_Remod_Add-2009)*Year_Built            No      16.9883       
## h(2009-Year_Remod_Add)*Year_Built            No      -0.19484      
## Neighborhood_Crawford*h(2271-Total_Bsmt_SF)  No      27.8665       
## h(Garage_Area-736)*h(3112-Gr_Liv_Area)       No      -0.0902558    
## h(736-Garage_Area)*h(3112-Gr_Liv_Area)       No      0.0324932     
## Functional_Typ*Year_Built                    No      10.1294       
## h(Latitude-42.0503)*Year_Built               Yes     None          
## h(42.0503-Latitude)*Year_Built               No      -108.044      
## h(Lot_Area-3701)                             No      -29.5457      
## h(3701-Lot_Area)                             No      -7.58563      
## -------------------------------------------------------------------
## MSE: 597434761.6950, GCV: 636220318.1155, RSQ: 0.9086, GRSQ: 0.9028

# create MARS model object
mars_mod <- mars(mode = "regression", prod_degree = 2)

# create model & preprocessing recipe
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )
  
# fit model workflow
mars_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(mars_mod) %>%
  fit(data = ames_train)

# coefficients
mars_fit$fit$fit$fit$coefficients
##                                                           ..y
## (Intercept)                                      3.473694e+05
## h(Gr_Liv_Area-3194)                              2.302940e+02
## h(3194-Gr_Liv_Area)                             -7.005261e+01
## h(Year_Built-2002)                               5.242461e+03
## h(2002-Year_Built)                              -7.360079e+02
## h(Total_Bsmt_SF-2223)                            1.181093e+02
## h(2223-Total_Bsmt_SF)                           -4.901140e+01
## h(Year_Built-2002)*h(Gr_Liv_Area-2398)           9.035037e+00
## h(Year_Built-2002)*h(2398-Gr_Liv_Area)          -3.382639e+00
## h(Bsmt_Unf_SF-625)*h(3194-Gr_Liv_Area)          -1.145129e-02
## h(625-Bsmt_Unf_SF)*h(3194-Gr_Liv_Area)           9.905591e-03
## h(2002-Year_Built)*h(Total_Bsmt_SF-912)         -7.674355e-01
## h(2002-Year_Built)*h(912-Total_Bsmt_SF)          2.277350e-01
## Mas_Vnr_TypeStone*h(Gr_Liv_Area-3194)           -5.876451e+02
## h(Bedroom_AbvGr-4)                              -9.649772e+03
## h(4-Bedroom_AbvGr)                               5.208836e+03
## Overall_QualVery_Excellent                       1.095179e+05
## Overall_QualExcellent                            7.165697e+04
## Overall_QualVery_Good                            3.794092e+04
## h(Lot_Area-6720)                                 3.325029e+00
## h(6720-Lot_Area)                                -4.188039e+00
## h(2002-Year_Built)*h(Year_Remod_Add-1971)        7.085099e+00
## NeighborhoodCrawford*h(2002-Year_Built)          3.213659e+02
## h(Lot_Area-6720)*h(Garage_Cars-3)               -1.943527e+00
## h(Lot_Area-6720)*h(3-Garage_Cars)               -1.237131e+00
## NeighborhoodVeenker*Overall_QualExcellent       -2.295300e+05
## Overall_QualGood                                 1.940091e+04
## Overall_QualAbove_Average*h(2223-Total_Bsmt_SF)  6.438535e+00
## h(2002-Year_Built)*Sale_ConditionNormal          2.183528e+02
## h(Lot_Area-6720)*h(144-Screen_Porch)            -1.051737e-02
## FunctionalTyp                                    1.481969e+04
## h(Kitchen_AbvGr-1)                              -1.802740e+04
## h(First_Flr_SF-2444)                            -7.656120e+01
## h(2444-First_Flr_SF)                            -7.045054e+00
## NeighborhoodStone_Brook*h(Year_Built-2002)       1.033983e+04
## Overall_CondGood*h(2002-Year_Built)              1.598754e+02
## h(Longitude--93.6563)                           -9.367281e+05
## h(-93.6563-Longitude)                           -9.165180e+05
## h(3194-Gr_Liv_Area)*h(Longitude--93.6559)        4.432167e+02
## h(3194-Gr_Liv_Area)*h(-93.6559-Longitude)        3.803470e+02
## FunctionalTyp*h(Garage_Area-542)                 3.500866e+01
## NeighborhoodGreen_Hills*FunctionalTyp            9.112140e+04
## h(4-Bedroom_AbvGr)*h(Fireplaces-1)               5.820943e+03
## h(4-Bedroom_AbvGr)*h(1-Fireplaces)              -2.482522e+03
## h(Bsmt_Unf_SF-1237)*h(4-Bedroom_AbvGr)          -3.737860e+01

Tuning

There are two important tuning parameters associated with MARS models: the maximum degree of interactions and the number of terms retained in the final model. We need to perform a grid search to identify the combination of these hyperparameters that minimizes prediction error (the above pruning process was based only on an approximation of CV model performance on the training data rather than an exact k-fold CV process). As in previous modules, we’ll perform a CV grid search to identify the optimal hyperparameter mix. Below, we set up a grid that assesses different combinations of interaction complexity and the number of terms to retain in the final model.

Rarely is there any benefit in assessing interactions greater than third degree. Also, realize that tuning MARS models can become computationally intense. It’s often beneficial to start with 10 evenly spaced values for the maximum number of terms; you can always zoom in on a region once you find an approximate optimal solution.

In the examples that follow we see that the optimal models in both R and Python include second-degree interactions and both are optimized by retaining roughly 30-40 terms (31 in Python and 37 in R).

In Python, our hyperparameters are max_terms (the maximum number of terms retained in the model) and max_degree (the maximum degree of interaction allowed).

# create model object
earth_mod = Earth()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 5 fold CV object
kfold = KFold(n_splits=5, random_state=123, shuffle=True)

# Create grid of hyperparameter values
hyper_grid = {'max_terms': range(1, 51, 3),
              'max_degree': range(1, 3)}
grid_search = GridSearchCV(earth_mod, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train_encoded, y_train)

# Optimal penalty parameter in grid search
results.best_estimator_
## Earth(max_degree=2, max_terms=31)
# Best model's cross validated RMSE
round(abs(results.best_score_), 2)
## 32358.54
results_df = pd.DataFrame(results.cv_results_.get("params"))
results_df['RMSE'] = np.abs(results.cv_results_.get("mean_test_score"))
results_df['max_degree'] = results_df['max_degree'].astype(str)

(ggplot(results_df, aes('max_terms', 'RMSE', color='max_degree'))
  + geom_line()
  + geom_point()
  )

In R, our hyperparameters are num_terms (the number of terms retained in the model) and prod_degree (the maximum degree of interaction allowed). Note that in R we can use num_terms() and prod_degree() from the dials package to create a tuning grid. By default prod_degree() will use values of 1 and 2 (no interactions and second-degree interactions).

# create MARS model object
mars_mod <- mars(mode = "regression", num_terms = tune(), prod_degree = tune()) %>%
 set_engine("earth")

# create model & preprocessing recipe
model_recipe <- recipe(
    Sale_Price ~ ., 
    data = ames_train
  )
  
# create k-fold cross validation object
folds <- vfold_cv(ames_train, v = 5)

# create our model recipe
model_recipe <- recipe(Sale_Price ~ ., data = ames_train)

# create a hyper parameter tuning grid
hyper_grid <- grid_regular(
 num_terms(range = c(1, 100)), 
 prod_degree(),
 levels = 50
 )

# train our model across the hyper parameter grid
set.seed(123)
results <- tune_grid(mars_mod, model_recipe, resamples = folds, grid = hyper_grid)

# get best results
show_best(results, metric = "rmse")
## # A tibble: 5 x 8
##   num_terms prod_degree .metric .estimator   mean     n std_err .config         
##       <int>       <int> <chr>   <chr>       <dbl> <int>   <dbl> <chr>           
## 1        37           2 rmse    standard   26227.     5   3299. Preprocessor1_M…
## 2        35           2 rmse    standard   26542.     5   3229. Preprocessor1_M…
## 3        45           2 rmse    standard   26570.     5   3388. Preprocessor1_M…
## 4        47           2 rmse    standard   26570.     5   3388. Preprocessor1_M…
## 5        49           2 rmse    standard   26570.     5   3388. Preprocessor1_M…
autoplot(results)

Feature interpretation

MARS models use a backwards elimination feature selection routine that looks at reductions in the GCV estimate of error as each predictor is added to the model. This total reduction can be used as the variable importance measure ("gcv"). Since MARS will automatically include and exclude terms during the pruning process, it essentially performs automated feature selection. If a predictor was never used in any of the MARS basis functions in the final model (after pruning), it has an importance value of zero. Alternatively, you can also monitor the change in the residual sums of squares (RSS) as terms are added ("rss"); however, you will often see very little difference between these methods. The following examples extract the feature importance values based on RSS.

In both the R and Python examples we see very similar features as our most influential features (i.e. Gr_Liv_Area, Year_Built, Total_Bsmt_SF). We can then use partial dependence plots (PDPs) to plot the change in the average predicted value (\(\hat{y}\)) as specified feature(s) vary over their marginal distribution. This essentially illustrates how our model predicts an output based on changes in a given feature. The examples that follow plot the PDP of Gr_Liv_Area.

In Python, after we identify the model that performs best in our cross-validation procedure, we can re-train our model with those parameters and use feature_importance_type='rss' to have our model record feature importance values. We can then extract this info and plot the most influential features.

best_mod = Earth(max_degree=2, max_terms=31, feature_importance_type='rss')
best_mod_fit = best_mod.fit(X_train_encoded, y_train)

vi = pd.DataFrame({'feature': best_mod_fit.xlabels_,
              'importance': best_mod_fit.feature_importances_})

# get top 20 influential features
top_20_features = vi.nlargest(20, 'importance')

# plot feature importance
(ggplot(top_20_features, aes(x='importance', y='reorder(feature, importance)'))
 + geom_point()
 + labs(y=None))

We can then plot how our most influential features influence the predicted values. Here, you will notice that as Gr_Liv_Area exceeds 3,000 square feet it has a stronger linear impact on our predicted Sale_Price. Note that partial_dependence() will automatically trim off the extreme values in our training data. If you supply percentiles=(0, 1) as below, it will use all values in the training data.

pd_results = partial_dependence(
  best_mod_fit, X_train_encoded, "Gr_Liv_Area", kind='average',
  percentiles=(0, 1)) 
  
pd_output = pd.DataFrame({'Gr_Liv_Area': pd_results['values'][0],
                          'yhat': pd_results['average'][0]})
                          
(ggplot(pd_output, aes('Gr_Liv_Area', 'yhat'))
  + geom_line())

In R we can use the vip package to extract our top 20 most important features based on the RSS measure.

# identify best model
lowest_rmse <- results %>%
  select_best("rmse")

# extract best model and fit
final_model <- finalize_workflow(
  workflow() %>% add_recipe(model_recipe) %>% add_model(mars_mod), 
  lowest_rmse) %>%
  fit(ames_train)

# plot top 20 influential variables
final_model %>%
  pull_workflow_fit() %>% 
  vip(20, type = "rss")

We can then use the pdp package to extract the partial dependence values in order to plot them. Note that for tidymodels workflows we need to create a custom prediction function to pass into pdp::partial().

# prediction function
pdp_pred_fun <- function(object, newdata) {
  mean(predict(object, newdata, type = "numeric")$.pred)
}

# use the pdp package to extract partial dependence predictions
# and then plot
final_model %>%
  pdp::partial(
   pred.var = "Gr_Liv_Area", 
   pred.fun = pdp_pred_fun,
   grid.resolution = 10, 
   train = ames_train
  ) %>%
  ggplot(aes(Gr_Liv_Area, yhat)) +
  geom_line() +
  scale_y_continuous(labels = scales::dollar)

Final thoughts

There are several advantages to MARS. First, MARS naturally handles mixed types of predictors (quantitative and qualitative). MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups.1 Each group then generates a pair of piecewise indicator functions for the two categories. MARS also requires minimal feature engineering (e.g., feature scaling) and performs automated feature selection. For example, since MARS scans each predictor to identify a split that improves predictive accuracy, non-informative features will not be chosen. Furthermore, highly correlated predictors do not impede predictive accuracy as much as they do with OLS models.

However, one disadvantage to MARS models is that they’re typically slower to train. Since the algorithm scans each value of each predictor for potential cutpoints, computational performance can suffer as both \(n\) and \(p\) increase. Also, although correlated predictors do not necessarily impede model performance, they can make model interpretation difficult. When two features are nearly perfectly correlated, the algorithm will essentially select whichever one it happens to come across first when scanning the features. Then, because that selection was essentially arbitrary, the other correlated feature will likely not be included since it adds no additional explanatory power.

Exercises

Using the hitters dataset:

  1. Apply a MARS model with all features.
  2. How does the model performance compare to your previous models?
  3. How many of the features are influential? Which 10 features are considered most influential?
  4. Does your model include hinge functions? If so, explain their coefficient and plot their impact on the predicted response variable.
  5. Does your model include interactions? If so, pick the interaction effect that is most influential and explain the coefficient.


References

Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, 1–67.
Golub, Gene H, Michael Heath, and Grace Wahba. 1979. “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics 21 (2): 215–23.

  1. This is very similar to CART-like decision trees which you’ll be exposed to in a later module.↩︎