class: misk-title-slide

<br><br><br><br><br>
# .font150[Regularized Regression]

---
# Prerequisites

.pull-left[

.center.bold.font120[Packages]

```r
# Helper packages
library(recipes)    # for feature engineering
library(tidyverse)  # general data munging

# Modeling packages
library(glmnet)   # for implementing regularized regression
library(caret)    # for automating the tuning process
library(rsample)  # for sampling

# Model interpretability packages
library(vip)      # for variable importance
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
```

]

---
# The Idea

.font120[As *p* grows larger, there are three main issues we most commonly run into:]

1. Multicollinearity (we've already seen how PCR & PLS help to resolve this)
2. Insufficient solution ( `\(p \gg n\)` )
3. Interpretability
   - Approach 1: model selection
      - computationally infeasible (Ames data: `\(2^{80}\)` models to evaluate)
      - a feature is simply either in or out `\(\rightarrow\)` _hard thresholding_
   - Approach 2: regularize
      - retain all coefficients
      - slowly pushes a feature's effect towards zero `\(\rightarrow\)` _soft thresholding_

--

<br>
.center.bold.blue[Regularization helps with all three of these issues!]

---
# Regular regression

<br>

`\begin{equation}
\text{minimize} \bigg \{ SSE = \sum^n_{i=1} (y_i - \hat{y}_i)^2 \bigg \}
\end{equation}`

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
# Regular.red[ized] regression

<br>

`\begin{equation}
\text{minimize} \big \{ SSE + P \big \}
\end{equation}`

<br>

Modify the OLS objective function by adding a ___.red[P]enalty___ parameter, which

- constrains the magnitude of the coefficients
- progressively shrinks coefficients towards zero
- reduces the variability of the coefficients (pulls correlated coefficients together)
- can automate feature selection

.center.bold.blue[There are three variants of regularized regression]

---
# .red[Ridge] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} \beta_j^2 \bigg \}
\end{equation}`

* referred to as the `\(L_2\)` penalty
* pulls correlated features towards each other
* pushes coefficients to .red[near zero]
* retains .red[all] features

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/ridge-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]

---
# .red[Lasso] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} | \beta_j | \bigg \}
\end{equation}`

* referred to as the `\(L_1\)` penalty
* pulls correlated features towards each other
* pushes coefficients to .red[zero]
* performs .red[automated feature selection]

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/lasso-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]

---
# .red[Elastic net] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda_1 \sum^p_{j=1} \beta_j^2 + \lambda_2 \sum^p_{j=1} | \beta_j | \bigg \}
\end{equation}`

* combines the `\(L_1\)` & `\(L_2\)` penalties
* provides the best of both worlds

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/elastic-net-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]
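
As a minimal sketch (assuming the `X`/`Y` objects created later on the _Data prep_ slide), an elastic net can be fit with __glmnet__ by picking an `alpha` between 0 and 1:

```r
# a sketch, not original slide code: alpha strictly between 0 and 1
# mixes the L1 and L2 penalties (alpha = 0.5 is an even mix)
enet <- glmnet(x = X, y = Y, alpha = 0.5)
plot(enet, xvar = "lambda")
```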
src="06-regularized-regression-slides_files/figure-html/elastic-net-coef-example-1.png" style="display: block; margin: auto;" /> <img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" /> ] --- # Tuning .pull-left[ * .bold[lambda] - controls the magnitude of the penalty parameter - rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 0.1, 10, 100, 1000, 10000

* .bold[alpha]
   - controls the type of penalty (ridge, lasso, elastic net)
   - rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 0, .25, .50, .75, 1

]

.pull-right[

<br>

.center[.bold[Tip]: find the tuning parameters with:]

```r
caret::getModelInfo("glmnet")$glmnet$parameters
##   parameter   class                    label
## 1     alpha numeric        Mixing Percentage
## 2    lambda numeric Regularization Parameter
```

.center[Here, "glmnet" is the __caret__ method we are going to use]

]

---
# R packages 📦

.pull-left[

## [`glmnet`](https://cran.r-project.org/package=glmnet)

* original implementation of regularized regression in R
* fits linear, logistic & multinomial, Poisson, and Cox regression models
* extremely efficient procedures for fitting the entire lasso or elastic net regularization path

]

.pull-right[

## [h2o](https://cran.r-project.org/package=h2o) 💧

* Java-based interface
* automated feature pre-processing & validation procedures
* supports the following distributions: "gaussian", "binomial", "multinomial", "ordinal", "poisson", "gamma", "tweedie"

]

.center.bold[Other options exist (see the __Regularized and Shrinkage Methods__ section of the [Machine Learning task view](https://CRAN.R-project.org/view=MachineLearning)) but these are the preferred options.]

---
# Data prep

.pull-left[

glmnet only accepts the non-formula XY interface, so prior to modeling we need to

* separate our feature and target sets, and
* dummy encode our feature set

]

.pull-right[

```r
# Create training feature matrices
# we use model.matrix(...)[, -1] to discard the intercept
X <- model.matrix(Sale_Price ~ ., ames_train)[, -1]

# transform y with log transformation
Y <- log(ames_train$Sale_Price)
```

]

---
# glmnet

.bold[Pro Tip]: glmnet can auto-generate an appropriate sequence of λ values based on the data; the vast majority of the time you will have little need to adjust this default.

.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
ridge <- glmnet(
  x = X,
  y = Y,
  alpha = 0
)

plot(ridge, xvar = "lambda")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

]

.pull-right[

.center.bold[Lasso]

```r
lasso <- glmnet(
  x = X,
  y = Y,
  alpha = 1
)

plot(lasso, xvar = "lambda")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

]
]

---
# glmnet

* So which one is better?

--

* We can use `cv.glmnet` to provide cross-validated results

.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
ridge <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 0
)

plot(ridge)
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

]

.pull-right[

.center.bold[Lasso]

```r
lasso <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 1
)

plot(lasso)
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

]
]

---
# glmnet

* So which one is better?
* We can use `cv.glmnet` to provide cross-validated results
* The results are similar, but the lasso model performs feature selection `\(\rightarrow\)` it allows us to focus on only 62 features rather than 296!

.code70.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
# Ridge model - minimum MSE
min(ridge$cvm)
## [1] 0.0193355

# Ridge model - lambda for this min MSE
ridge$lambda.min
## [1] 0.1674023

# Ridge model w/1-SE rule
ridge$cvm[ridge$lambda == ridge$lambda.1se]
## [1] 0.02073208

# Ridge model w/1-SE rule -- No. of coef | 1-SE MSE
ridge$nzero[ridge$lambda == ridge$lambda.1se]
## s69 
## 296
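# note: ridge keeps all 296 features even at lambda.1se; it shrinks
# coefficients toward zero but never sets them exactly to zero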
```

]

.pull-right[

.center.bold[Lasso]

```r
# Lasso model - minimum MSE
min(lasso$cvm)
## [1] 0.02028048

# Lasso model - lambda for this min MSE
lasso$lambda.min
## [1] 0.001880471

# Lasso model - w/1-SE rule
lasso$cvm[lasso$lambda == lasso$lambda.1se]
## [1] 0.02278907

# Lasso model w/1-SE rule -- No. of coef | 1-SE MSE
lasso$nzero[lasso$lambda == lasso$lambda.1se]
## s35 
## 62
```

]
]

---
# Grid search

Often the optimal model contains an alpha somewhere between 0 and 1, so we want to tune both the λ and the alpha parameters.

.scrollable90[

.pull-left[

```r
# tuning grid (shown for reference; note that the train() call below
# uses tuneLength = 10, which lets caret generate its own grid of
# alpha/lambda combinations rather than using hyper_grid)
hyper_grid <- expand.grid(
  alpha = seq(0, 1, by = .25),
  lambda = c(0.1, 10, 100, 1000, 10000)
)

# perform resampling
set.seed(123)
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

# best model
cv_glmnet$results %>%
  filter(
    alpha == cv_glmnet$bestTune$alpha,
    lambda == cv_glmnet$bestTune$lambda
  )
##   alpha     lambda      RMSE  Rsquared        MAE    RMSESD RsquaredSD       MAESD
## 1   0.1 0.02007101 0.1330084 0.8918911 0.08157713 0.0234813 0.03649193 0.004197479
```

]

.pull-right[

```r
# plot results
plot(cv_glmnet)
```

<img src="06-regularized-regression-slides_files/figure-html/cv-glmnet-plot-1.png" style="display: block; margin: auto;" />

]
]

---
# Comparing results to previous models

.pull-left[

* So how does this compare to our previous best model for the Ames data set?
* Keep in mind that for this module we log transformed the response variable (`Sale_Price`).
* Consequently, to provide a fair comparison to our previous model(s) we need to re-transform our predicted values.

]

.pull-right[

```r
# predict sales price on training data
pred <- predict(cv_glmnet, X)

# compute RMSE of re-transformed predictions
RMSE(exp(pred), exp(Y))
## [1] 21768.32
```

]

---
# Feature interpretation

```r
vip(cv_glmnet, num_features = 20, geom = "point")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Feature interpretation

<img src="06-regularized-regression-slides_files/figure-html/regularized-top4-pdp-1.png" style="display: block; margin: auto;" />
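
Partial dependence plots like the ones above can be produced along these lines (a sketch, not the original slide code; `Gr_Liv_Area` is an illustrative choice, substitute the top features from the `vip` output):

```r
# minimal sketch using the pdp package (assumed available); repeat for
# each of the most influential features to recreate the panel above
library(pdp)

partial(cv_glmnet, pred.var = "Gr_Liv_Area", grid.resolution = 20) %>%
  autoplot()
```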
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/misk-data-science/misk-homl)

.center[https://github.com/misk-data-science/misk-homl]