class: misk-title-slide

<br><br><br><br><br>
# .font150[Regularized Regression]

---
# Prerequisites

.pull-left[

.center.bold.font120[Packages]

```r
# Helper packages
library(recipes)    # for feature engineering
library(tidyverse)  # general data munging

# Modeling packages
library(glmnet)   # for implementing regularized regression
library(caret)    # for automating the tuning process
library(rsample)  # for sampling

# Model interpretability packages
library(vip)      # for variable importance
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
```

]

---
# The Idea

.font120[As *p* grows larger, there are three main issues we most commonly run into:]

1. Multicollinearity (we've already seen how PCR & PLS help to resolve this)
2. Insufficient solution ( `\(p \gg n\)` )
3. Interpretability
   - Approach 1: model selection
      - computationally infeasible (Ames data: `\(2^{80}\)` models to evaluate)
      - a feature is simply either in or out `\(\rightarrow\)` _hard thresholding_
   - Approach 2: regularize
      - retain all coefficients
      - slowly pushes a feature's effect towards zero `\(\rightarrow\)` _soft thresholding_

--

<br>
.center.bold.blue[Regularization helps with all three of these issues!]

---
# Regular regression

<br>

`\begin{equation}
\text{minimize} \bigg \{ SSE = \sum^n_{i=1} (y_i - \hat{y}_i)^2 \bigg \}
\end{equation}`

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
# Regular.red[ized] regression

<br>

`\begin{equation}
\text{minimize} \big \{ SSE + P \big \}
\end{equation}`

<br>

Modify the OLS objective function by adding a ___.red[P]enalty___ parameter, which

- constrains the magnitude of the coefficients
- progressively shrinks coefficients towards zero
- reduces the variability of the coefficients (pulls correlated coefficients together)
- can automate feature selection

.center.bold.blue[There are three variants of regularized regression]

---
# .red[Ridge] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} \beta_j^2 \bigg \}
\end{equation}`

* referred to as the `\(L_2\)` penalty
* pulls correlated features towards each other
* pushes coefficients to .red[near zero]
* retains .red[all] features

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/ridge-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]

---
# .red[Lasso] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} | \beta_j | \bigg \}
\end{equation}`

* referred to as the `\(L_1\)` penalty
* pulls correlated features towards each other
* pushes coefficients to .red[zero]
* performs .red[automated feature selection]

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/lasso-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]

---
# .red[Elastic net] regression

.pull-left[

Objective function:

`\begin{equation}
\text{minimize } \bigg \{ SSE + \lambda_1 \sum^p_{j=1} \beta_j^2 + \lambda_2 \sum^p_{j=1} | \beta_j | \bigg \}
\end{equation}`

* combines the `\(L_1\)` & `\(L_2\)` penalties
* provides the best of both worlds

]

.pull-right[

<img src="06-regularized-regression-slides_files/figure-html/elastic-net-coef-example-1.png" style="display: block; margin: auto;" />

<img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

]
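
As a minimal sketch (assuming the `X`/`Y` objects created later on the _Data prep_ slide), an elastic net can be fit with __glmnet__ by picking an `alpha` between 0 and 1:

```r
# a sketch, not original slide code: alpha strictly between 0 and 1
# mixes the L1 and L2 penalties (alpha = 0.5 is an even mix)
enet <- glmnet(x = X, y = Y, alpha = 0.5)
plot(enet, xvar = "lambda")
```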
src="06-regularized-regression-slides_files/figure-html/elastic-net-coef-example-1.png" style="display: block; margin: auto;" /> <img src="images/lambda.001.png" width="1753" style="display: block; margin: auto;" /> ] --- # Tuning .pull-left[ * .bold[lambda] - controls the magnitude of the penalty parameter - rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 0.1, 10, 100, 1000, 10000

* .bold[alpha]
   - controls the type of penalty (ridge, lasso, elastic net)
   - rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 0, .25, .50, .75, 1

]

.pull-right[

<br>

.center[.bold[Tip]: find the tuning parameters with:]

```r
caret::getModelInfo("glmnet")$glmnet$parameters
##   parameter   class                    label
## 1     alpha numeric        Mixing Percentage
## 2    lambda numeric Regularization Parameter
```

.center[Here, "glmnet" is the __caret__ method we are going to use]

]

---
# R packages 📦

.pull-left[

## [`glmnet`](https://cran.r-project.org/package=glmnet)

* original implementation of regularized regression in R
* fits linear, logistic & multinomial, Poisson, and Cox regression models
* extremely efficient procedures for fitting the entire lasso or elastic net regularization path

]

.pull-right[

## [h2o](https://cran.r-project.org/package=h2o) 💧

* Java-based interface
* automated feature pre-processing & validation procedures
* supports the following distributions: "gaussian", "binomial", "multinomial", "ordinal", "poisson", "gamma", "tweedie"

]

.center.bold[Other options exist (see the __Regularized and Shrinkage Methods__ section of the [Machine Learning task view](https://CRAN.R-project.org/view=MachineLearning)) but these are the preferred options.]

---
# Data prep

.pull-left[

glmnet only accepts the non-formula XY interface, so prior to modeling we need to

* separate our feature and target sets, and
* dummy encode our feature set

]

.pull-right[

```r
# Create training feature matrices
# we use model.matrix(...)[, -1] to discard the intercept
X <- model.matrix(Sale_Price ~ ., ames_train)[, -1]

# transform y with log transformation
Y <- log(ames_train$Sale_Price)
```

]

---
# glmnet

.bold[Pro Tip]: glmnet can auto-generate an appropriate sequence of λ values based on the data; the vast majority of the time you will have little need to adjust this default.

.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
ridge <- glmnet(
  x = X,
  y = Y,
  alpha = 0
)

plot(ridge, xvar = "lambda")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

]

.pull-right[

.center.bold[Lasso]

```r
lasso <- glmnet(
  x = X,
  y = Y,
  alpha = 1
)

plot(lasso, xvar = "lambda")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

]
]

---
# glmnet

* So which one is better?

--

* We can use `cv.glmnet` to provide cross-validated results

.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
ridge <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 0
)

plot(ridge)
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

]

.pull-right[

.center.bold[Lasso]

```r
lasso <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 1
)

plot(lasso)
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

]
]

---
# glmnet

* So which one is better?
* We can use `cv.glmnet` to provide cross-validated results
* The results are similar, but the lasso model performs feature selection `\(\rightarrow\)` it allows us to focus on only 62 features rather than 296!

.code70.scrollable90[

.pull-left[

.center.bold[Ridge]

```r
# Ridge model - minimum MSE
min(ridge$cvm)
## [1] 0.0193355

# Ridge model - lambda for this min MSE
ridge$lambda.min
## [1] 0.1674023

# Ridge model w/1-SE rule
ridge$cvm[ridge$lambda == ridge$lambda.1se]
## [1] 0.02073208

# Ridge model w/1-SE rule -- No. of coef | 1-SE MSE
ridge$nzero[ridge$lambda == ridge$lambda.1se]
## s69 
## 296
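# note: ridge keeps all 296 features even at lambda.1se; it shrinks
# coefficients toward zero but never sets them exactly to zero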
```

]

.pull-right[

.center.bold[Lasso]

```r
# Lasso model - minimum MSE
min(lasso$cvm)
## [1] 0.02028048

# Lasso model - lambda for this min MSE
lasso$lambda.min
## [1] 0.001880471

# Lasso model - w/1-SE rule
lasso$cvm[lasso$lambda == lasso$lambda.1se]
## [1] 0.02278907

# Lasso model w/1-SE rule -- No. of coef | 1-SE MSE
lasso$nzero[lasso$lambda == lasso$lambda.1se]
## s35 
## 62
```

]
]

---
# Grid search

Often the optimal model contains an alpha somewhere between 0 and 1, so we want to tune both the λ and the alpha parameters.

.scrollable90[

.pull-left[

```r
# tuning grid (shown for reference; note that the train() call below
# uses tuneLength = 10, which lets caret generate its own grid of
# alpha/lambda combinations rather than using hyper_grid)
hyper_grid <- expand.grid(
  alpha = seq(0, 1, by = .25),
  lambda = c(0.1, 10, 100, 1000, 10000)
)

# perform resampling
set.seed(123)
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

# best model
cv_glmnet$results %>%
  filter(
    alpha == cv_glmnet$bestTune$alpha,
    lambda == cv_glmnet$bestTune$lambda
  )
##   alpha     lambda      RMSE  Rsquared        MAE    RMSESD RsquaredSD       MAESD
## 1   0.1 0.02007101 0.1330084 0.8918911 0.08157713 0.0234813 0.03649193 0.004197479
```

]

.pull-right[

```r
# plot results
plot(cv_glmnet)
```

<img src="06-regularized-regression-slides_files/figure-html/cv-glmnet-plot-1.png" style="display: block; margin: auto;" />

]
]

---
# Comparing results to previous models

.pull-left[

* So how does this compare to our previous best model for the Ames data set?
* Keep in mind that for this module we log transformed the response variable (`Sale_Price`).
* Consequently, to provide a fair comparison to our previous model(s) we need to re-transform our predicted values.

]

.pull-right[

```r
# predict sales price on training data
pred <- predict(cv_glmnet, X)

# compute RMSE of re-transformed predictions
RMSE(exp(pred), exp(Y))
## [1] 21768.32
```

]

---
# Feature interpretation

```r
vip(cv_glmnet, num_features = 20, geom = "point")
```

<img src="06-regularized-regression-slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Feature interpretation

<img src="06-regularized-regression-slides_files/figure-html/regularized-top4-pdp-1.png" style="display: block; margin: auto;" />
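
Partial dependence plots like the ones above can be produced along these lines (a sketch, not the original slide code; `Gr_Liv_Area` is an illustrative choice, substitute the top features from the `vip` output):

```r
# minimal sketch using the pdp package (assumed available); repeat for
# each of the most influential features to recreate the panel above
library(pdp)

partial(cv_glmnet, pred.var = "Gr_Liv_Area", grid.resolution = 20) %>%
  autoplot()
```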
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/misk-data-science/misk-homl)

.center[https://github.com/misk-data-science/misk-homl]