Gradient Boosting Machines

class: misk-title-slide

# .font140[Gradient Boosting Machines]

---
# Introduction

.pull-left[

.center.bold.font120[Thoughts]

- Extremely popular

- One of the leading methods in prediction competitions

- Boosted trees similar to, but quite different than, RFs <img src="images/headpound_bunny.gif" style="height:1.5em; width:auto; "/>

- Math isn't that complicated until you want to generalize to all loss functions

]

.pull-right[

.center.bold.font120[Overview]

- Fundamental differences between RFs and GBMs

- Basic GBM

- Stochastic GBM

- XGBoost

]

---
# Prereqs .red[ code chunk 1]

.pull-left[

.center.bold.font120[Packages]

```r
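library(gbm)      # basic and stochastic GBMs
library(xgboost)  # extreme gradient boosting
library(vip)      # variable importance plots
library(pdp)      # partial dependence plots
library(dplyr)    # arrange() and %>%
library(purrr)    # pmap_dbl()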
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- rsample::initial_split(ames, strata = "Sale_Price")
ames_train <- rsample::training(split)
```

]

---
class: misk-section-slide

.bold.font250[Technicalities]

---
# Decision Trees

.pull-left[

* Many benefits <img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
 - .green[minimal preprocessing]
 - .green[can handle any data type]
 - .green[automatically captures interactions]
 - .green[scales well to large data]
 - .green[(can be) easy to interpret]
 
* A few significant weaknesses <img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1em; width:auto; "/> 
 - .red[large trees hard to interpret]
 - .red[trees are step functions] (i.e., binary splits)
 - .red[single trees typically have poor predictive accuracy]
 - .red[single trees have high variance] (easy to overfit to training data)

]

.pull-right[

]

---
# Bagging

.pull-left[

* Benefits <img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
 - .green[takes advantage of a deep, single tree's high variance]
 - .green[wisdom of the crowd reduces prediction error]
 - .green[fast (typically only requires 50-100 trees)]

* Weaknesses <img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1em; width:auto; "/> 
 - .red[tree correlation]
 - .red[reduces tree diversity and, therefore,]
 - .red[limits prediction error improvement]

]

.pull-right[

]

---
# Random Forests

.pull-left[

* Many benefits <img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
 - .green[all the benefits of individual trees and bagging plus...]
 - .green[split-variable randomization reduces tree correlation]
 - .green[typically results in reduced prediction error compared to bagging]
 - .green[good out-of-box performance]
 
* Weaknesses <img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1em; width:auto; "/> 
 - .red[Although accurate, often cannot compete with the accuracy of advanced boosting algorithms.]
 - .red[Can become slow on large data sets.]

]

.pull-right[

]

---
# How boosting works

.pull-left[

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.

]

.pull-right[

]

---
# How boosting works

.pull-left[

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new .blue.bold[weak], base-learner model is trained with respect to the error of the whole ensemble learnt so far.

]

.pull-right[

A weak model:

* one whose error rate is only slightly better than random guessing

* each step slightly improves the remaining errors

* commonly, trees with only 1-6 splits are used

* Benefits of weak models
   - speed
   - accuracy improvement
   - can avoid overfitting

]

---
# How boosting works

.pull-left[

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, .blue.bold[base-learner model] is trained with respect to the error of the whole ensemble learnt so far.

]

.pull-right[

Base-learning models:

* boosting is a framework that iteratively improves any weak learning model

* many gradient boosting applications allow you to “plug in” various classes of weak learners at your disposal

* in practice, however, boosted algorithms almost always use decision trees as the base-learner

]

---
# How boosting works

.pull-left[

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is .blue.bold[trained with respect to the error] of the whole ensemble learnt so far.

]

.pull-right[

Sequential training with respect to errors:

* boosted trees are grown sequentially; each tree is grown using information from previously grown trees.

1. Fit a decision tree to the data: `$F_1(x) = y$`,
2. Fit the next decision tree to the residuals of the previous one: `$h_1(x) = y - F_1(x)$`,
3. Add this new tree to our algorithm: `$F_2(x) = F_1(x) + h_1(x)$`,
4. Fit the next decision tree to the residuals of `$F_2$`: `$h_2(x) = y - F_2(x)$`,
5. Add this new tree to our algorithm: `$F_3(x) = F_2(x) + h_2(x)$`,
6. Continue this process until some mechanism (e.g., cross-validation) tells us to stop.

]
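
The steps above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the `gbm` implementation: the data, `n_trees`, and `shrinkage` are hypothetical choices, and the sketch scales each tree by a learning rate (an addition to the plain recipe above).

```r
library(rpart)  # recommended package that ships with R

set.seed(123)
n  <- 200
x  <- runif(n, 0, 10)
y  <- sin(x) + rnorm(n, sd = 0.3)   # simulated (hypothetical) data
df <- data.frame(x = x)

n_trees   <- 100   # hypothetical settings, for illustration only
shrinkage <- 0.1

pred <- rep(mean(y), n)   # start from a constant prediction
mse  <- numeric(n_trees)  # training MSE after each iteration

for (m in seq_len(n_trees)) {
  df$resid <- y - pred    # residuals of the ensemble built so far
  # weak learner: a single-split tree (stump) fit to the residuals
  stump <- rpart(resid ~ x, data = df,
                 control = rpart.control(maxdepth = 1, cp = 0))
  # add the (shrunken) new tree to the ensemble
  pred   <- pred + shrinkage * predict(stump, df)
  mse[m] <- mean((y - pred)^2)
}

range(mse)  # training error shrinks as trees are added
```

Training error keeps falling here because the loop never stops chasing residuals; held-out error is what tells us when to stop.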

---
# How boosting works

We call this sequential training .blue.bold[additive model ensembling] where each iteration gradually nudges our predicted values closer to the target.

.pull-left[

$$
`\begin{aligned}
 \hat y & = f_0(x) + \triangle_1(x) + \triangle_2(x) + \cdots + \triangle_M(x)  \\
        & = f_0(x) + \sum^M_{m=1} \triangle_m(x) \\
        & = F_M(x)
\end{aligned}`
$$

Also written as...

$$
`\begin{aligned}
 F_0(x) & = f_0(x) \\
 F_m(x) & = F_{m-1}(x) + \triangle_m(x)
\end{aligned}`
$$

]

.pull-right[

.font60.right[Image: [Terence Parr & Jeremy Howard](https://explained.ai/gradient-boosting/L2-loss.html)]

]
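
Why fitting residuals counts as *gradient* boosting: a short worked step, assuming squared-error loss (an assumption made here for illustration; other losses give other pseudo-residuals).

$$
`\begin{aligned}
 L\left(y, F(x)\right) & = \tfrac{1}{2}\left(y - F(x)\right)^2 \\
 -\frac{\partial L\left(y, F(x)\right)}{\partial F(x)} & = y - F(x)
\end{aligned}`
$$

Under this loss the residual is the negative gradient, so each `$\triangle_m(x)$` fit to the residuals of `$F_{m-1}(x)$` is a step downhill on the loss.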

---
# How boosting works

.pull-left[

]

.pull-right[

]

---
# Boosting > Random Forest > Bagging > Single Tree

.pull-left[

.center.font120.blue[Typically, this allows us to eke out additional predictive performance!]

]

.pull-right[

]

---
class: misk-section-slide

.bold.font250[Basic GBM]

---
# Basic GBM

.pull-left[

.bold.font110[[gbm](https://github.com/gbm-developers/gbm)]
- The original R implementation of GBMs (by Greg Ridgeway)
- Slower than modern implementations (but still pretty fast)
- Provides OOB error estimate
- Supports the weighted tree traversal method for fast construction of PDPs

.bold.font110[[gbm3](https://github.com/gbm-developers/gbm3)]
- Shiny new version of gbm that is not backwards compatible
- Faster and supports parallel tree building
- Not currently listed on CRAN

]

---
# Basic GBM .red[ code chunk 2]

.pull-left[

.opacity20[   
.bold.font110[[gbm3](https://github.com/gbm-developers/gbm3)]
- Shiny new version of gbm that is not backwards compatible
- Faster and supports parallel tree building
- Not currently listed on CRAN
]
]

.pull-right[
.center.bold.font90[Let's run your first GBM model]

```r
set.seed(123)
ames_gbm <- gbm(
 formula = Sale_Price ~ .,
 data = ames_train,
 distribution = "gaussian", # or bernoulli, multinomial, etc. 
 n.trees = 5000, 
 shrinkage = 0.1, 
 interaction.depth = 1, 
 n.minobsinnode = 10, 
 cv.folds = 5 
 )

# find index for n trees with minimum CV error
min_MSE <- which.min(ames_gbm$cv.error)

# get MSE and compute RMSE
sqrt(ames_gbm$cv.error[min_MSE])
## [1] 26825.21
```

.center.bold.font90[ This model run takes ~30 secs ]

]

---
# What's going on?

.pull-left.font90[

* .bold[`distribution`]: distribution of the response variable; `gbm` will make an intelligent guess if it is not specified

* .bold[`n.trees`]: number of sequential trees to fit

* .bold[`shrinkage`]: how quickly do we improve on each iteration (aka _learning rate_)

* .bold[`interaction.depth`]: how weak of a learner do we want

* .bold[`n.minobsinnode`]: minimum number of observations in the trees' terminal nodes

* .bold[`cv.folds`]: _k_-fold cross validation

]

.pull-right[

.opacity20.center.bold.font90[Let's run your first GBM model]

```r
set.seed(123)
ames_gbm <- gbm(
 formula = Sale_Price ~ .,
 data = ames_train,
* distribution = "gaussian", # or bernoulli, multinomial, etc.
* n.trees = 5000,
* shrinkage = 0.1,
* interaction.depth = 1,
* n.minobsinnode = 10,
* cv.folds = 5
 )

# find index for n trees with minimum CV error
min_MSE <- which.min(ames_gbm$cv.error)

# get MSE and compute RMSE
sqrt(ames_gbm$cv.error[min_MSE])
## [1] 26825.21
```

.opacity20.center.bold.font90[ This model run takes ~30 secs ]

]

---
# What's going on?

.pull-left.font90[

.bold.center[Tunable Hyperparameters]

* .opacity20[`distribution`: distribution of the response variable; `gbm` will make an intelligent guess if it is not specified]

* .bold[`n.trees`]: number of sequential trees to fit

* .bold[`shrinkage`]: how quickly do we improve on each iteration (aka _learning rate_)

* .bold[`interaction.depth`]: how weak of a learner do we want

* .bold[`n.minobsinnode`]: minimum number of observations in the trees' terminal nodes

* .opacity20[`cv.folds`: _k_-fold cross validation]

]

.pull-right[

.opacity20.center.bold.font90[Let's run your first GBM model]

```r
# find index for n trees with minimum CV error
min_MSE <- which.min(ames_gbm$cv.error)

# get MSE and compute RMSE
sqrt(ames_gbm$cv.error[min_MSE])
## [1] 26825.21
```

]

---
# Tuning

In contrast to Random Forests, GBMs .bold.red[do not] provide good "out-of-the-box" performance!

We can divide hyperparameters into 2 primary categories:

.pull-left[

.center.bold[Boosting Parameters]

- Number of trees

- Learning rate

- More to come!

]

.pull-right[

.center.bold[Tree-specific Parameters]

- Tree depth

- Minimum obs in terminal node

- And others

]

---
# Boosting hyperparameters

.pull-left[

.blue.bold[Number of trees]

- The averaging in bagging and RF makes it very difficult to overfit with too many trees

- GBMs will chase residuals as long as you allow them to

- Consequently:
   - We must provide enough trees to minimize error
   - But not so many that we begin to overfit

]

.pull-right[

]

---
# Boosting hyperparameters .red[ code chunk 3]

.pull-left[

.blue.bold[Number of trees]

- The averaging in bagging and RF makes it very difficult to overfit with too many trees

- GBMs will chase residuals as long as you allow them to

- Consequently:
   - We must provide enough trees to minimize error
   - But not so many that we begin to overfit
   - .red[plus, number of trees is dependent on other hyperparameters]

.center.bold.blue[Use CV or OOB]

]

.pull-right[

```r
gbm.perf(ames_gbm, method = "cv") # or "OOB"
```

```
## [1] 1550
```

.center.bold.blue[Use CV or OOB]

]
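
A minimal sketch of putting the selected tree count to use, on simulated data rather than the Ames model (the `sim` data frame and all settings below are hypothetical). It picks the CV-optimal number of trees without plotting and predicts with only that many trees.

```r
library(gbm)

set.seed(123)
n   <- 500
sim <- data.frame(x1 = runif(n), x2 = runif(n))   # hypothetical data
sim$y <- 3 * sim$x1 + sin(6 * sim$x2) + rnorm(n, sd = 0.2)

fit <- gbm(
  formula = y ~ x1 + x2,
  data = sim,
  distribution = "gaussian",
  n.trees = 500,
  shrinkage = 0.1,
  interaction.depth = 2,
  cv.folds = 5,
  n.cores = 1          # keep the sketch single-threaded
)

# CV-optimal number of trees, without drawing the error curve
best <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# predict using only the first `best` trees
pred <- predict(fit, newdata = sim, n.trees = best)
```

Passing `n.trees = best` to `predict()` matters: the fitted object still holds all 500 trees, including those added after CV error bottomed out.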

---
# Boosting hyperparameters

.pull-left[

.blue.bold[Learning rate] (aka shrinkage)

.font120[
- Determines the impact of each tree on the final outcome
]

]

.pull-right[

]

---
# Boosting hyperparameters

.pull-left[

.blue.bold[Learning rate] (aka shrinkage)

- Determines the impact of each tree on the final outcome

- .red[Too large of a learning rate will have poor predictive capability]

- Lower values are generally preferred:
   - .green[they make the model robust to the specific characteristics of each tree, allowing it to generalize well]
   - .green[easier to stop prior to overfitting]
   - .red[but run the risk of not reaching the optimum]
   - .red[are more computationally demanding]

]

.pull-right[

]

---
# Boosting hyperparameters

.pull-left[

.blue.bold[Learning rate] (aka shrinkage)

- Determines the impact of each tree on the final outcome

- .red[Too large of a learning rate will have poor predictive capability]

- Lower values are generally preferred (.01 - .1):
   - .green[they make the model robust to the specific characteristics of each tree, allowing it to generalize well]
   - .green[easier to stop prior to overfitting]
   - .red[but run the risk of not reaching the optimum]
   - .red[are more computationally demanding]
   - .bold[Requires more trees!]

]

.pull-right[

]
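
A rough sketch of why lower learning rates require more trees, using boosted stumps on simulated data. Everything here is a hypothetical illustration: `trees_needed()` is a helper written for this slide, and the data, target error, and learning rates are arbitrary. It counts the trees needed to reach a fixed training MSE.

```r
library(rpart)

# number of boosted stumps needed to reach a target training MSE
trees_needed <- function(x, y, shrinkage, target_mse, max_trees = 2000) {
  df   <- data.frame(x = x)
  pred <- rep(mean(y), length(y))
  for (m in seq_len(max_trees)) {
    df$resid <- y - pred
    stump <- rpart(resid ~ x, data = df,
                   control = rpart.control(maxdepth = 1, cp = 0))
    pred <- pred + shrinkage * predict(stump, df)
    if (mean((y - pred)^2) <= target_mse) return(m)
  }
  NA_integer_  # target not reached within max_trees
}

set.seed(123)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fast <- trees_needed(x, y, shrinkage = 0.5,  target_mse = 0.2)
slow <- trees_needed(x, y, shrinkage = 0.05, target_mse = 0.2)
c(fast = fast, slow = slow)
```

Training error is used only to keep the sketch short; in practice the comparison should be made on cross-validated error.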

---
# Tree-specific hyperparameters

.pull-left[

.blue.bold[Tree depth]

- controls over-fitting
- higher depth captures unique interactions
- but runs risk of over-fitting
- smaller depths (e.g., stumps) are computationally efficient (but .bold[require more trees!])
- typical values: 3-8
 - larger _n_ or _p_ can tolerate deeper trees

.blue.bold[Min obs in terminal nodes]

- controls over-fitting
- higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree
- typically have small impact on performance
- smaller values can help with imbalanced classes

]

.pull-right[

]

---
# Tuning strategy

.font120[
1. Choose a relatively high learning rate. Generally the default value of 0.1 works, but somewhere between 0.05 and 0.2 should work for different problems

2. Determine the optimum number of trees for this learning rate.

3. Tune learning rate and assess speed vs. performance

4. Tune tree-specific parameters for decided learning rate and number of trees.

5. Lower the learning rate and increase the number of trees proportionally to get more robust models.
]

---

# Tuning strategy .red[ code chunk 4]

.scrollable90[
.pull-left[
 
.font110[
1. fix tree hyperparameters
 - moderate tree depth
 - default min obs
2. set our learning rate at .01
3. increase CV to ensure unbiased error estimate
4. Results
 - Lowest error rate yet ($ 22,609.14)!
 - Used nearly all our trees `$\rightarrow$` increase to 6000?
 - took `$\approx$` 2.25 min
5. Compared to learning rate of .001
 - error rate of $26,952.78
 - took `$\approx$` 3 min
]
]

.pull-right[

.center.bold.font90[ This model run takes ~2 mins ]

```r
set.seed(123)
ames_gbm1 <- gbm(
 formula = Sale_Price ~ .,
 data = ames_train,
* distribution = "gaussian", # or bernoulli, multinomial, etc.
* n.trees = 5000,
* shrinkage = 0.01,
* interaction.depth = 3,
* n.minobsinnode = 10,
* cv.folds = 10
 )

# find index for n trees with minimum CV error
min_MSE <- which.min(ames_gbm1$cv.error)

# get MSE and compute RMSE
sqrt(ames_gbm1$cv.error[min_MSE])
## [1] 22609.14

gbm.perf(ames_gbm1, method = "cv")
```

```
## [1] 4994
```

]
]

---
# Tuning strategy .red[ code chunk 5]

.scrollable90[
.pull-left[

Now let's tune the tree-specific hyperparameters

* we could do it in `caret` but let's use functional programming

* assess 3 values for tree depth

* assess 3 values for min obs in terminal node

]

.pull-right[
.center[ ]
.center.font90.bold[This grid search takes ~30 mins; remember, I said the ML process is more of a marathon than a sprint!!]
.center[ ]

```r
# search grid
hyper_grid <- expand.grid(
 n.trees = 6000,
 shrinkage = .01,
* interaction.depth = c(3, 5, 7),
* n.minobsinnode = c(5, 10, 15)
)

model_fit <- function(n.trees, shrinkage, interaction.depth, n.minobsinnode) {
 set.seed(123)
 m <- gbm(
 formula = Sale_Price ~ .,
 data = ames_train,
 distribution = "gaussian",
 n.trees = n.trees,
* shrinkage = shrinkage,
* interaction.depth = interaction.depth,
 n.minobsinnode = n.minobsinnode,
 cv.folds = 10
 )
 # compute RMSE
 sqrt(min(m$cv.error))
}

hyper_grid$rmse <- pmap_dbl(
 hyper_grid,
 ~ model_fit(
 n.trees = ..1,
 shrinkage = ..2,
 interaction.depth = ..3,
 n.minobsinnode = ..4
 )
)

arrange(hyper_grid, rmse)
##   n.trees shrinkage interaction.depth n.minobsinnode     rmse
## 1    6000      0.01                 7              5 21835.03
## 2    6000      0.01                 5             10 22030.60
## 3    6000      0.01                 5              5 22036.20
## 4    6000      0.01                 5             15 22049.48
## 5    6000      0.01                 7             10 22077.24
## 6    6000      0.01                 3             10 22397.88
## 7    6000      0.01                 3             15 22411.68
## 8    6000      0.01                 7             15 22455.38
## 9    6000      0.01                 3              5 22525.67
```

]]

---
class: misk-section-slide

.bold.font250[Stochastic GBM]

---
# Adding randomness

- [Friedman (1999)](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf) introduced stochastic gradient boosting

- A big insight from bagging ensembles and random forests: creating trees from random subsamples of the training data reduces tree correlation, and the same idea applies to sequential trees

- Improves computational time since we're reducing *n*

- A few variants of stochastic boosting that can be used:
   - Subsample rows before creating each tree (__gbm__)
   - Subsample columns before creating each tree (__h2o__ & __xgboost__)
   - Subsample columns before considering each split (__h2o__ & __xgboost__)

- .blue.bold[Pro tip]: Generally, aggressive sub-sampling, such as selecting only 50% of the data, has been shown to be beneficial. Typical values: 0.5-0.8
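
As a rough guide, the three variants above map onto these __xgboost__ parameters. The values are hypothetical placeholders for illustration, not tuned recommendations.

```r
# hypothetical values, for illustration only
stochastic_params <- list(
  subsample        = 0.8,  # fraction of rows sampled before each tree
  colsample_bytree = 0.8,  # fraction of columns sampled before each tree
  colsample_bynode = 0.8   # fraction of columns sampled before each split
)
```

In __gbm__, the row-subsampling counterpart is the `bag.fraction` argument.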

---
# Applying .red[ code chunk 6]

.pull-left[

- start by assessing if values between 0.5-0.8 outperform your previous best model

- zoom in with a second round of tuning

- if smaller values perform better, that tells you overfitting was occurring

]

.pull-right[

```r
*bag_frac <- c(.5, .65, .8)

for(i in bag_frac) {
 set.seed(123)
 m <- gbm(
 formula = Sale_Price ~ .,
 data = ames_train,
 distribution = "gaussian",
 n.trees = 6000, 
 shrinkage = 0.01, 
 interaction.depth = 7, 
 n.minobsinnode = 5,
* bag.fraction = i,
 cv.folds = 10 
 )
 # compute RMSE
 print(sqrt(min(m$cv.error)))
}
## [1] 21835.03
## [1] 21688.16
## [1] 22064.78
```

]

---
class: misk-section-slide

.bold.font250[Extreme Gradient Boosting]

---
# XGBoost Advantage

Extreme Gradient Boosting (XGBoost) provides a few advantages over traditional boosting:

- .bold[Regularization]: unlike standard GBM implementations, XGBoost includes regularization terms, which help reduce overfitting.

- .bold[Parallel Processing]: GPU and Spark compatible

- .bold[Loss functions]: allows users to define custom optimization objectives and evaluation criteria

- .bold[Tree pruning]: grows each tree to the specified max depth and then prunes it back; uses the weakest learner required

- .bold[Early stopping]: stop model assessment when additional trees offer no improvement

- .bold[Continue existing model]: users can resume training an XGBoost model from the last iteration of a previous run

.center.bold.blue[Super powerful...super <img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1em; width:auto; "/> awesome!]

---
# Prereqs .red[ code chunk 7]

.pull-left[

* __xgboost__ requires a numeric feature matrix, so categorical features must be encoded (e.g., one-hot or label encoded)

* __caret__ and __h2o::h2o.xgboost__ can automate this for you

* In this preprocessing I:
   - collapse low frequency levels to "other"
   - convert categorical features to integers (aka label encode)

]

.pull-right[

```r
library(recipes)
xgb_prep <- recipe(Sale_Price ~ ., data = ames_train) %>%
 step_other(all_nominal(), threshold = .005) %>%
 step_integer(all_nominal()) %>%
 prep(training = ames_train, retain = TRUE) %>%
 juice()

X <- as.matrix(xgb_prep[setdiff(names(xgb_prep), "Sale_Price")])
Y <- xgb_prep$Sale_Price
```

]

.center.bold[.blue[Pro tip:] If you have high-cardinality categorical features, label or ordinal encoding often improves performance and speed!]
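
To make the two encodings concrete, here is a small base R sketch on a toy data frame (`toy` and its columns are hypothetical, not part of the Ames data). Label encoding keeps one column per feature; one-hot encoding adds a column per factor level.

```r
# toy data with one categorical and one numeric feature (hypothetical)
toy <- data.frame(
  neighborhood = factor(c("A", "B", "C", "A")),
  sqft         = c(1200, 1500, 900, 1100)
)

# label encoding: each factor level becomes its integer code
label_encoded <- data.matrix(toy)

# one-hot encoding: one indicator column per factor level
one_hot <- model.matrix(~ . - 1, data = toy)

dim(label_encoded)  # 4 rows, 2 columns
dim(one_hot)        # 4 rows, 4 columns
```

With many levels per feature, the one-hot matrix grows much wider, which is where the speed difference comes from.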

---
# First XGBoost model .red[ code chunk 8]

.pull-left.font90[

* .bold[`nrounds`]: 5,000 trees

* .bold[`objective`]: `reg:linear` for regression (renamed `reg:squarederror` in newer __xgboost__ releases) but other options exist (e.g., `reg:logistic`, `binary:logistic`, `multi:softmax`)

* .bold[`early_stopping_rounds`]: stop training if CV RMSE doesn't improve for 50 trees in a row

* .bold[`nfold`]: 10-fold CV

.center.bold[What's up with the results <img src="https://emojis.slackmojis.com/emojis/images/1542340469/4974/notinterested.gif" style="height:2.5em; width:auto; "/>!!!]

]

.pull-right[

.center.bold.font90[ This model run takes ~20 secs ]

```r
set.seed(123)
ames_xgb <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 5000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0
 )

ames_xgb$evaluation_log %>% tail()
##    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
## 1:  104        2129.095       154.6692       24304.04      3428.906
## 2:  105        2093.833       160.5231       24300.29      3429.206
## 3:  106        2058.820       147.4346       24296.37      3428.053
## 4:  107        2015.093       140.3247       24299.15      3426.006
## 5:  108        1990.135       141.1260       24295.42      3427.259
## 6:  109        1971.045       142.7896       24293.17      3425.749
```

]
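
The early stopping rule can be sketched in plain R. This is an illustration of the idea only, not __xgboost__'s internal code; `early_stop()` and the error values are hypothetical.

```r
# stop once the error has not improved for `patience` consecutive rounds
early_stop <- function(errors, patience) {
  best       <- Inf
  best_iter  <- 0
  stopped_at <- 0
  for (i in seq_along(errors)) {
    stopped_at <- i
    if (errors[i] < best) {
      best      <- errors[i]   # new best error: reset the clock
      best_iter <- i
    } else if (i - best_iter >= patience) {
      break                    # no improvement for `patience` rounds
    }
  }
  list(best_iter = best_iter, stopped_at = stopped_at)
}

cv_rmse <- c(10, 8, 7, 7.5, 7.2, 7.1, 7.3, 6.0)  # hypothetical CV errors
early_stop(cv_rmse, patience = 3)
# best_iter = 3, stopped_at = 6
```

Note that the rule never sees the later improvement at round 8; a larger `patience` trades extra compute for a lower chance of stopping too soon.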

---
# Tuning .red[ code chunk 9]

.pull-left.font110[
 
1. Crank up the trees and tune learning rate with early stopping
 - initial test RMSE results:
 - .red[`eta = .3` (default): 24,246 w/59 trees (< 1 min)]
 - .red[`eta = .1`: 23,353 w/365 trees (< 1 min)]
 - .green[`eta = .05`: 22,835 w/658 trees (1.5 min)]
 - .red[`eta = .01`: 22,854 w/2359 trees (4 min)]
 
 
.center.font80[As a comparison, if you one-hot encoded the feature set it takes 30 mins to run with `eta = .01`!]

]

.pull-right[

.center.bold.font90[ This model run takes ~1.5 min ]

```r
set.seed(123)
ames_xgb <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 6000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0,
* params = list(eta = .05)
 )

ames_xgb$evaluation_log %>% tail()
##    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
## 1:  703        1939.203       85.80847       22838.30      4117.835
## 2:  704        1934.107       84.71774       22838.41      4117.534
## 3:  705        1929.225       84.02225       22838.51      4117.648
## 4:  706        1924.938       84.76338       22838.19      4117.490
## 5:  707        1921.622       84.35032       22838.57      4117.288
## 6:  708        1916.764       85.21769       22839.00      4116.994
```
]

---
# Tuning

.scrollable90[
.pull-left.font110[
 
1. .opacity[Crank up the trees and tune learning rate with early stopping]
2. Tune tree-specific hyperparameters
 - tree depth
 - instances required to make additional split

* Preferred values: 
   - `max_depth` = 3
   - `min_child_weight` = 1
   - RMSE = 22457.38
]

.pull-right[

.center.bold.font90[ This grid search takes ~30 min ]

```r
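# grid
hyper_grid <- expand.grid(
 eta = .05,
* max_depth = c(1, 3, 5, 7, 9),
* min_child_weight = c(1, 3, 5, 7, 9),
 rmse = 0 # a place to dump results
 )

# grid search
for(i in seq_len(nrow(hyper_grid))) {
 set.seed(123)
 m <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 6000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0,
 params = list(
 eta = hyper_grid$eta[i],
 max_depth = hyper_grid$max_depth[i],
 min_child_weight = hyper_grid$min_child_weight[i]
 )
 )
 hyper_grid$rmse[i] <- min(m$evaluation_log$test_rmse_mean)
}

arrange(hyper_grid, rmse)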
##     eta max_depth min_child_weight     rmse
## 1  0.05         3                1 22457.38
## 2  0.05         3                5 22692.19
## 3  0.05         3                3 22852.76
## 4  0.05         5                1 23052.25
## 5  0.05         5                5 23065.48
## 6  0.05         3                7 23190.42
## 7  0.05         3                9 23243.66
## 8  0.05         5                7 23308.51
## 9  0.05         5                9 23375.43
## 10 0.05         7                1 23446.60
## 11 0.05         5                3 23466.38
## 12 0.05         9                1 23604.30
## 13 0.05         7                5 23844.86
## 14 0.05         7                9 23900.27
## 15 0.05         7                7 23932.79
## 16 0.05         7                3 23970.08
## 17 0.05         9                5 23983.13
## 18 0.05         9                3 24062.60
## 19 0.05         9                7 24102.38
## 20 0.05         9                9 24137.39
## 21 0.05         1                1 26114.08
## 22 0.05         1                3 26310.89
## 23 0.05         1                9 26451.40
## 24 0.05         1                7 26476.04
## 25 0.05         1                5 27237.43
```

]
]

---
# Tuning

.scrollable90[
.pull-left.font110[
 
1. .opacity[Crank up the trees and tune learning rate with early stopping]
2. .opacity[Tune tree-specific hyperparameters]
3. Add stochastic attributes with
 - subsampling rows for each tree
 - subsampling columns for each tree

* Preferred values: 
   - `subsample` = 1
   - `colsample_bytree` = 0.65
   - RMSE = 22206.60
]

.pull-right[

.center.bold.font90[ This grid search takes ~12 min ]

```r
# grid
hyper_grid <- expand.grid(
 eta = .05,
 max_depth = 3, 
 min_child_weight = 1,
* subsample = c(.5, .65, .8, 1),
* colsample_bytree = c(.5, .65, .8, 1),
 rmse = 0 # a place to dump results
 )

# grid search
for(i in seq_len(nrow(hyper_grid))) {
 set.seed(123)
 m <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 6000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0,
* params = list(
 eta = hyper_grid$eta[i],
 max_depth = hyper_grid$max_depth[i],
 min_child_weight = hyper_grid$min_child_weight[i],
* subsample = hyper_grid$subsample[i],
* colsample_bytree = hyper_grid$colsample_bytree[i]
* )
 )
 hyper_grid$rmse[i] <- min(m$evaluation_log$test_rmse_mean)
}

arrange(hyper_grid, rmse)
##     eta max_depth min_child_weight subsample colsample_bytree     rmse
## 1  0.05         3                1      1.00             0.65 22206.60
## 2  0.05         3                1      0.80             0.65 22267.11
## 3  0.05         3                1      0.65             0.80 22287.02
## 4  0.05         3                1      1.00             1.00 22457.38
## 5  0.05         3                1      0.80             0.50 22464.28
## 6  0.05         3                1      1.00             0.80 22479.92
## 7  0.05         3                1      0.80             0.80 22481.30
## 8  0.05         3                1      0.65             0.65 22540.13
## 9  0.05         3                1      0.80             1.00 22569.57
## 10 0.05         3                1      0.65             0.50 22616.20
## 11 0.05         3                1      0.50             0.80 22654.22
## 12 0.05         3                1      0.50             0.50 22692.35
## 13 0.05         3                1      0.65             1.00 22814.75
## 14 0.05         3                1      0.50             0.65 22834.63
## 15 0.05         3                1      1.00             0.50 23261.14
## 16 0.05         3                1      0.50             1.00 23416.40
```

]
]

---
# Tuning

.font110[
 
1. .opacity[Crank up the trees and tune learning rate with early stopping]
2. .opacity[Tune tree-specific hyperparameters]
3. .opacity[Add stochastic attributes with]
4. See if adding regularization helps
 - gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree (values dependent on loss function)
 - lambda: `$L_2$` (ridge) regularizer on weights of trees. Decent values to test: 0.001, 0.01, 0.1, 1, 100, 1000 
 - alpha: `$L_1$` (lasso) regularizer on weights of trees. Decent values to test: 0.001, 0.01, 0.1, 1, 100, 1000
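
For reference, the three penalties can be written as one per-tree complexity term. This is a sketch in common XGBoost notation, which is an assumption here since the slides define no symbols: `$T$` is the number of leaves in a tree and `$w_j$` is the weight of leaf `$j$`.

$$
`\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \left| w_j \right|`
$$

Larger `gamma` discourages extra leaves, while `lambda` and `alpha` shrink the leaf weights toward zero.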

]

---
# Tuning

.scrollable90[
.pull-left.font110[
 
1. .opacity[Crank up the trees and tune learning rate with early stopping]
2. .opacity[Tune tree-specific hyperparameters]
3. .opacity[Add stochastic attributes with]
4. See if adding regularization helps
 - gamma: tested 1, 100, 1000, 10000 -- no effect
 - lambda: tested 0.001, 0.01, 0.1, 1, 100, 1000 -- no effect
 - alpha: tested 0.01, 0.1, 1, 100, 1000, 10000 -- minor effect

* Preferred value:
   - `alpha` = 1e+04
   - RMSE = 22137.86
]

.pull-right[

.center.bold.font90[ This grid search takes ~5 min ]

```r
hyper_grid <- expand.grid(
 eta = .05,
 max_depth = 3, 
 min_child_weight = 1,
 subsample = 1, 
 colsample_bytree = .65,
 #gamma = c(1, 100, 1000, 10000),
 #lambda = c(1e-2, 0.1, 1, 100, 1000, 10000), 
* alpha = c(1e-2, 0.1, 1, 100, 1000, 10000),
 rmse = 0 # a place to dump results
 )

# grid search
for(i in seq_len(nrow(hyper_grid))) {
 set.seed(123)
 m <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 6000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0,
 params = list( 
 eta = hyper_grid$eta[i], 
 max_depth = hyper_grid$max_depth[i],
 min_child_weight = hyper_grid$min_child_weight[i],
* subsample = hyper_grid$subsample[i],
 colsample_bytree = hyper_grid$colsample_bytree[i],
 #gamma = hyper_grid$gamma[i], 
 #lambda = hyper_grid$lambda[i], 
* alpha = hyper_grid$alpha[i]
 ) 
 )
 hyper_grid$rmse[i] <- min(m$evaluation_log$test_rmse_mean)
}

arrange(hyper_grid, rmse)
##    eta max_depth min_child_weight subsample colsample_bytree alpha     rmse
## 1 0.05         3                1         1             0.65 1e+04 22137.86
## 2 0.05         3                1         1             0.65 1e+00 22154.56
## 3 0.05         3                1         1             0.65 1e-01 22189.76
## 4 0.05         3                1         1             0.65 1e-02 22227.70
## 5 0.05         3                1         1             0.65 1e+03 22319.05
## 6 0.05         3                1         1             0.65 1e+02 22379.46
```

]]

---
# Tuning

.scrollable90[
.pull-left.font110[
1. .opacity[Crank up the trees and tune learning rate with early stopping]
2. .opacity[Tune tree-specific hyperparameters]
3. .opacity[Add stochastic attributes with]
4. .opacity[See if adding regularization helps]
5. If you find hyperparameter values that are substantially different from default settings, be sure to assess the learning rate again
6. Rerun final "optimal" model with `xgb.cv()` to get iterations required and then with `xgboost()` to produce final model

.center.bold.font90[.font130[`final_cv`] test RMSE: 20,581.31]

]

.pull-right[

```r
# parameter list
params <- list(
 eta = 0.05,
 max_depth = 3, 
 min_child_weight = 1,
 subsample = 1, 
 colsample_bytree = 0.65,
 alpha = 1e+04
)

# final cv fit
set.seed(123)
final_cv <- xgb.cv(
 data = X,
 label = Y,
 nrounds = 6000,
 objective = "reg:linear",
 early_stopping_rounds = 50, 
 nfold = 10,
 verbose = 0,
* params = params
 )

# train final model
ames_final_xgb <- xgboost(
 data = X,
 label = Y,
* nrounds = final_cv$best_iteration,
 objective = "reg:linear",
* params = params,
 verbose = 0
)
```

]]

---
class: misk-section-slide

.bold.font250[Feature Interpretation]

---
# Feature Interpretation .red[ code chunk 14]

.pull-left[

.center.bold[Feature Importance]

```r
vip::vip(ames_final_xgb, num_features = 25)
```

]

---
# Feature Interpretation .red[ code chunks 15-16]

.pull-left[
.center.bold[Overall_Qual]

```r
ames_final_xgb %>%
  partial(
    pred.var = "Overall_Qual", 
    n.trees = ames_final_xgb$niter, 
    train = X
    ) %>%
  autoplot(rug = TRUE, train = X)
```

<img src="12-gbm-slides_files/figure-html/xgb-pdp-1.png" style="display: block; margin: auto;" />
]

.pull-right[

.center.bold[Gr_Liv_Area]

```r
ames_final_xgb %>%
  partial(
    pred.var = "Gr_Liv_Area", 
    n.trees = ames_final_xgb$niter, 
    grid.resolution = 50, 
    train = X
    ) %>%
  autoplot(rug = TRUE, train = X)
```

]
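
What `partial()` computes can be sketched by hand: fix the feature at each grid value for every row, predict, and average. The sketch below uses a toy linear model on simulated data so it runs without the Ames objects; `partial_dependence()`, `toy`, and `fit` are hypothetical names for illustration.

```r
set.seed(123)
toy <- data.frame(x1 = runif(100), x2 = runif(100))   # hypothetical data
toy$y <- 2 * toy$x1 + toy$x2^2 + rnorm(100, sd = 0.1)
fit <- lm(y ~ x1 + I(x2^2), data = toy)

# average prediction with `var` forced to each grid value in turn
partial_dependence <- function(model, data, var, grid) {
  vapply(grid, function(v) {
    data[[var]] <- v                       # overwrite the feature for all rows
    mean(predict(model, newdata = data))   # average over the other features
  }, numeric(1))
}

grid <- seq(0, 1, length.out = 5)
pd   <- partial_dependence(fit, toy, "x1", grid)
data.frame(x1 = grid, yhat = pd)
```

For this linear model the curve is a straight line whose slope is the `x1` coefficient; for a boosted model the same averaging reveals non-linear shapes.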

---
class: misk-section-slide

.bold.font250[Wrapping Up]

---
# Summary

.scrollable90[

.pull-left[

.bold.center[Random forests:]

* Builds an ensemble of fully grown decision trees (**low bias, high variance**)
 - Correlation between trees is reduced through subsampling the columns
 - Variance is reduced through averaging 
 
* Tuning tends to have minimal impact

* Good accuracy but rarely the best

* Trees are independently grown (embarrassingly parallel)

]

.pull-right[

.bold.center[Gradient boosting machines:]

* Builds an ensemble of small decision trees (**high bias, low variance**)
    - Bias is reduced through sequential learning and fixing past mistakes
    - Variance is controlled with tree parameters & regularization

* Requires more TLC for tuning

* Great accuracy; often a leaderboard model

* Trees are **NOT** independent, but training times are usually pretty fast since trees are not grown too deep; plus XGBoost provides parallel options.

]
]

---
# Packages 📦 to remember

.pull-left.font130[
* Standard GBM
   - __gbm__

* Stochastic GBM
   - __gbm__
   - __h2o__

* Extreme GBM
   - __xgboost__
   - __h2o__
]

.pull-right.font130[

.center[Other packages exist; check out the [machine learning task view](https://cran.r-project.org/web/views/MachineLearning.html)!]
]

---
# Learning More

.pull-left[

.center.font150[[An Introduction to Statistical Learning: book website](http://www-bcf.usc.edu/~gareth/ISL/)]
]

.pull-right[

.center.font150[[The Elements of Statistical Learning: book website](https://web.stanford.edu/~hastie/ElemStatLearn/)]
]

---

# Other Great Resources

.font120[

* [How to explain gradient boosting](https://explained.ai/gradient-boosting/)

* [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)

* [Complete Guide to Parameter Tuning in Gradient Boosting (GBM)](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)

* [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

]

---
class: clear, center, middle, hide-logo

background-image: url(images/any-questions.jpg)
background-position: center
background-size: cover

---
# Back home


.center[https://github.com/misk-data-science/misk-homl]