class: misk-title-slide 

<br><br><br><br><br>
# .font120[Feature & Target Engineering]

---
# Introduction

Data pre-processing and engineering techniques generally refer to the .blue[___addition, deletion, or transformation of data___].

.pull-left[

.center.bold.font120[Thoughts]

- Substantial time commitment
- 1 hr module doesn't do justice
- Not a "sexy" area to study but well worth your time
- Additional resources to start with:
   - [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)
   - [Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)

]

--

.pull-right[

.center.bold.font120[Overview]

- Target engineering
- Missingness
- Feature filtering
- Numeric feature engineering
- Categorical feature engineering
- Dimension reduction
- Proper implementation

]

---
# Prereqs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1]

.pull-left[

.center.bold.font120[Packages]

```r
library(dplyr)
library(ggplot2)
library(rsample)
library(recipes)
library(caret)
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
```

]

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Target Engineering]

---
# Normality correction

.pull-left[

Not a requirement but...

- can improve predictive accuracy for parametric & distance-based models
- can correct for residual assumption violations
- minimizes effects of outliers

plus...

- sometimes used for reframing the business problem as well

.center[_"taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally."_]

]

.pull-right[

<br><br>
<center>
`\(\texttt{Sale_Price} = \beta_0 + \beta_1\texttt{Year_Built} + \epsilon\)`
</center>

<img src="03-engineering-slides_files/figure-html/skewed-residuals-1.png" style="display: block; margin: auto;" />

]

---
# Transformation options

.pull-left[

- log (or log with offset)

- Box-Cox: automates process of finding proper transformation

$$
y(\lambda) =
  \begin{cases}
    \frac{y^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\
    \log y, & \text{if}\ \lambda = 0
  \end{cases}
$$

- Yeo-Johnson: a modification of Box-Cox that accommodates zero and negative values

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
```

]

<img src="03-engineering-slides_files/figure-html/distribution-comparison-1.png" style="display: block; margin: auto;" />
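For example, a minimal sketch (reusing the `ames_train` object from the Prereqs slide) of log-transforming the outcome inside a recipe:

```r
# minimal sketch: log-transform the response as a recipe step
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
```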
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
AmesHousing::ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) + 
    geom_raster() + 
    coord_flip() +
    scale_y_continuous(NULL, expand = c(0, 0)) +
    scale_fill_grey(name = "", labels = c("Present", "Missing")) +
    xlab("Observation") +
    theme(axis.text.y = element_text(size = 4))
```

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/missing-distribution-plot-1.png" style="display: block; margin: auto;" />

]

---
# Visualizing .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
visdat::vis_miss(AmesHousing::ames_raw, cluster = TRUE)
```

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/missing-distribution-plot2-1.png" style="display: block; margin: auto;" />

]

---
# Structural vs random .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4]

.pull-left[

Missing values can arise for many different reasons; however, these reasons are usually lumped into two categories:

* informative missingness
* missingness at random

]

.pull-right[

```r
AmesHousing::ames_raw %>% 
  filter(is.na(`Garage Type`)) %>% 
  select(`Garage Type`, `Garage Cars`, `Garage Area`)
## # A tibble: 157 x 3
##    `Garage Type` `Garage Cars` `Garage Area`
##    <chr>                 <int>         <int>
##  1 <NA>                      0             0
##  2 <NA>                      0             0
##  3 <NA>                      0             0
##  4 <NA>                      0             0
##  5 <NA>                      0             0
##  6 <NA>                      0             0
##  7 <NA>                      0             0
##  8 <NA>                      0             0
##  9 <NA>                      0             0
## 10 <NA>                      0             0
## # … with 147 more rows
```

]

<br>

.center.bold[Determines how you will, and if you can/should, impute.]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g. mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

.center.font80[.red[Actual values] vs .blue[imputed values]]

<img src="03-engineering-slides_files/figure-html/imputation-examples-1.png" style="display: block; margin: auto;" />

]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g. mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

We'll put these pieces together later

```r
step_meanimpute()
step_medianimpute()
step_modeimpute()
step_knnimpute()
step_bagimpute()
```

]
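For instance, a minimal sketch (reusing `ames_train` from earlier; the particular combination of steps and `neighbors = 6` is purely illustrative) of combining two of these imputation approaches in a recipe:

```r
# minimal sketch: median-impute numeric predictors, then
# KNN-impute any remaining predictors
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_medianimpute(all_numeric(), -all_outcomes()) %>%
  step_knnimpute(all_predictors(), neighbors = 6)
```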
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 5]

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)

- Feature selection
   - beyond the scope of this module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

```r
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>% 
  tibble::rownames_to_column() %>% 
  filter(nzv)
##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  273.87500    0.09095043   FALSE TRUE
## 2               Alley   20.40000    0.13642565   FALSE TRUE
## 3        Land_Contour   22.14607    0.18190086   FALSE TRUE
## 4           Utilities 1098.50000    0.09095043   FALSE TRUE
## 5          Land_Slope   21.77083    0.13642565   FALSE TRUE
## 6         Condition_2  217.70000    0.31832651   FALSE TRUE
## 7           Roof_Matl  120.33333    0.27285130   FALSE TRUE
## 8           Bsmt_Cond   20.87234    0.27285130   FALSE TRUE
## 9      BsmtFin_Type_2   24.35065    0.31832651   FALSE TRUE
## 10       BsmtFin_SF_2  386.60000    9.95907231   FALSE TRUE
## 11            Heating  103.14286    0.22737608   FALSE TRUE
## 12    Low_Qual_Fin_SF 1087.00000    1.09140518   FALSE TRUE
## 13      Kitchen_AbvGr   23.66292    0.18190086   FALSE TRUE
## 14         Functional   39.38462    0.31832651   FALSE TRUE
## 15     Enclosed_Porch  102.94444    6.86675762   FALSE TRUE
## 16 Three_season_porch  723.33333    1.09140518   FALSE TRUE
## 17       Screen_Porch  224.00000    4.36562074   FALSE TRUE
## 18          Pool_Area 2190.00000    0.45475216   FALSE TRUE
## 19            Pool_QC  730.00000    0.22737608   FALSE TRUE
## 20       Misc_Feature   32.19697    0.22737608   FALSE TRUE
## 21           Misc_Val  151.92857    1.36425648   FALSE TRUE
```

]

---
# Options for filtering

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)

- Feature selection
   - beyond the scope of this module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

We'll put these pieces together later

```r
step_zv()
step_nzv()
step_corr()
```

]

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Numeric Feature Engineering]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g. GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g. KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/standardizing-1.png" style="display: block; margin: auto;" />

]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g. GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g. KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
step_center()
step_scale()
```

]
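As a minimal sketch (again reusing `ames_train`), centering and scaling all numeric predictors looks like:

```r
# minimal sketch: standardize numeric predictors (mean 0, sd 1)
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
```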
---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Categorical Feature Engineering]

---
# One-hot & Dummy encoding

.pull-left[

Many models require all predictor variables to be numeric (e.g. GLMs, SVMs, NNets)

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> b </td>
  </tr>
</tbody>
</table>

Two most common approaches include...

]

.pull-right[

.bold.center[Dummy encoding]

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> xc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

.bold.center[One-hot encoding]

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> xb </th>
   <th style="text-align:right;"> xc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

]
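A minimal sketch of both encodings with recipes (reusing `ames_train`):

```r
# minimal sketch: dummy encode all nominal predictors...
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), -all_outcomes())

# ...or one-hot encode them instead
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
```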
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 6]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Quality variables with natural ordering]

```r
ames_train %>% select(matches("Qual|QC|Qu"))
## # A tibble: 2,199 x 9
##    Overall_Qual Exter_Qual Bsmt_Qual Heating_QC Low_Qual_Fin_SF Kitchen_Qual
##    <fct>        <fct>      <fct>     <fct>                <int> <fct>       
##  1 Above_Avera… Typical    Typical   Fair                     0 Typical     
##  2 Average      Typical    Typical   Typical                  0 Typical     
##  3 Above_Avera… Typical    Typical   Typical                  0 Good        
##  4 Average      Typical    Good      Good                     0 Typical     
##  5 Above_Avera… Typical    Typical   Excellent                0 Good        
##  6 Very_Good    Good       Good      Excellent                0 Good        
##  7 Very_Good    Good       Good      Excellent                0 Good        
##  8 Very_Good    Good       Good      Excellent                0 Good        
##  9 Above_Avera… Typical    Good      Good                     0 Typical     
## 10 Above_Avera… Typical    Good      Good                     0 Typical     
## # … with 2,189 more rows, and 3 more variables: Fireplace_Qu <fct>,
## #   Garage_Qual <fct>, Pool_QC <fct>
```

]

---
# Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 7]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Original encoding for `Overall_Qual`]

```r
count(ames_train, Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual       n
##    <fct>          <int>
##  1 Very_Poor          4
##  2 Poor               9
##  3 Fair              31
##  4 Below_Average    175
##  5 Average          602
##  6 Above_Average    560
##  7 Good             437
##  8 Very_Good        275
##  9 Excellent         85
## 10 Very_Excellent    21
```

]

---
# Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 8]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Label/ordinal encoding for `Overall_Qual`]

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(Overall_Qual) %>%
  prep(ames_train) %>%
  bake(ames_train) %>%
  count(Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual     n
##           <dbl> <int>
##  1            1     4
##  2            2     9
##  3            3    31
##  4            4   175
##  5            5   602
##  6            6   560
##  7            7   437
##  8            8   275
##  9            9    85
## 10           10    21
```

]

---
# Common categorical encodings

We'll put these pieces together later

```r
step_dummy()
step_dummy(one_hot = TRUE)
step_integer()
step_ordinalscore()
```

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Dimension Reduction]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine/amalgamate the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/pca-1.png" style="display: block; margin: auto;" />

]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine/amalgamate the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

We'll put these pieces together later

```r
step_pca()
step_kpca()
step_pls()
step_spatialsign()
```

]
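As a minimal sketch (reusing `ames_train`; the 95% threshold mirrors the point above), PCA on standardized numeric predictors:

```r
# minimal sketch: PCA assumes standardized inputs, so center and
# scale first, then keep enough PCs to explain 95% of the variance
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes(), threshold = 0.95)
```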
---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Blueprints]

---
# Sequential steps

.pull-left[

.bold.center.font120[Some thoughts to consider]

- If using a log or Box-Cox transformation, don't center the data first or do any operations that might make the data non-positive.
- Standardize your numeric features prior to one-hot/dummy encoding.
- If you are lumping infrequently occurring categories together, do so before one-hot/dummy encoding.
- Although you can perform dimension reduction on categorical features, for feature engineering purposes it is most common to do so on numeric features.

]

--

.pull-right[

.bold.center.font120[Suggested ordering]

1. Filter out zero or near-zero variance features
2. Perform imputation if required
3. Normalize to resolve numeric feature skewness
4. Standardize (center and scale) numeric features
5. Perform dimension reduction (e.g. PCA) on numeric features
6. Create one-hot or dummy encoded features

]

---
# Data leakage

___Data leakage___ is when information from outside the training dataset is used to create the model.

- Often occurs when doing feature engineering
- Feature engineering should be done in isolation within each resampling iteration so no information leaks in from the holdout data

<img src="images/data-leakage.png" width="80%" height="80%" style="display: block; margin: auto;" />

---
# Putting the process together

.pull-left[

.font120[

* __recipes__ provides a convenient way to create feature engineering blueprints

]

]

.pull-right[

<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png" width="70%" height="70%" style="display: block; margin: auto;" />

]

.center.bold.font120[https://tidymodels.github.io/recipes/index.html]

---
# Putting the process together

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9]

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. .bold[recipe: define your pre-processing blueprint]
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

<br>

.center.blue[Check out all the available `step_xxx()` functions at http://bit.ly/step_functions]

]

.pull-right[

```r
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_integer(matches("Qual|Cond|QC|Qu"))

blueprint
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Operations:
## 
## Sparse, unbalanced variable filter on all_nominal
## Centering for all_numeric, -, all_outcomes()
## Scaling for all_numeric, -, all_outcomes()
## Integer encoding for matches, Qual|Cond|QC|Qu
```

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 10]

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. .bold[prepare: estimate parameters based on training data]
   3. bake/juice: apply blueprint to new data

]

.pull-right[

```r
prepare <- prep(blueprint, training = ames_train)
prepare
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Training data contained 2199 data points and no missing data.
## 
## Operations:
## 
## Sparse, unbalanced variable filter removed Street, Alley, Land_Contour, ... [trained]
## Centering for Lot_Frontage, Lot_Area, ... [trained]
## Scaling for Lot_Frontage, Lot_Area, ... [trained]
## Integer encoding for Condition_1, Overall_Qual, Overall_Cond, ... [trained]
```

]
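As a side note, `juice()` extracts the already-processed training data from a prepped recipe, avoiding a redundant bake of `ames_train`; a minimal sketch:

```r
# minimal sketch: equivalent to bake(prepare, new_data = ames_train)
juiced_train <- juice(prepare)
```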
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 11]

.scrollable90[

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. .bold[bake: apply blueprint to new data]

]

.pull-right[

```r
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
baked_train
## # A tibble: 2,199 x 68
##    MS_SubClass MS_Zoning Lot_Frontage Lot_Area Lot_Shape Lot_Config Neighborhood
##    <fct>       <fct>            <dbl>    <dbl> <fct>     <fct>      <fct>       
##  1 One_Story_… Resident…        2.54    2.51   Slightly… Corner     North_Ames  
##  2 One_Story_… Resident…        0.678   0.165  Regular   Inside     North_Ames  
##  3 One_Story_… Resident…        0.709   0.472  Slightly… Corner     North_Ames  
##  4 Two_Story_… Resident…        0.495   0.421  Slightly… Inside     Gilbert     
##  5 Two_Story_… Resident…        0.617  -0.0267 Slightly… Inside     Gilbert     
##  6 One_Story_… Resident…       -0.510  -0.615  Regular   Inside     Stone_Brook 
##  7 One_Story_… Resident…       -0.449  -0.605  Slightly… Inside     Stone_Brook 
##  8 One_Story_… Resident…       -0.571  -0.560  Slightly… Inside     Stone_Brook 
##  9 Two_Story_… Resident…        0.526  -0.0241 Slightly… Corner     Gilbert     
## 10 Two_Story_… Resident…        0.160  -0.210  Slightly… Inside     Gilbert     
## # … with 2,189 more rows, and 61 more variables: Condition_1 <dbl>,
## #   Bldg_Type <fct>, House_Style <fct>, Overall_Qual <dbl>, Overall_Cond <dbl>,
## #   Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <fct>,
## #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Qual <dbl>, Exter_Cond <dbl>, Foundation <fct>,
## #   Bsmt_Qual <dbl>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
## #   BsmtFin_SF_1 <dbl>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## #   Total_Bsmt_SF <dbl>, Heating_QC <dbl>, Central_Air <fct>, Electrical <fct>,
## #   First_Flr_SF <dbl>, Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>,
## #   Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## #   Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## #   Kitchen_Qual <dbl>, TotRms_AbvGrd <dbl>, Fireplaces <dbl>,
## #   Fireplace_Qu <dbl>, Garage_Type <fct>, Garage_Finish <fct>,
## #   Garage_Cars <dbl>, Garage_Area <dbl>, Garage_Qual <dbl>, Garage_Cond <dbl>,
## #   Paved_Drive <fct>, Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>,
## #   Enclosed_Porch <dbl>, Three_season_porch <dbl>, Screen_Porch <dbl>,
## #   Pool_Area <dbl>, Fence <fct>, Misc_Val <dbl>, Mo_Sold <dbl>,
## #   Year_Sold <dbl>, Sale_Type <fct>, Sale_Condition <dbl>, Longitude <dbl>,
## #   Latitude <dbl>, Sale_Price <int>
```

]

]

---
# Simplifying with __caret__

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake: apply blueprint to new data

* Luckily, __caret__ simplifies this process for us.
   1. We supply __caret__ a recipe
   2. __caret__ will prepare & bake within each resample

]

.pull-right[

<br>
<img src="https://media.giphy.com/media/Rl9Yqavfj2Ula/giphy.gif" width="90%" height="90%" style="display: block; margin: auto;" />

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12]

.scrollable90[

.pull-left[

Let's add a blueprint to our modeling process for analyzing the Ames housing data:

1. Split into training vs testing data
2. .blue[Create feature engineering blueprint]
3. Specify a resampling procedure
4. Create our hyperparameter grid
5. Execute grid search
6. Evaluate performance

]

.pull-right[

.center.bold[
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
This grid search takes ~8 min
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
]

```r
# 1. stratified sampling with the rsample package
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)

# 2. Feature engineering
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

# 3. create a resampling method
cv <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 5
  )

# 4. create a hyperparameter grid search
hyper_grid <- expand.grid(k = seq(2, 25, by = 1))

# 5. execute grid search with knn model
#    use RMSE as preferred metric
knn_fit <- train(
  blueprint, 
  data = ames_train, 
  method = "knn", 
  trControl = cv, 
  tuneGrid = hyper_grid,
  metric = "RMSE"
  )

# 6. evaluate results
# print model results
knn_fit
## k-Nearest Neighbors 
## 
## 2053 samples
##   80 predictor
## 
## Recipe steps: nzv, integer, center, scale, dummy 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1847, 1847, 1849, 1847, 1847, 1849, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    2  36157.20  0.8021972  22517.39
##    3  35146.27  0.8150895  21605.78
##    4  34894.07  0.8192975  21391.95
##    5  34345.23  0.8275098  20991.58
##    6  34019.05  0.8326574  20900.86
##    7  33617.98  0.8395995  20751.00
##    8  33546.98  0.8421721  20718.21
##    9  33404.00  0.8449132  20676.68
##   10  33249.63  0.8474507  20654.95
##   11  33136.92  0.8498865  20619.54
##   12  33086.33  0.8516700  20636.61
##   13  33115.58  0.8524821  20685.93
##   14  33158.91  0.8531012  20723.65
##   15  33218.85  0.8538323  20795.80
##   16  33239.91  0.8544183  20832.18
##   17  33301.91  0.8543697  20944.86
##   18  33356.75  0.8545305  21023.53
##   19  33384.70  0.8548460  21082.04
##   20  33425.16  0.8549638  21143.45
##   21  33509.65  0.8546262  21232.24
##   22  33571.28  0.8543670  21280.19
##   23  33596.00  0.8542518  21324.64
##   24  33671.68  0.8541201  21385.34
##   25  33730.53  0.8540286  21419.32
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 12.

# plot cross validation results
ggplot(knn_fit$results, aes(k, RMSE)) + 
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::dollar)
```

<img src="03-engineering-slides_files/figure-html/example-blue-print-application-1.png" style="display: block; margin: auto;" />

]

]

---
# Putting the process together

.center.bold.font120[Feature engineering alone reduced our error by $10,000!]

<img src="https://media1.tenor.com/images/2b6d0826f02a9ba7c9d4384a740013e9/tenor.gif?itemid=5531028" width="90%" height="90%" style="display: block; margin: auto;" />

---
class: clear, center, middle, hide-logo

background-image: url(images/any-questions.jpg)
background-position: center
background-size: cover

---
# Back home

<br><br><br><br>

[.center[
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/misk-data-science/misk-homl)

.center[https://github.com/misk-data-science/misk-homl]