class: misk-title-slide 

<br><br><br><br><br>
# .font120[Feature & Target Engineering]

---
# Introduction

Data pre-processing and engineering techniques generally refer to the .blue[___addition, deletion, or transformation of data___].

.pull-left[

.center.bold.font120[Thoughts]

- Substantial time commitment
- 1 hr module doesn't do justice
- Not a "sexy" area to study but well worth your time
- Additional resources to start with:
   - [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)
   - [Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)

]

--

.pull-right[

.center.bold.font120[Overview]

- Target engineering
- Missingness
- Feature filtering
- Numeric feature engineering
- Categorical feature engineering
- Dimension reduction
- Proper implementation

]

---
# Prereqs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1]

.pull-left[

.center.bold.font120[Packages]

```r
library(dplyr)
library(ggplot2)
library(rsample)
library(recipes)
library(caret)
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
```

]

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Target Engineering]

---
# Normality correction

.pull-left[

Not a requirement but...

- can improve predictive accuracy for parametric & distance-based models
- can correct for residual assumption violations
- minimizes effects of outliers

plus...

- sometimes used for reframing the business problem as well

.center[_"taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally."_]

]

.pull-right[

<br><br>
<center>
`\(\texttt{Sale_Price} = \beta_0 + \beta_1\texttt{Year_Built} + \epsilon\)`
</center>

<img src="03-engineering-slides_files/figure-html/skewed-residuals-1.png" style="display: block; margin: auto;" />

]

---
# Transformation options

.pull-left[

- log (or log with offset)

- Box-Cox: automates process of finding proper transformation

$$
y(\lambda) =
  \begin{cases}
    \frac{y^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\
    \log y, & \text{if}\ \lambda = 0
  \end{cases}
$$

- Yeo-Johnson: a modification of Box-Cox that accommodates zero and negative values

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
```

]

<img src="03-engineering-slides_files/figure-html/distribution-comparison-1.png" style="display: block; margin: auto;" />
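For example, a minimal sketch (reusing the `ames_train` object from the Prereqs slide) of log-transforming the outcome inside a recipe:

```r
# minimal sketch: log-transform the response as a recipe step
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
```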
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
AmesHousing::ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) + 
    geom_raster() + 
    coord_flip() +
    scale_y_continuous(NULL, expand = c(0, 0)) +
    scale_fill_grey(name = "", labels = c("Present", "Missing")) +
    xlab("Observation") +
    theme(axis.text.y = element_text(size = 4))
```

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/missing-distribution-plot-1.png" style="display: block; margin: auto;" />

]

---
# Visualizing .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
visdat::vis_miss(AmesHousing::ames_raw, cluster = TRUE)
```

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/missing-distribution-plot2-1.png" style="display: block; margin: auto;" />

]

---
# Structural vs random .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4]

.pull-left[

Missing values can arise for many different reasons; however, these reasons are usually lumped into two categories:

* informative missingness
* missingness at random

]

.pull-right[

```r
AmesHousing::ames_raw %>% 
  filter(is.na(`Garage Type`)) %>% 
  select(`Garage Type`, `Garage Cars`, `Garage Area`)
## # A tibble: 157 x 3
##    `Garage Type` `Garage Cars` `Garage Area`
##    <chr>                 <int>         <int>
##  1 <NA>                      0             0
##  2 <NA>                      0             0
##  3 <NA>                      0             0
##  4 <NA>                      0             0
##  5 <NA>                      0             0
##  6 <NA>                      0             0
##  7 <NA>                      0             0
##  8 <NA>                      0             0
##  9 <NA>                      0             0
## 10 <NA>                      0             0
## # … with 147 more rows
```

]

<br>

.center.bold[Determines how you will, and if you can/should, impute.]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g. mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

.center.font80[.red[Actual values] vs .blue[imputed values]]

<img src="03-engineering-slides_files/figure-html/imputation-examples-1.png" style="display: block; margin: auto;" />

]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g. mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

We'll put these pieces together later

```r
step_meanimpute()
step_medianimpute()
step_modeimpute()
step_knnimpute()
step_bagimpute()
```

]
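For instance, a minimal sketch (reusing `ames_train` from earlier; the particular combination of steps and `neighbors = 6` is purely illustrative) of combining two of these imputation approaches in a recipe:

```r
# minimal sketch: median-impute numeric predictors, then
# KNN-impute any remaining predictors
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_medianimpute(all_numeric(), -all_outcomes()) %>%
  step_knnimpute(all_predictors(), neighbors = 6)
```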
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 5]

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)

- Feature selection
   - beyond the scope of this module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

```r
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>% 
  tibble::rownames_to_column() %>% 
  filter(nzv)
##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  273.87500    0.09095043   FALSE TRUE
## 2               Alley   20.40000    0.13642565   FALSE TRUE
## 3        Land_Contour   22.14607    0.18190086   FALSE TRUE
## 4           Utilities 1098.50000    0.09095043   FALSE TRUE
## 5          Land_Slope   21.77083    0.13642565   FALSE TRUE
## 6         Condition_2  217.70000    0.31832651   FALSE TRUE
## 7           Roof_Matl  120.33333    0.27285130   FALSE TRUE
## 8           Bsmt_Cond   20.87234    0.27285130   FALSE TRUE
## 9      BsmtFin_Type_2   24.35065    0.31832651   FALSE TRUE
## 10       BsmtFin_SF_2  386.60000    9.95907231   FALSE TRUE
## 11            Heating  103.14286    0.22737608   FALSE TRUE
## 12    Low_Qual_Fin_SF 1087.00000    1.09140518   FALSE TRUE
## 13      Kitchen_AbvGr   23.66292    0.18190086   FALSE TRUE
## 14         Functional   39.38462    0.31832651   FALSE TRUE
## 15     Enclosed_Porch  102.94444    6.86675762   FALSE TRUE
## 16 Three_season_porch  723.33333    1.09140518   FALSE TRUE
## 17       Screen_Porch  224.00000    4.36562074   FALSE TRUE
## 18          Pool_Area 2190.00000    0.45475216   FALSE TRUE
## 19            Pool_QC  730.00000    0.22737608   FALSE TRUE
## 20       Misc_Feature   32.19697    0.22737608   FALSE TRUE
## 21           Misc_Val  151.92857    1.36425648   FALSE TRUE
```

]

---
# Options for filtering

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)

- Feature selection
   - beyond the scope of this module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

We'll put these pieces together later

```r
step_zv()
step_nzv()
step_corr()
```

]

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Numeric Feature Engineering]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g. GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g. KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/standardizing-1.png" style="display: block; margin: auto;" />

]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g. GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g. KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
step_center()
step_scale()
```

]
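As a minimal sketch (again reusing `ames_train`), centering and scaling all numeric predictors looks like:

```r
# minimal sketch: standardize numeric predictors (mean 0, sd 1)
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
```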
---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Categorical Feature Engineering]

---
# One-hot & Dummy encoding

.pull-left[

Many models require all predictor variables to be numeric (e.g. GLMs, SVMs, NNets)

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:left;"> c </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> b </td>
  </tr>
</tbody>
</table>

Two most common approaches include...

]

.pull-right[

.bold.center[Dummy encoding]

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> xc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

.bold.center[One-hot encoding]

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> xb </th>
   <th style="text-align:right;"> xc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

]
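A minimal sketch of both encodings with recipes (reusing `ames_train`):

```r
# minimal sketch: dummy encode all nominal predictors...
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), -all_outcomes())

# ...or one-hot encode them instead
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
```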
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 6]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Quality variables with natural ordering]

```r
ames_train %>% select(matches("Qual|QC|Qu"))
## # A tibble: 2,199 x 9
##    Overall_Qual Exter_Qual Bsmt_Qual Heating_QC Low_Qual_Fin_SF Kitchen_Qual
##    <fct>        <fct>      <fct>     <fct>                <int> <fct>       
##  1 Above_Avera… Typical    Typical   Fair                     0 Typical     
##  2 Average      Typical    Typical   Typical                  0 Typical     
##  3 Above_Avera… Typical    Typical   Typical                  0 Good        
##  4 Average      Typical    Good      Good                     0 Typical     
##  5 Above_Avera… Typical    Typical   Excellent                0 Good        
##  6 Very_Good    Good       Good      Excellent                0 Good        
##  7 Very_Good    Good       Good      Excellent                0 Good        
##  8 Very_Good    Good       Good      Excellent                0 Good        
##  9 Above_Avera… Typical    Good      Good                     0 Typical     
## 10 Above_Avera… Typical    Good      Good                     0 Typical     
## # … with 2,189 more rows, and 3 more variables: Fireplace_Qu <fct>,
## #   Garage_Qual <fct>, Pool_QC <fct>
```

]

---
# Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 7]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Original encoding for `Overall_Qual`]

```r
count(ames_train, Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual       n
##    <fct>          <int>
##  1 Very_Poor          4
##  2 Poor               9
##  3 Fair              31
##  4 Below_Average    175
##  5 Average          602
##  6 Above_Average    560
##  7 Good             437
##  8 Very_Good        275
##  9 Excellent         85
## 10 Very_Excellent    21
```

]

---
# Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 8]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Label/ordinal encoding for `Overall_Qual`]

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(Overall_Qual) %>%
  prep(ames_train) %>%
  bake(ames_train) %>%
  count(Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual     n
##           <dbl> <int>
##  1            1     4
##  2            2     9
##  3            3    31
##  4            4   175
##  5            5   602
##  6            6   560
##  7            7   437
##  8            8   275
##  9            9    85
## 10           10    21
```

]

---
# Common categorical encodings

We'll put these pieces together later

```r
step_dummy()
step_dummy(one_hot = TRUE)
step_integer()
step_ordinalscore()
```

---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Dimension Reduction]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine/amalgamate the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

<img src="03-engineering-slides_files/figure-html/pca-1.png" style="display: block; margin: auto;" />

]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine/amalgamate the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

We'll put these pieces together later

```r
step_pca()
step_kpca()
step_pls()
step_spatialsign()
```

]
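As a minimal sketch (reusing `ames_train`; the 95% threshold mirrors the point above), PCA on standardized numeric predictors:

```r
# minimal sketch: PCA assumes standardized inputs, so center and
# scale first, then keep enough PCs to explain 95% of the variance
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes(), threshold = 0.95)
```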
---
class: misk-section-slide 

<br><br><br><br><br><br><br>
.bold.font250[Blueprints]

---
# Sequential steps

.pull-left[

.bold.center.font120[Some thoughts to consider]

- If using a log or Box-Cox transformation, don't center the data first or do any operations that might make the data non-positive.
- Standardize your numeric features prior to one-hot/dummy encoding.
- If you are lumping infrequently occurring categories together, do so before one-hot/dummy encoding.
- Although you can perform dimension reduction on categorical features, for feature engineering purposes it is most common to do so on numeric features.

]

--

.pull-right[

.bold.center.font120[Suggested ordering]

1. Filter out zero or near-zero variance features
2. Perform imputation if required
3. Normalize to resolve numeric feature skewness
4. Standardize (center and scale) numeric features
5. Perform dimension reduction (e.g. PCA) on numeric features
6. Create one-hot or dummy encoded features

]

---
# Data leakage

___Data leakage___ is when information from outside the training dataset is used to create the model.

- Often occurs when doing feature engineering
- Feature engineering should be done in isolation within each resampling iteration so no information leaks in from the holdout data

<img src="images/data-leakage.png" width="80%" height="80%" style="display: block; margin: auto;" />

---
# Putting the process together

.pull-left[

.font120[

* __recipes__ provides a convenient way to create feature engineering blueprints

]

]

.pull-right[

<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png" width="70%" height="70%" style="display: block; margin: auto;" />

]

.center.bold.font120[https://tidymodels.github.io/recipes/index.html]

---
# Putting the process together

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9]

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. .bold[recipe: define your pre-processing blueprint]
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

<br>

.center.blue[Check out all the available `step_xxx()` functions at http://bit.ly/step_functions]

]

.pull-right[

```r
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_integer(matches("Qual|Cond|QC|Qu"))

blueprint
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Operations:
## 
## Sparse, unbalanced variable filter on all_nominal
## Centering for all_numeric, -, all_outcomes()
## Scaling for all_numeric, -, all_outcomes()
## Integer encoding for matches, Qual|Cond|QC|Qu
```

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 10]

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. .bold[prepare: estimate parameters based on training data]
   3. bake/juice: apply blueprint to new data

]

.pull-right[

```r
prepare <- prep(blueprint, training = ames_train)
prepare
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Training data contained 2199 data points and no missing data.
## 
## Operations:
## 
## Sparse, unbalanced variable filter removed Street, Alley, Land_Contour, ... [trained]
## Centering for Lot_Frontage, Lot_Area, ... [trained]
## Scaling for Lot_Frontage, Lot_Area, ... [trained]
## Integer encoding for Condition_1, Overall_Qual, Overall_Cond, ... [trained]
```

]
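As a side note, `juice()` extracts the already-processed training data from a prepped recipe, avoiding a redundant bake of `ames_train`; a minimal sketch:

```r
# minimal sketch: equivalent to bake(prepare, new_data = ames_train)
juiced_train <- juice(prepare)
```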
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 11]

.scrollable90[

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. .bold[bake: apply blueprint to new data]

]

.pull-right[

```r
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
baked_train
## # A tibble: 2,199 x 68
##    MS_SubClass MS_Zoning Lot_Frontage Lot_Area Lot_Shape Lot_Config Neighborhood
##    <fct>       <fct>            <dbl>    <dbl> <fct>     <fct>      <fct>       
##  1 One_Story_… Resident…        2.54    2.51   Slightly… Corner     North_Ames  
##  2 One_Story_… Resident…        0.678   0.165  Regular   Inside     North_Ames  
##  3 One_Story_… Resident…        0.709   0.472  Slightly… Corner     North_Ames  
##  4 Two_Story_… Resident…        0.495   0.421  Slightly… Inside     Gilbert     
##  5 Two_Story_… Resident…        0.617  -0.0267 Slightly… Inside     Gilbert     
##  6 One_Story_… Resident…       -0.510  -0.615  Regular   Inside     Stone_Brook 
##  7 One_Story_… Resident…       -0.449  -0.605  Slightly… Inside     Stone_Brook 
##  8 One_Story_… Resident…       -0.571  -0.560  Slightly… Inside     Stone_Brook 
##  9 Two_Story_… Resident…        0.526  -0.0241 Slightly… Corner     Gilbert     
## 10 Two_Story_… Resident…        0.160  -0.210  Slightly… Inside     Gilbert     
## # … with 2,189 more rows, and 61 more variables: Condition_1 <dbl>,
## #   Bldg_Type <fct>, House_Style <fct>, Overall_Qual <dbl>, Overall_Cond <dbl>,
## #   Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <fct>,
## #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Qual <dbl>, Exter_Cond <dbl>, Foundation <fct>,
## #   Bsmt_Qual <dbl>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
## #   BsmtFin_SF_1 <dbl>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## #   Total_Bsmt_SF <dbl>, Heating_QC <dbl>, Central_Air <fct>, Electrical <fct>,
## #   First_Flr_SF <dbl>, Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>,
## #   Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## #   Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## #   Kitchen_Qual <dbl>, TotRms_AbvGrd <dbl>, Fireplaces <dbl>,
## #   Fireplace_Qu <dbl>, Garage_Type <fct>, Garage_Finish <fct>,
## #   Garage_Cars <dbl>, Garage_Area <dbl>, Garage_Qual <dbl>, Garage_Cond <dbl>,
## #   Paved_Drive <fct>, Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>,
## #   Enclosed_Porch <dbl>, Three_season_porch <dbl>, Screen_Porch <dbl>,
## #   Pool_Area <dbl>, Fence <fct>, Misc_Val <dbl>, Mo_Sold <dbl>,
## #   Year_Sold <dbl>, Sale_Type <fct>, Sale_Condition <dbl>, Longitude <dbl>,
## #   Latitude <dbl>, Sale_Price <int>
```

]

]

---
# Simplifying with __caret__

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake: apply blueprint to new data

* Luckily, __caret__ simplifies this process for us.
   1. We supply __caret__ a recipe
   2. __caret__ will prepare & bake within each resample

]

.pull-right[

<br>
<img src="https://media.giphy.com/media/Rl9Yqavfj2Ula/giphy.gif" width="90%" height="90%" style="display: block; margin: auto;" />

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12]

.scrollable90[

.pull-left[

Let's add a blueprint to our modeling process for analyzing the Ames housing data:

1. Split into training vs testing data
2. .blue[Create feature engineering blueprint]
3. Specify a resampling procedure
4. Create our hyperparameter grid
5. Execute grid search
6. Evaluate performance

]

.pull-right[

.center.bold[
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
This grid search takes ~8 min
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
]

```r
# 1. stratified sampling with the rsample package
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)

# 2. Feature engineering
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

# 3. create a resampling method
cv <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 5
  )

# 4. create a hyperparameter grid search
hyper_grid <- expand.grid(k = seq(2, 25, by = 1))

# 5. execute grid search with knn model
#    use RMSE as preferred metric
knn_fit <- train(
  blueprint, 
  data = ames_train, 
  method = "knn", 
  trControl = cv, 
  tuneGrid = hyper_grid,
  metric = "RMSE"
  )

# 6. evaluate results
# print model results
knn_fit
## k-Nearest Neighbors 
## 
## 2053 samples
##   80 predictor
## 
## Recipe steps: nzv, integer, center, scale, dummy 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1847, 1847, 1849, 1847, 1847, 1849, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    2  36157.20  0.8021972  22517.39
##    3  35146.27  0.8150895  21605.78
##    4  34894.07  0.8192975  21391.95
##    5  34345.23  0.8275098  20991.58
##    6  34019.05  0.8326574  20900.86
##    7  33617.98  0.8395995  20751.00
##    8  33546.98  0.8421721  20718.21
##    9  33404.00  0.8449132  20676.68
##   10  33249.63  0.8474507  20654.95
##   11  33136.92  0.8498865  20619.54
##   12  33086.33  0.8516700  20636.61
##   13  33115.58  0.8524821  20685.93
##   14  33158.91  0.8531012  20723.65
##   15  33218.85  0.8538323  20795.80
##   16  33239.91  0.8544183  20832.18
##   17  33301.91  0.8543697  20944.86
##   18  33356.75  0.8545305  21023.53
##   19  33384.70  0.8548460  21082.04
##   20  33425.16  0.8549638  21143.45
##   21  33509.65  0.8546262  21232.24
##   22  33571.28  0.8543670  21280.19
##   23  33596.00  0.8542518  21324.64
##   24  33671.68  0.8541201  21385.34
##   25  33730.53  0.8540286  21419.32
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 12.

# plot cross validation results
ggplot(knn_fit$results, aes(k, RMSE)) + 
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::dollar)
```

<img src="03-engineering-slides_files/figure-html/example-blue-print-application-1.png" style="display: block; margin: auto;" />

]

]

---
# Putting the process together

.center.bold.font120[Feature engineering alone reduced our error by $10,000!]

<img src="https://media1.tenor.com/images/2b6d0826f02a9ba7c9d4384a740013e9/tenor.gif?itemid=5531028" width="90%" height="90%" style="display: block; margin: auto;" />

---
class: clear, center, middle, hide-logo

background-image: url(images/any-questions.jpg)
background-position: center
background-size: cover

---
# Back home

<br><br><br><br>

[.center[
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/misk-data-science/misk-homl)

.center[https://github.com/misk-data-science/misk-homl]