Data pre-processing and engineering techniques generally refer to the addition, deletion, or transformation of data. The time spent on identifying data engineering needs can be significant and requires you to spend substantial time understanding your data…or as Leo Breiman said “live with your data before you plunge into modeling” (Breiman and others 2001, 201). Although this course primarily focuses on applying machine learning algorithms, feature engineering can make or break an algorithm’s predictive ability and deserves your continued focus and education.

We will not cover all the potential ways of implementing feature engineering; however, we'll cover several fundamental pre-processing tasks that have the potential to significantly improve modeling performance. Moreover, different models have different sensitivities to the type of target and feature values in the model, and we will try to highlight some of these concerns. For more in-depth coverage of feature engineering, please refer to Kuhn and Johnson (2019) and Zheng and Casari (2018).

Learning objectives

By the end of this module you will know:

  • When and how to transform the response variable (“target engineering”).
  • How to identify and deal with missing values.
  • When to filter unnecessary features.
  • Common ways to transform numeric features.
  • Common ways to transform categorical features.
  • How to apply dimension reduction.
  • How to properly combine multiple pre-processing steps into the modeling process.

Prerequisites

# Helper packages
import missingno as msno
import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_density, geom_line, geom_point, ggtitle

# Modeling pre-processing with scikit-learn functionality
from sklearn.model_selection import train_test_split
from sklearn.compose import TransformedTargetRegressor
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

# Modeling pre-processing with non-scikit-learn packages
from category_encoders.ordinal import OrdinalEncoder
from feature_engine.encoding import RareLabelEncoder

# Modeling
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]

# Helper packages
library(tidyverse)
library(naniar)

# Modeling process
library(tidymodels)
# Ames housing data
ames = read_csv("data/ames.csv")

# create training/test split
set.seed(123)
split  <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
train  <- training(split)
test   <- testing(split)

Target engineering

Although not always a requirement, transforming the response variable can lead to predictive improvement, especially with parametric models which require that certain assumptions about the model be met. For instance, ordinary linear regression models assume that the prediction errors (and hence the response) are normally distributed. This is usually fine, except when the prediction target has heavy tails (i.e., outliers) or is skewed in one direction or the other. In these cases, the normality assumption likely does not hold. For example, as we saw in the data splitting section in the last module, the response variable for the Ames housing data (Sale_Price) is right (or positively) skewed ranging from $12,789 to $755,000. A simple linear model, say \(\text{Sale_Price}=\beta_{0} + \beta_{1} \text{Year_Built} + \epsilon\), often assumes the error term \(\epsilon\) (and hence Sale_Price) is normally distributed; fortunately, a simple log (or similar) transformation of the response can often help alleviate this concern as the below plot illustrates.

Figure: Transforming the response variable to minimize skewness can resolve concerns with non-normally distributed errors.

Furthermore, using a log (or other) transformation to minimize the response skewness can be used for shaping the business problem as well. For example, in the House Prices: Advanced Regression Techniques Kaggle competition1, which used the Ames housing data, the competition focused on using a log transformed Sale Price response because “…taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.” This would be an alternative to using the root mean squared logarithmic error (RMSLE) loss function as discussed in the last module.

There are three common approaches to help correct for positively skewed target variables:

  1. Normalize with a log transformation. This will transform most right-skewed distributions to be approximately normal.

  2. If your response has negative values or zeros then a log transformation will produce NaNs and -Infs, respectively (you cannot take the logarithm of a negative number). If the nonpositive response values are small (say between -0.99 and 0) then you can apply a small offset such as in log1p() which adds 1 to the value prior to applying a log transformation. If your data consists of values \(\le -1\), use the Yeo-Johnson transformation instead.

  3. (Preferred) Use a Box-Cox transformation. A Box-Cox transformation is more flexible than (but also includes as a special case) the log transformation and will find an appropriate transformation from a family of power transforms that will transform the variable as close as possible to a normal distribution (Box and Cox 1964; Carroll and Ruppert 1981). At the core of the Box-Cox transformation is an exponent, lambda (\(\lambda\)), which varies from -5 to 5. All values of \(\lambda\) are considered and the optimal value for the given data is estimated from the training data; the “optimal value” is the one which results in the best transformation to an approximate normal distribution. The transformation of the response \(Y\) has the form:

\[ \begin{equation} y(\lambda) = \begin{cases} \frac{Y^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\ \log\left(Y\right), & \text{if}\ \lambda = 0. \end{cases} \end{equation} \]

If your response has negative values, the Yeo-Johnson transformation is very similar to the Box-Cox but does not require the input variables to be strictly positive.
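To build intuition for how \(\lambda\) is estimated, here is a minimal sketch that estimates it directly from the training response using SciPy (an assumption on our part; SciPy is not loaded in the prerequisites, and this estimation is handled for us later by PowerTransformer):

# estimate the Box-Cox lambda that makes Sale_Price look most normal;
# boxcox() returns the transformed values and the estimated lambda
from scipy import stats

sale_price_bc, lmbda = stats.boxcox(train["Sale_Price"])

# a lambda near 0 corresponds to a log-like transformation (see the equation above)
round(lmbda, 3)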

The plot below illustrates that the log transformation and the Box-Cox transformation both do about equally well in transforming Sale_Price to look more normally distributed.

Figure: Response variable transformations.

In Python we use TransformedTargetRegressor() to build a plan for target engineering. This will not return the actual log/box-cox transformed values but, rather, a blueprint to be applied later. In this example we are simply building an object that will apply a Box-Cox transformation to the target variable when we fit our model.

There is a sklearn.preprocessing.power_transform function that can be applied to immediately transform an array. However, later in this module we’ll discuss the idea of data leakage and how important it is to create isolated pre-processing steps.

tt = TransformedTargetRegressor(transformer=PowerTransformer(method='box-cox'))
tt
## TransformedTargetRegressor(transformer=PowerTransformer(method='box-cox'))
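When we get to the modeling step, this target transformer is paired with an estimator; a quick sketch of what that looks like (KNeighborsRegressor is just a placeholder model here):

# the Box-Cox transform is estimated on y during fit() and automatically
# inverted when calling predict(), so predictions come back in dollars
tt_knn = TransformedTargetRegressor(
    regressor=KNeighborsRegressor(),
    transformer=PowerTransformer(method='box-cox')
)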

In R we use recipes::recipe() to build a “recipe” of pre-processing steps. This will not return the actual log transformed values but, rather, a blueprint to be applied later.

# log transformation
ames_recipe <- recipe(Sale_Price ~ ., data = train) %>%
  step_log(all_outcomes())

ames_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Operations:
## 
## Log transformation on all_outcomes()

Dealing with missingness

Data quality is an important issue for any project involving analyzing data. Data quality issues deserve an entire book in their own right, and a good reference is The Quartz guide to bad data. One of the most common data quality concerns you will run into is missing values.

Data can be missing for many different reasons; however, these reasons are usually lumped into two categories: informative missingness (Kuhn and Johnson 2013) and missingness at random (Little and Rubin 2014). Informative missingness implies a structural cause for the missing value that can provide insight in its own right; whether this be deficiencies in how the data was collected or abnormalities in the observational environment. Missingness at random implies that missing values occur independent of the data collection process2.

The category that drives missing values will determine how you handle them. For example, we may give values that are driven by informative missingness their own category (e.g., "None") as their unique value may affect predictive performance, whereas values that are missing at random may deserve deletion3 or imputation.

Furthermore, different machine learning models handle missingness differently. Most algorithms cannot handle missing values (e.g., generalized linear models and their cousins, neural networks, and support vector machines) and, therefore, require them to be dealt with beforehand. A few models (mainly tree-based) have built-in procedures to deal with missing values. However, since the modeling process involves comparing and contrasting multiple models to identify the optimal one, you will want to handle missing values prior to applying any models so that your algorithms are based on the same data quality assumptions.

Visualizing missing values

It is important to understand the distribution of missing values (i.e., NA) in any data set. So far, we have been using a pre-processed version of the Ames housing data set. However, if we use the raw Ames housing data, there are actually 13,997 missing values; there is at least one missing value in each row of the original data! Good visualizations can help us understand patterns in the missing data.

ames_raw = pd.read_csv("data/ames_raw.csv")

# count missing values
ames_raw.isnull().sum()
## Order               0
## PID                 0
## MS SubClass         0
## MS Zoning           0
## Lot Frontage      490
##                  ... 
## Mo Sold             0
## Yr Sold             0
## Sale Type           0
## Sale Condition      0
## SalePrice           0
## Length: 82, dtype: int64

Check out the missingno package to see all the great ways to visualize missing data!

# can you identify patterns of missing data
# missingness is represented with white
msno.matrix(ames_raw, labels=True, filter="bottom", sort="ascending", n=50)

# which features have most missing?
# this chart shows the number of observations so small bars (i.e. Pool QC)
# represent very few observed values (lots of missingness)
msno.bar(ames_raw, labels=True, filter="bottom", sort="ascending", n=50)

ames_raw <- read_csv("data/ames_raw.csv")

# total missing values
sum(is.na(ames_raw))
## [1] 14010

Check out the naniar package to see all the great ways to visualize missing data!

# can you identify patterns of missing data?
# missingness is represented with black
vis_miss(ames_raw)

# which features have most missing?
gg_miss_var(ames_raw)

Imputation

Imputation is the process of replacing a missing value with a substituted, “best guess” value. Imputation should be one of the first feature engineering steps you take as it will affect any downstream pre-processing4.

Estimated statistic

An elementary approach to imputing missing values for a feature is to compute descriptive statistics such as the mean, median, or mode (for categorical features) and use that value to replace NAs. Although computationally efficient, this approach does not consider any other attributes for a given observation when imputing (e.g., a female patient who is 63 inches tall may have her weight imputed as 175 lbs since that is the average weight across all observations, which contain 65% males with an average height of 70 inches).

An alternative is to use grouped statistics to capture expected values for observations that fall into similar groups. However, this becomes infeasible for larger data sets. Modeling imputation can automate this process for you and the two most common methods include K-nearest neighbor and tree-based imputation, which are discussed next.

However, it is important to remember that imputation should be performed within the resampling process and as your data set gets larger, repeated model-based imputation can compound the computational demands. Thus, you must weigh the pros and cons of the two approaches.
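As a quick illustration of the grouped-statistic idea mentioned above, here is a pandas sketch that imputes a numeric feature with its neighborhood-level median (illustrative only; the feature and grouping choice are just examples, and in practice this logic belongs inside the resampling pipeline):

# impute missing Lot Frontage values with the median Lot Frontage of the
# observation's own Neighborhood rather than the global median
ames_grouped = ames_raw.copy()
ames_grouped["Lot Frontage"] = (
    ames_grouped
    .groupby("Neighborhood")["Lot Frontage"]
    .transform(lambda s: s.fillna(s.median()))
)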

The following code snippet shows how to impute different features with the median value of a given feature.

These imputers do not yet perform imputation. At the end of this module we will show you how all these pieces of the feature engineering process come together in our ML modeling pipeline.

In Python, the SimpleImputer can be used to apply an imputation to all features. However, if you only want to apply imputation to a subset of features then you would use ColumnTransformer in addition to the SimpleImputer strategy.

# median imputation to all features
a = SimpleImputer(strategy='median')

# median imputation to just numeric predictors
b = ColumnTransformer([("num_imp", a, selector(dtype_include="number"))])

# median imputation to 1 or more features
c = ColumnTransformer([("num_imp", a, selector("Gr_Liv_Area"))])

See more imputation options here.

# median imputation to all features
a <- ames_recipe %>%
  step_medianimpute(all_predictors())

# median imputation to just numeric predictors
b <- ames_recipe %>%
  step_medianimpute(all_predictors(), all_numeric())

# median imputation to 1 or more features
c <- ames_recipe %>%
  step_medianimpute(Gr_Liv_Area)

K-nearest neighbor

K-nearest neighbor (KNN) imputes values by identifying observations with missing values, then identifying other observations that are most similar based on the other available features, and using the values from these nearest neighbor observations to impute missing values.

We discuss KNN for predictive modeling in a later module; the imputation application works in a similar manner. In KNN imputation, the missing value for a given observation is treated as the target response and is predicted based on the average (for quantitative values) or the mode (for qualitative values) of the k nearest neighbors.

As discussed in the KNN modeling module, if all features are quantitative then standard Euclidean distance is commonly used as the distance metric to identify the k neighbors and when there is a mixture of quantitative and qualitative features then Gower’s distance (Gower 1971) can be used. KNN imputation is best used on small to moderate sized data sets as it becomes computationally burdensome with larger data sets (Kuhn and Johnson 2019).

As we saw in the last module, k is a tunable hyperparameter. Suggested values for imputation are 5–10 (Kuhn and Johnson 2019).

knn_imp = KNNImputer(n_neighbors=6)
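For a quick look at what the KNN imputer produces outside of a pipeline (a sketch; KNNImputer only accepts numeric input, so we restrict to the numeric columns of the raw data):

# KNN imputation on the numeric columns of the raw Ames data
num = ames_raw.select_dtypes("number")
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=6).fit_transform(num),
    columns=num.columns,
)

# formerly-missing Lot Frontage values are now filled in
imputed["Lot Frontage"].isnull().sum()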

knn_imp <- ames_recipe %>%
  step_knnimpute(all_predictors(), neighbors = 6)

Tree-based

Several implementations of decision trees and their derivatives can be constructed in the presence of missing values. Thus, they provide a good alternative for imputation. As discussed in later modules, single trees have high variance but aggregating across many trees creates a robust, low variance predictor. Random forest imputation procedures have been studied (Shah et al. 2014; Stekhoven 2015); however, they require significant computational demands in a resampling environment (Kuhn and Johnson 2019). Bagged trees offer a compromise between predictive accuracy and computational burden.

Don’t worry, we’ll discuss decision trees and variants of them in depth in later modules.

Similar to KNN imputation, observations with missing values are identified and the feature containing the missing value is treated as the target and predicted using bagged decision trees.

Unfortunately, Python does not provide a tree-based approach for imputation; this would have to be done manually. However, KNN and tree-based imputation tend to perform similarly.
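If you did want to approximate this manually in Python, one option (a sketch, and an assumption on our part rather than part of this course's pipeline) is scikit-learn's experimental IterativeImputer with a tree-based estimator; this is model-based imputation with trees, though not identical to the bagged-tree imputation used by the R recipe below.

# IterativeImputer is experimental and must be explicitly enabled
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# model each feature with missing values as a function of the other features,
# using a small ensemble of trees as the per-feature estimator
tree_imp = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=123),
    max_iter=10,
    random_state=123,
)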

ames_recipe %>%
  step_bagimpute(all_predictors())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Operations:
## 
## Log transformation on all_outcomes()
## Bagged tree imputation for all_predictors()

Which to use?

The plot below illustrates the differences between mean, KNN, and tree-based imputation on the raw Ames housing data. It is apparent how descriptive statistic methods (e.g., using the mean and median) are inferior to the KNN and tree-based imputation methods.

When possible use KNN or tree-based imputation. When data sets become extremely large tree-based imputation can be more efficient.

Figure: Comparison of three different imputation methods. The red points represent actual values which were removed and made missing and the blue points represent the imputed values. Estimated statistic imputation methods (i.e. mean, median) merely predict the same value for each observation and can reduce the signal between a feature and the response; whereas KNN and tree-based procedures tend to maintain the feature distribution and relationship.

Feature filtering

In many data analyses and modeling projects we end up with hundreds or even thousands of collected features. From a practical perspective, a model with more features often becomes harder to interpret and is costly to compute. Some models are more resistant to non-informative predictors (e.g., the Lasso and tree-based methods) than others as illustrated below.5

Figure: Test set RMSE profiles when non-informative predictors are added.

Although the performance of some of our models is not significantly affected by non-informative predictors, the time to train these models can be negatively impacted as more features are added. The plot below shows the increase in time to perform 10-fold CV on the exemplar data, which consists of 10,000 observations. We see that many algorithms (e.g., elastic nets, random forests, and gradient boosting machines) become extremely time intensive the more predictors we add. Consequently, filtering or reducing features prior to modeling may significantly speed up training time.

Figure: Impact in model training time as non-informative predictors are added.

Zero and near-zero variance variables are low-hanging fruit to eliminate. Zero variance variables, meaning the feature only contains a single unique value, provide no useful information to a model. Some algorithms are unaffected by zero variance features. However, features that have near-zero variance also offer very little, if any, information to a model. Furthermore, they can cause problems during resampling as there is a high probability that a given sample will only contain a single unique value (the dominant value) for that feature.

A rule of thumb for detecting near-zero variance features is to identify and remove features where \(\leq 5-10\)% of the values are unique.

For the Ames data, we do not have any zero variance predictors but there are many features that meet the near-zero threshold. The following shows all features where less than 5% of the values are distinct.

##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  226.66667    0.09760859   FALSE TRUE
## 2               Alley   24.25316    0.14641288   FALSE TRUE
## 3           Utilities 1023.00000    0.14641288   FALSE TRUE
## 4          Land_Slope   22.15909    0.14641288   FALSE TRUE
## 5        Land_Contour   19.50000    0.19521718   FALSE TRUE
## 6       Kitchen_AbvGr   21.23913    0.19521718   FALSE TRUE
## 7             Pool_QC  509.75000    0.24402147   FALSE TRUE
## 8        Misc_Feature   34.18966    0.24402147   FALSE TRUE
## 9           Bsmt_Cond   20.24444    0.29282577   FALSE TRUE
## 10            Heating  106.00000    0.29282577   FALSE TRUE
## 11        Condition_2  202.60000    0.34163006   FALSE TRUE
## 12     BsmtFin_Type_2   25.85294    0.34163006   FALSE TRUE
## 13          Roof_Matl  144.35714    0.39043436   FALSE TRUE
## 14         Functional   38.89796    0.39043436   FALSE TRUE
## 15          Pool_Area 2039.00000    0.53684724   FALSE TRUE
## 16 Three_season_porch  673.66667    1.12249878   FALSE TRUE
## 17    Low_Qual_Fin_SF 1010.50000    1.31771596   FALSE TRUE
## 18           Misc_Val  180.54545    1.56173743   FALSE TRUE
## 19       Screen_Porch  169.90909    4.63640800   FALSE TRUE
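The table above comes from a near-zero variance diagnostic computed in R; a rough pandas equivalent is sketched below (the freq_cut and unique_cut defaults are assumptions that mirror the frequency-ratio and percent-unique logic behind the table):

def nzv_diagnostics(df, freq_cut=95/5, unique_cut=10):
    """Flag features whose most common value dwarfs the second most common
    (freqRatio) and which have few unique values relative to n (percentUnique)."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts()
        freq_ratio = counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else np.inf
        pct_unique = 100 * df[col].nunique() / len(df)
        rows.append({
            "rowname": col,
            "freqRatio": freq_ratio,
            "percentUnique": pct_unique,
            "zeroVar": df[col].nunique() == 1,
            "nzv": (freq_ratio > freq_cut) and (pct_unique < unique_cut),
        })
    return pd.DataFrame(rows)

# e.g. near-zero variance candidates in the training data
# nzv_diagnostics(train).query("nzv")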

Other feature filtering methods exist; see Saeys, Inza, and Larrañaga (2007) for a thorough review. Furthermore, several wrapper methods exist that evaluate multiple models using procedures that add or remove predictors to find the optimal combination of features that maximizes model performance (see, for example, Kursa, Rudnicki, and others (2010), Granitto et al. (2006), Maldonado and Weber (2009)). However, this topic is beyond the scope of this course.

See other feature filtering functions here.

nzv = VarianceThreshold(threshold=0.1)

See other feature filtering functions here.

nzv <- ames_recipe %>%
  step_nzv(all_predictors(), unique_cut = 10) # threshold as % (10%)

Numeric feature engineering

Numeric features can create a host of problems for certain models when their distributions are skewed, contain outliers, or have a wide range in magnitudes. Tree-based models are quite immune to these types of problems in the feature space, but many other models (e.g., GLMs, regularized regression, KNN, support vector machines, neural networks) can be greatly hampered by these issues. Normalizing and standardizing heavily skewed features can help minimize these concerns.

Skewness

Similar to the process discussed to normalize target variables, parametric models that have distributional assumptions (e.g., GLMs, and regularized models) can benefit from minimizing the skewness of numeric features. When normalizing many variables, it’s best to use the Box-Cox (when feature values are strictly positive) or Yeo-Johnson (when feature values are not strictly positive) procedures as these methods will identify if a transformation is required and what the optimal transformation will be.

Non-parametric models are rarely affected by skewed features; however, normalizing features will not have a negative effect on these models’ performance. For example, normalizing features will only shift the optimal split points in tree-based algorithms. Consequently, when in doubt, normalize.

# Normalizing approach
yj = PowerTransformer(method="yeo-johnson")

# Normalize all numeric features
X_norm = ColumnTransformer([("norm", yj, selector(dtype_include="number"))])

# Normalize all numeric features
X_norm <- ames_recipe %>%
  step_YeoJohnson(all_predictors(), all_numeric())                 

Standardization

We must also consider the scale on which the individual features are measured. What are the largest and smallest values across all features and do they span several orders of magnitude? Models that incorporate smooth functions of input features are sensitive to the scale of the inputs. For example, \(5X+2\) is a simple linear function of the input X, and the scale of its output depends directly on the scale of the input. Many algorithms use linear functions within their algorithms, some more obvious (e.g., GLMs and regularized regression) than others (e.g., neural networks, support vector machines, and principal components analysis). Other examples include algorithms that use distance measures such as the Euclidean distance (e.g., k nearest neighbor, k-means clustering, and hierarchical clustering).

For these models and modeling components, it is often a good idea to standardize the features. Standardizing features includes centering and scaling so that numeric variables have zero mean and unit variance, which provides a common comparable unit of measure across all the variables.

Figure: Standardizing features allows all features to be compared on a common value scale regardless of their real value differences.

# Standardizing approach
scaler = StandardScaler()

# standardize all numeric features
std = ColumnTransformer([("norm", scaler, selector(dtype_include="number"))])

# standardize all numeric features
std <- ames_recipe %>%
  step_center(all_predictors(), all_numeric()) %>%
  step_scale(all_predictors(), all_numeric())

Categorical feature engineering

Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from pre-processing categorical features. The following sections will discuss a few of the more common approaches to engineer categorical features.

One-hot & dummy encoding

There are many ways to recode categorical variables as numeric. The most common is referred to as one-hot encoding, where we transpose our categorical variables so that each level of the feature is represented as a boolean value. For example, one-hot encoding the left data frame in the below figure results in X being converted into three columns, one for each level. This is called less than full rank encoding. However, this creates perfect collinearity which causes problems with some predictive modeling algorithms (e.g., ordinary linear regression and neural networks). Alternatively, we can create a full-rank encoding by dropping one of the levels (level c has been dropped). This is referred to as dummy encoding.

Figure: Eight observations containing a categorical feature X and the difference in how one-hot and dummy encoding transforms this feature.

# one-hot encoder
encoder = OneHotEncoder()

# apply to all categorical features
ohe = ColumnTransformer([("one-hot", encoder, selector(dtype_include="object"))])

# dummy encode
encoder = OneHotEncoder(drop='first')

# apply to all categorical features
de = ColumnTransformer([("dummy", encoder, selector(dtype_include="object"))])

# one-hot encode
ohe <- ames_recipe %>%
  step_dummy(all_nominal(), one_hot = TRUE)

# dummy encode
de <- ames_recipe %>%
  step_dummy(all_nominal(), one_hot = FALSE)
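As a quick sanity check of the difference (a sketch; assumes a reasonably recent scikit-learn with get_feature_names_out()), dummy encoding should produce exactly one fewer column per categorical feature than one-hot encoding:

# compare output widths of one-hot vs. dummy encoding on the categorical features
cats = X_train.select_dtypes("object")

n_onehot = OneHotEncoder().fit(cats).get_feature_names_out().size
n_dummy = OneHotEncoder(drop='first').fit(cats).get_feature_names_out().size

# the difference equals the number of categorical features (one dropped level each)
print(n_onehot, n_dummy, n_onehot - n_dummy == cats.shape[1])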

Label encoding

Label encoding is a pure numeric conversion of the levels of a categorical variable. For example, the MS_SubClass variable has 16 levels, which if recoded with label encoding would convert them to integers based on alphabetical order of the feature values.

## # A tibble: 15 x 2
##    original_levels                           label_encoded_levels
##    <chr>                                                    <dbl>
##  1 Duplex_All_Styles_and_Ages                                   1
##  2 One_and_Half_Story_Finished_All_Ages                         2
##  3 One_and_Half_Story_Unfinished_All_Ages                       3
##  4 One_Story_1945_and_Older                                     4
##  5 One_Story_1946_and_Newer_All_Styles                          5
##  6 One_Story_PUD_1946_and_Newer                                 6
##  7 One_Story_with_Finished_Attic_All_Ages                       7
##  8 PUD_Multilevel_Split_Level_Foyer                             8
##  9 Split_Foyer                                                  9
## 10 Split_or_Multilevel                                         10
## 11 Two_and_Half_Story_All_Ages                                 11
## 12 Two_Family_conversion_All_Styles_and_Ages                   12
## 13 Two_Story_1945_and_Older                                    13
## 14 Two_Story_1946_and_Newer                                    14
## 15 Two_Story_PUD_1946_and_Newer                                15

# scikit-learn's LabelEncoder is designed for target labels and cannot be used
# inside a ColumnTransformer, so we use its feature-oriented counterpart
from sklearn.preprocessing import OrdinalEncoder as SklearnOrdinalEncoder

# create encoder (integer codes assigned in alphabetical order of the levels)
encoder = SklearnOrdinalEncoder()

# Label encode a single column
lbl = ColumnTransformer([("label", encoder, ["MS_SubClass"])])

# Label encode a single column
lbl <- ames_recipe %>%
  step_integer(MS_SubClass)

Ordinal encoding

We should be careful with label encoding unordered categorical features because most models will treat them as ordered numeric features. If a categorical feature is naturally ordered then label encoding is a natural choice (most commonly referred to as ordinal encoding). For example, the various quality features in the Ames housing data are ordinal in nature (ranging from Very_Poor to Very_Excellent).

## # A tibble: 2,049 x 5
##    Overall_Qual  Exter_Qual Bsmt_Qual   Kitchen_Qual Garage_Qual
##    <chr>         <chr>      <chr>       <chr>        <chr>      
##  1 Average       Typical    Typical     Typical      Typical    
##  2 Above_Average Typical    Typical     Typical      Typical    
##  3 Good          Typical    Good        Typical      Typical    
##  4 Above_Average Typical    Good        Typical      Typical    
##  5 Average       Typical    Typical     Typical      Typical    
##  6 Above_Average Typical    No_Basement Typical      Typical    
##  7 Average       Typical    Typical     Typical      Typical    
##  8 Average       Typical    Typical     Typical      Typical    
##  9 Average       Typical    Typical     Typical      No_Garage  
## 10 Average       Typical    Typical     Typical      Typical    
## # … with 2,039 more rows

Ordinal encoding these features provides a natural and intuitive interpretation and can logically be applied to all models.

Scikit-learn has an OrdinalEncoder; however, it is not very flexible in allowing you to easily specify levels across many features. However, the Category Encoders package provides several scikit-learn style transformers including an OrdinalEncoder that is more flexible.

# ID all quality features to ordinal encode
cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)

# specify levels in order
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical", 
       "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))

# create a level to integer mapping
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in cols]

# example of first two mappings
category_mapping[0:2]
## [{'col': 'Overall_Qual', 'mapping': {'Very_Poor': 0, 'Poor': 1, 'Fair': 2, 'Below_Average': 3, 'Average': 4, 'Typical': 5, 'Above_Average': 6, 'Good': 7, 'Very_Good': 8, 'Excellent': 9, 'Very_Excellent': 10}}, {'col': 'Overall_Cond', 'mapping': {'Very_Poor': 0, 'Poor': 1, 'Fair': 2, 'Below_Average': 3, 'Average': 4, 'Typical': 5, 'Above_Average': 6, 'Good': 7, 'Very_Good': 8, 'Excellent': 9, 'Very_Excellent': 10}}]
# Apply ordinal encoder
cat_encoder = OrdinalEncoder(cols=cols, mapping=category_mapping)

OrdinalEncoder will automatically convert any levels not in the specified levels to a -1.

# specify levels in order
lvls <- c("Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical", 
          "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent")

# apply ordinal encoding to quality features
ord_lbl <- ames_recipe %>%
  # 1. convert quality features to factors
  step_string2factor(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
    levels = lvls
    ) %>% 
  # 2. convert any missed levels (i.e., no_pool, no_garage) to "None"
  step_unknown(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"), 
    new_level = "None"
    ) %>%
  # 3. move "None" level to front ('None', 'Very_Poor', ..., 'Very_Excellent')
  step_relevel(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"), 
    ref_level = "None"
    ) %>%
  step_integer(ends_with("Qual"), ends_with("QC"), ends_with("_Cond"))

Lumping

Sometimes features will contain levels that have very few observations. For example, there are 28 unique neighborhoods represented in the Ames housing data but several of them only have a few observations.

## # A tibble: 28 x 2
##    Neighborhood            n
##    <chr>               <int>
##  1 Landmark                1
##  2 Green_Hills             2
##  3 Greens                  3
##  4 Blueste                 8
##  5 Veenker                15
##  6 Northpark_Villa        17
##  7 Bloomington_Heights    18
##  8 Meadow_Village         22
##  9 Briardale              23
## 10 Clear_Creek            26
## # … with 18 more rows

Sometimes we can benefit from collapsing, or “lumping,” these into a smaller number of categories. In the above examples, we may want to collapse all levels that are observed in less than 10% of the training sample into an “other” category. We can use step_other() in R (or RareLabelEncoder in Python) to do so. However, lumping should be used sparingly as there is often a loss in model performance (Kuhn and Johnson 2013).

Tree-based models often perform exceptionally well with high cardinality features and are not as impacted by levels with small representation.

The following code snippets show how to lump all neighborhoods that represent less than 1% of observations into an “other” category.

# create rare label encoder
rare_encoder = RareLabelEncoder(tol=0.01, replace_with="other")
# demonstrate how some neighborhoods are now represented by "other"
rare_encoder.fit_transform(X_train)["Neighborhood"].unique()
## array(['Mitchell', 'Sawyer', 'Edwards', 'College_Creek', 'Meadow_Village',
##        'Northridge_Heights', 'Northwest_Ames', 'Northridge', 'Brookside',
##        'Old_Town', 'Sawyer_West', 'Gilbert',
##        'South_and_West_of_Iowa_State_University', 'Somerset', 'Crawford',
##        'North_Ames', 'Clear_Creek', 'Iowa_DOT_and_Rail_Road',
##        'Timberland', 'other', 'Stone_Brook', 'Bloomington_Heights'],
##       dtype=object)

# Lump levels for two features
rare_encoder <- ames_recipe %>%
  step_other(Neighborhood, threshold = 0.01, other = "other")

Alternatives

There are several alternative categorical encodings that are implemented in both R and Python that are worth exploring. For example, target encoding is the process of replacing a categorical value with the mean (regression) or proportion (classification) of the target variable. Count/frequency encoding replaces categories with the number of observations or percentage of observations per category. Hash encoding converts strings to arbitrary digits (similar to label encoding but more robust).
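To make these alternatives concrete, here is a minimal sketch using the category_encoders package already loaded above (the column choice and parameters are just examples, and the CountEncoder/HashingEncoder classes assume a recent version of that package):

from category_encoders import TargetEncoder, CountEncoder, HashingEncoder

# target encoding: replace each Neighborhood level with the mean Sale_Price
target_enc = TargetEncoder(cols=["Neighborhood"])

# count/frequency encoding: replace each level with how often it occurs
count_enc = CountEncoder(cols=["Neighborhood"])

# hash encoding: map levels into a fixed number of hashed numeric columns
hash_enc = HashingEncoder(cols=["Neighborhood"], n_components=8)

# like the other encoders in this module, these are blueprints that get fit
# inside the modeling pipeline, e.g. target_enc.fit_transform(X_train, y_train)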

Dimension reduction

Dimension reduction is an alternative approach to filter out non-informative features without manually removing them. This course does not focus in depth on dimension reduction; however, we wanted to highlight that it is very common to include these types of dimension reduction approaches during the feature engineering process. For example, we may wish to reduce the dimension of our features with principal components analysis and retain the number of components required to explain, say, 95% of the variance and use these components as features in downstream modeling.

# PCA object - keep 25 components
pca = PCA(n_components=25)

# apply PCA to all numeric features
pca_encoder = ColumnTransformer([("pca", pca, selector(dtype_include="number"))])

pca_encoder <- ames_recipe %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = .95)
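On the Python side, if you would rather specify the proportion of variance to retain (mirroring the R threshold above) than a fixed number of components, scikit-learn's PCA also accepts a fraction; a minimal sketch:

# retain however many components are needed to explain ~95% of the variance
pca_95 = PCA(n_components=0.95)

# as with the R recipe above, standardize the numeric features before PCA
pca_95_encoder = ColumnTransformer([
  ("pca", Pipeline([("std", StandardScaler()), ("pca", pca_95)]),
   selector(dtype_include="number")),
])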

Proper implementation

We stated at the beginning of this module that we should think of feature engineering as creating a blueprint rather than manually performing each task individually. This helps us in two ways: (1) it forces us to think sequentially and (2) it ensures pre-processing is applied appropriately within the resampling process.

Sequential steps

Thinking of feature engineering as a blueprint forces us to think of the ordering of our pre-processing steps. Although each particular problem requires you to think of the effects of sequential pre-processing, there are some general suggestions that you should consider:

  • If using a log or Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive. Alternatively, use the Yeo-Johnson transformation so you don’t have to worry about this.
  • One-hot or dummy encoding typically results in sparse data which many algorithms can operate on efficiently. If you standardize sparse data you will create dense data and you lose the computational efficiency. Consequently, it’s often preferred to standardize your numeric features and then one-hot/dummy encode.
  • If you are lumping infrequently occurring categories together, do so before one-hot/dummy encoding.
  • Although you can perform dimension reduction procedures on categorical features, it is common to primarily do so on numeric features when doing so for feature engineering purposes.

While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:

  1. Filter out zero or near-zero variance features.
  2. Perform imputation if required.
  3. Normalize to resolve numeric feature skewness.
  4. Standardize (center and scale) numeric features.
  5. Perform dimension reduction (e.g., PCA) on numeric features.
  6. One-hot or dummy encode categorical features.

Data leakage

Data leakage is when information from outside the training data set is used to create the model. Data leakage often occurs during the data pre-processing stage. To minimize this, feature engineering should be done in isolation within each resampling iteration. Recall that resampling allows us to estimate the generalizable prediction error. Therefore, we should apply our feature engineering blueprint to each resample independently as illustrated below. That way we are not leaking information from one data set to another (each resample is designed to act as isolated training and test data).

Figure: Performing feature engineering pre-processing within each resample helps to minimize data leakage.

For example, when standardizing numeric features, each resampled training data set should use its own mean and variance estimates, and these specific values should be applied to the corresponding resampled test set. This imitates how real-life prediction occurs where we only know our current data’s mean and variance estimates; therefore, when new data arrive for prediction, we assume the feature values follow the same distribution as what we’ve seen in the past.
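To make this concrete, here is a minimal sketch of leakage-free standardization across resamples (illustrative only; the Pipeline approach shown in the next section automates exactly this):

# statistics are estimated on each resample's training portion only,
# then applied to both that training portion and its held-out portion
num_features = X_train.select_dtypes("number")
cv = KFold(n_splits=10, shuffle=True, random_state=123)

for train_idx, test_idx in cv.split(num_features):
    scaler = StandardScaler().fit(num_features.iloc[train_idx])  # fit on the fold's training data
    fold_train = scaler.transform(num_features.iloc[train_idx])  # apply to training portion
    fold_valid = scaler.transform(num_features.iloc[test_idx])   # apply to held-out portion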

Putting the process together

To illustrate how this process comes together in code, let’s redo the simple assessment of the Ames data set that we performed at the end of the last module. However, this time we will start with all features and apply a sequence of simple feature engineering steps. Specifically, our feature engineering steps will include:

  1. Remove near-zero variance features that are categorical (aka nominal).
  2. Ordinal encode our quality-based features (which are inherently ordinal).
  3. Center and scale (i.e., standardize) all numeric features.
  4. Perform dimension reduction by applying PCA to all numeric features.
  5. Collapse (aka lump) categorical levels with small representation.
  6. One-hot encode remaining categorical features.

First, let’s create our

  1. train/test split
  2. resampling procedure
  3. model object
  4. hyperparameter grid

These items are nothing new and are similar to our approach in the last module.

# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)

# separate features from labels
X_train = train.drop("Sale_Price", axis=1)
y_train = train["Sale_Price"]

# create KNN model object
knn = KNeighborsRegressor()

# define loss function
loss = 'neg_root_mean_squared_error'

# create 10 fold CV object
kfold = KFold(n_splits=10, random_state=123, shuffle=True)

# Create grid of hyperparameter values
hyper_grid = {'knn__n_neighbors': range(2, 26)}

Next, we create our feature engineering recipe with the 6 steps mentioned above. In Python we can use the ColumnTransformer to specify the steps in order. ColumnTransformer takes a list of tuples (<step name>, <encoder>, <columns to apply encoder to>) in the order that you want to apply them. The remainder parameter allows you to keep any columns not mentioned in the encoding steps otherwise by default they will be dropped.

# 1. Remove near-zero variance features that are categorical  
nzv_encoder = VarianceThreshold(threshold=0.1)

# 2. Ordinal encode our quality-based features 
ord_cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical", 
       "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in ord_cols]
ord_encoder = OrdinalEncoder(cols=ord_cols, mapping=category_mapping)

# 3. Center and scale (i.e., standardize) all numeric features
scaler = StandardScaler()

# 4. Perform dimension reduction by applying PCA to all numeric features
pca = PCA(n_components=30)

# 5. One-hot encode remaining categorical features.
encoder = OneHotEncoder(handle_unknown="ignore")

# combine all steps into a pre-processing pipeline
preprocessor = ColumnTransformer(
  remainder="passthrough",
  transformers=[
  ("nzv_encode", nzv_encoder, selector(dtype_include="number")),
  ("ord_encode", ord_encoder, ord_cols),
  ("std_encode", scaler, selector(dtype_include="number")),
  ("pca_encode", pca, selector(dtype_include="number")),
  ("one-hot", encoder, selector(dtype_include="object")),
  ])

We can now create a modeling pipeline that includes the pre-processing step and then the modeling step. Similar to ColumnTransformer, Pipeline takes tuples with the name and object for each step in the order you want them applied.

model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("knn", knn),
])

Now we can build our GridSearchCV object with our model process objects. This is where the feature engineering actually happens as grid_search.fit will apply the feature engineering steps individually within each of the kfold resamples so there is no data leakage.

# Tune a knn model using grid search
grid_search = GridSearchCV(model_pipeline, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train, y_train)

# Best model's cross validated RMSE
abs(results.best_score_)
## 41915.58187389292
# Best model's k value
results.best_estimator_.get_params().get('knn__n_neighbors')
## 5
# Plot all RMSE results
all_rmse = pd.DataFrame({'k': range(2, 26), 
                         'RMSE': np.abs(results.cv_results_['mean_test_score'])})

(ggplot(all_rmse, aes(x='k', y='RMSE'))
 + geom_line()
 + geom_point()
 + ggtitle("Cross validated grid search results"))
## <ggplot: (320319648)>

First, let’s create our

  1. train/test split
  2. resampling procedure
  3. model object
  4. hyperparameter grid

These items are nothing new and are similar to our approach in the last module.

# create train/test split
set.seed(123)  # for reproducibility
split  <- initial_split(ames, prop = 0.7)
train  <- training(split)
test   <- testing(split)

# create resampling procedure
kfold <- vfold_cv(train, v = 10)

# model object
knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Create grid of hyperparameter values
hyper_grid <- expand.grid(neighbors = seq(2, 25, by = 1))

Next, we create our model and feature engineering recipe with the 6 steps mentioned above.

# model recipe with feature engineering steps
model_form <- recipe(Sale_Price ~ ., data = train) %>%
  # 1. Remove near-zero variance features that are categorical 
  step_nzv(all_nominal()) %>%
  # 2. Ordinal encode our quality-based features
  step_string2factor(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
    levels = lvls
    ) %>% 
  step_unknown(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"), 
    new_level = "None"
    ) %>%
  step_relevel(
    ends_with("Qual"), ends_with("QC"), ends_with("_Cond"), 
    ref_level = "None"
    ) %>%
  step_integer(ends_with("Qual"), ends_with("QC"), ends_with("_Cond")) %>%
  # 3. Center and scale (i.e., standardize) all numeric features
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  # 4. Perform dimension reduction by applying PCA to all numeric features.
  step_pca(all_numeric(), -all_outcomes(), num_comp = 30) %>%
  # 5. Collapse categorical levels with small representation
  step_other(all_nominal(), threshold = 0.01, other = "other") %>%
  # 6. One-hot encode the remaining categorical features.
  step_dummy(all_nominal(), one_hot = TRUE)

Now we can run tune_grid and supply it all our objects. This is where the feature engineering actually happens as tune_grid will apply the feature engineering steps individually within each of the kfold resamples so there is no data leakage.

# Tune a knn model using grid search
results <- tune_grid(knn, model_form, resamples = kfold, grid = hyper_grid)

# best model
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
##   neighbors .metric .estimator   mean     n std_err .config              
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
## 1        14 rmse    standard   39234.     9   1977. Preprocessor1_Model13
## 2        15 rmse    standard   39235.     9   1981. Preprocessor1_Model14
## 3        16 rmse    standard   39246.     9   1985. Preprocessor1_Model15
## 4        13 rmse    standard   39252.     9   1972. Preprocessor1_Model12
## 5        17 rmse    standard   39264.     9   1990. Preprocessor1_Model16

# plot results
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(neighbors, mean)) +
  geom_line() +
  geom_point() +
  labs(x = "k", y = "RMSE", title = "Cross validated grid search results")

Exercises

Using the Ames dataset and the same approach shown in the last section:

  1. Rather than use a 70% stratified training split, try an 80% unstratified training split. How do your cross-validated results compare?

  2. Rather than numerically encode the quality and condition features (i.e. step_integer(matches("Qual|Cond|QC"))), one-hot encode these features. What is the difference in the number of features in your training set? Apply the same cross-validated KNN model to this new feature set. How does the performance change? How does the training time change?

  3. Identify three new feature engineering steps that are provided by recipes, scikit-learn or some other open source R/Python package:

    • Why would these feature engineering steps be applicable to the Ames data?
    • Apply these feature engineering steps along with the same cross-validated KNN model. How do your results change?
  4. Compare the Python and R pre-processing steps in the last section. Can you spot any discrepancies? If so, resolve them so the pipelines are performing the same steps.

  5. Using the Attrition data set, assess the characteristics of the target and features.

    • Which target/feature engineering steps should be applied?
    • Create a feature engineering pipeline and apply a KNN grid search. What is the performance of your model?


References

Box, George EP, and David R Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society. Series B (Methodological), 211–52.
Breiman, Leo, and others. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231.
Carroll, Raymond J, and David Ruppert. 1981. “On Prediction and the Power Transformation Family.” Biometrika 68 (3): 609–15.
Gower, John C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics, 857–71.
Granitto, Pablo M, Cesare Furlanello, Franco Biasioli, and Flavia Gasperi. 2006. “Recursive Feature Elimination with Random Forest for PTR-MS Analysis of Agroindustrial Products.” Chemometrics and Intelligent Laboratory Systems 83 (2): 83–90.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
———. 2019. Feature Engineering and Selection: A Practical Approach for Predictive Models. Chapman & Hall/CRC.
Kursa, Miron B, Witold R Rudnicki, and others. 2010. “Feature Selection with the Boruta Package.” J Stat Softw 36 (11): 1–13.
Little, Roderick JA, and Donald B Rubin. 2014. Statistical Analysis with Missing Data. Vol. 333. John Wiley & Sons.
Maldonado, Sebastián, and Richard Weber. 2009. “A Wrapper Method for Feature Selection Using Support Vector Machines.” Information Sciences 179 (13): 2208–17.
Saeys, Yvan, Iñaki Inza, and Pedro Larrañaga. 2007. “A Review of Feature Selection Techniques in Bioinformatics.” Bioinformatics 23 (19): 2507–17.
Shah, Anoop D, Jonathan W Bartlett, James Carpenter, Owen Nicholas, and Harry Hemingway. 2014. “Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study.” American Journal of Epidemiology 179 (6): 764–74.
Stekhoven, Daniel J. 2015. “missForest: Nonparametric Missing Value Imputation Using Random Forest.” Astrophysics Source Code Library.
Zheng, Alice, and Amanda Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc.

  1. https://www.kaggle.com/c/house-prices-advanced-regression-techniques↩︎

  2. Little and Rubin (2014) discuss two different kinds of missingness at random; however, we combine them for simplicity as the nuanced differences between the two are rarely distinguished in practice.↩︎

  3. If your data set is large, deleting missing observations that have missing values at random rarely impacts predictive performance. However, as your data sets get smaller, preserving observations is critical and alternative solutions should be explored.↩︎

  4. For example, standardizing numeric features will include the imputed numeric values in the calculation and one-hot encoding will include the imputed categorical value.↩︎

  5. See Kuhn and Johnson (2013) Section 19.1 for data set generation.↩︎