Data pre-processing and engineering techniques generally refer to the addition, deletion, or transformation of data. The time spent on identifying data engineering needs can be significant and requires you to spend substantial time understanding your data…or as Leo Breiman said “live with your data before you plunge into modeling” (Breiman and others 2001, 201). Although this course primarily focuses on applying machine learning algorithms, feature engineering can make or break an algorithm’s predictive ability and deserves your continued focus and education.
We will not cover all the potential ways of implementing feature engineering; however, we’ll cover several fundamental pre-processing tasks that has the potential to significantly improve modeling performance. Moreover, different models have different sensitivities to the type of target and feature values in the model and we will try to highlight some of these concerns. For more in depth coverage of feature engineering, please refer to Kuhn and Johnson (2019) and Zheng and Casari (2018).
By the end of this module you will know:
# Helper packages
import missingno as msno
import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_density, geom_line, geom_point, ggtitle
# Modeling pre-processing with scikit-learn functionality
from sklearn.model_selection import train_test_split
from sklearn.compose import TransformedTargetRegressor
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
# Modeling pre-processing with non-scikit-learn packages
from category_encoders.ordinal import OrdinalEncoder
from feature_engine.encoding import RareLabelEncoder
# Modeling
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
# Ames housing data
ames = pd.read_csv("data/ames.csv")
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train[["Sale_Price"]]
# Helper packages
library(tidyverse)
library(naniar)
# Modeling process
library(tidymodels)
# Ames housing data
ames = read_csv("data/ames.csv")
# create training/test split
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
train <- training(split)
test <- testing(split)
Although not always a requirement, transforming the response variable can lead to predictive improvement, especially with parametric models which require that certain assumptions about the model be met. For instance, ordinary linear regression models assume that the prediction errors (and hence the response) are normally distributed. This is usually fine, except when the prediction target has heavy tails (i.e., outliers) or is skewed in one direction or the other. In these cases, the normality assumption likely does not hold. For example, as we saw in the data splitting section in the last module, the response variable for the Ames housing data (Sale_Price
) is right (or positively) skewed ranging from $12,789 to $755,000. A simple linear model, say \(\text{Sale_Price}=\beta_{0} + \beta_{1} \text{Year_Built} + \epsilon\), often assumes the error term \(\epsilon\) (and hence Sale_Price
) is normally distributed; fortunately, a simple log (or similar) transformation of the response can often help alleviate this concern as the below plot illustrates.
Furthermore, using a log (or other) transformation to minimize the response skewness can be used for shaping the business problem as well. For example, in the House Prices: Advanced Regression Techniques Kaggle competition1, which used the Ames housing data, the competition focused on using a log transformed Sale Price response because “…taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.” This would be an alternative to using the root mean squared logarithmic error (RMSLE) loss function as discussed in the last module.
There are three common approaches to help correct for positively skewed target variables:
normalize with a log transformation. This will transform most right skewed distributions to be approximately normal.
If your response has negative values or zeros then a log transformation will produce NaN
s and -Inf
s, respectively (you cannot take the logarithm of a negative number). If the nonpositive response values are small (say between -0.99 and 0) then you can apply a small offset such as in log1p()
which adds 1 to the value prior to applying a log transformation. If your data consists of values \(\le -1\), use the Yeo-Johnson transformation instead.
(Preferred) Use a Box Cox transformation. A Box Cox transformation is more flexible than (but also includes as a special case) the log transformation and will find an appropriate transformation from a family of power transforms that will transform the variable as close as possible to a normal distribution (Box and Cox 1964; Carroll and Ruppert 1981). At the core of the Box Cox transformation is an exponent, lambda (\(\lambda\)), which varies from -5 to 5. All values of \(\lambda\) are considered and the optimal value for the given data is estimated from the training data; The “optimal value” is the one which results in the best transformation to an approximate normal distribution. The transformation of the response \(Y\) has the form:
If your response has negative values, the Yeo-Johnson transformation is very similar to the Box-Cox but does not require the input variables to be strictly positive.
Below illustrates that the log transformation and Box Cox transformation both do about equally well in transforming Sale_Price
to look more normally distributed.
In Python we use TransformedTargetRegressor()
to build a plan for target engineering. This will not return the actual log/box-cox transformed values but, rather, a blueprint to be applied later. In this example we are simply building an object that will apply a Box-Cox transformation to the target variable when we fit our model.
There is a sklearn.preprocessing.power_transform
function that can be applied to immediately transform an array. However, later in this module we’ll discuss the idea of data leakage and how important it is to create isolated pre-processing steps.
tt = TransformedTargetRegressor(transformer=PowerTransformer(method='box-cox'))
tt
## TransformedTargetRegressor(transformer=PowerTransformer(method='box-cox'))
In R we use recipes::recipe()
to build a “recipe” of pre-processing steps. This will not return the actual log transformed values but, rather, a blueprint to be applied later.
# log transformation
ames_recipe <- recipe(Sale_Price ~ ., data = train) %>%
step_log(all_outcomes())
ames_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 80
##
## Operations:
##
## Log transformation on all_outcomes()
Data quality is an important issue for any project involving analyzing data. Data quality issues deserve an entire book in their own right, and a good reference is The Quartz guide to bad data. One of the most common data quality concerns you will run into is missing values.
Data can be missing for many different reasons; however, these reasons are usually lumped into two categories: informative missingness (Kuhn and Johnson 2013) and missingness at random (Little and Rubin 2014). Informative missingness implies a structural cause for the missing value that can provide insight in its own right; whether this be deficiencies in how the data was collected or abnormalities in the observational environment. Missingness at random implies that missing values occur independent of the data collection process2.
The category that drives missing values will determine how you handle them. For example, we may give values that are driven by informative missingness their own category (e.g., "None"
) as their unique value may affect predictive performance. Whereas values that are missing at random may deserve deletion3 or imputation.
Furthermore, different machine learning models handle missingness differently. Most algorithms cannot handle missingness (e.g., generalized linear models and their cousins, neural networks, and support vector machines) and, therefore, require them to be dealt with beforehand. A few models (mainly tree-based), have built-in procedures to deal with missing values. However, since the modeling process involves comparing and contrasting multiple models to identify the optimal one, you will want to handle missing values prior to applying any models so that your algorithms are based on the same data quality assumptions.
It is important to understand the distribution of missing values (i.e., NA
) in any data set. So far, we have been using a pre-processed version of the Ames housing data set. However, if we use the raw Ames housing data, there are actually 13,997 missing values—there is at least one missing values in each row of the original data! Good visualizations can help us understand patterns in the missing data.
ames_raw = pd.read_csv("data/ames_raw.csv")
# count missing values
ames_raw.isnull().sum()
## Order 0
## PID 0
## MS SubClass 0
## MS Zoning 0
## Lot Frontage 490
## ...
## Mo Sold 0
## Yr Sold 0
## Sale Type 0
## Sale Condition 0
## SalePrice 0
## Length: 82, dtype: int64
Check out the missingno package to see all the great ways to to visualize missing data!
# can you identify patterns of missing data
# missingness is represented with white
msno.matrix(ames_raw, labels=True, filter="bottom", sort="ascending", n=50)
# which features have most missing?
# this chart shows the number of observations so small bars (i.e. Pool QC)
# represent very few observed values (lots of missingness)
msno.bar(ames_raw, labels=True, filter="bottom", sort="ascending", n=50)
ames_raw <- read_csv("data/ames_raw.csv")
# total missing values
sum(is.na(ames_raw))
## [1] 14010
Check out the naniar package to see all the great ways to to visualize missing data!
# can you identify patterns of missing data?
# missingness is represented with black
vis_miss(ames_raw)
# which features have most missing?
gg_miss_var(ames_raw)
Imputation is the process of replacing a missing value with a substituted, “best guess” value. Imputation should be one of the first feature engineering steps you take as it will affect any downstream pre-processing4.
An elementary approach to imputing missing values for a feature is to compute descriptive statistics such as the mean, median, or mode (for categorical) and use that value to replace NA
s. Although computationally efficient, this approach does not consider any other attributes for a given observation when imputing (e.g., a female patient that is 63 inches tall may have her weight imputed as 175 lbs since that is the average weight across all observations which contains 65% males that have an average a height of 70 inches).
An alternative is to use grouped statistics to capture expected values for observations that fall into similar groups. However, this becomes infeasible for larger data sets. Modeling imputation can automate this process for you and the two most common methods include K-nearest neighbor and tree-based imputation, which are discussed next.
However, it is important to remember that imputation should be performed within the resampling process and as your data set gets larger, repeated model-based imputation can compound the computational demands. Thus, you must weigh the pros and cons of the two approaches.
The following code snippet shows how to impute different features with the median value of a given feature.
These imputers do not yet perform imputation. At the end of this module we will show you how all these pieces of the feature engineering come together in our ML modeling pipeline.
In Python, the SimpleImputer
can be used to apply an imputation to all features. However, if you only want to apply imputation to a subset of features then you would use ColumnTransformer
in addition to the SimpleImputer
strategy.
# median imputation to all features
a = SimpleImputer(strategy='median')
# median imputation to just numeric predictors
b = ColumnTransformer([("num_imp", a, selector(dtype_include="number"))])
# median imputation to 1 or more features
c = ColumnTransformer([("num_imp", a, selector("Gr_Liv_Area"))])
See more imputation options here.
# median imputation to all features
a <- ames_recipe %>%
step_medianimpute(all_predictors())
# median imputation to just numeric predictors
b <- ames_recipe %>%
step_medianimpute(all_predictors(), all_numeric())
# median imputation to 1 or more features
c <- ames_recipe %>%
step_medianimpute(Gr_Liv_Area)
K-nearest neighbor (KNN) imputes values by identifying observations with missing values, then identifying other observations that are most similar based on the other available features, and using the values from these nearest neighbor observations to impute missing values.
We discuss KNN for predictive modeling in a later module; the imputation application works in a similar manner. In KNN imputation, the missing value for a given observation is treated as the targeted response and is predicted based on the average (for quantitative values) or the mode (for qualitative values) of the k nearest neighbors.
As discussed in the KNN modeling module, if all features are quantitative then standard Euclidean distance is commonly used as the distance metric to identify the k neighbors and when there is a mixture of quantitative and qualitative features then Gower’s distance (Gower 1971) can be used. KNN imputation is best used on small to moderate sized data sets as it becomes computationally burdensome with larger data sets (Kuhn and Johnson 2019).
As we saw in the last module, k is a tunable hyperparameter. Suggested values for imputation are 5–10 (Kuhn and Johnson 2019).
knn_imp = KNNImputer(n_neighbors=6)
knn_imp <- ames_recipe %>%
step_knnimpute(all_predictors(), neighbors = 6)
Several implementations of decision trees and their derivatives can be constructed in the presence of missing values. Thus, they provide a good alternative for imputation. As discussed in later modules, single trees have high variance but aggregating across many trees creates a robust, low variance predictor. Random forest imputation procedures have been studied (Shah et al. 2014; Stekhoven 2015); however, they require significant computational demands in a resampling environment (Kuhn and Johnson 2019). Bagged trees offer a compromise between predictive accuracy and computational burden.
Don’t worry, we’ll discuss decision trees and variants of them indepth in later modules.
Similar to KNN imputation, observations with missing values are identified and the feature containing the missing value is treated as the target and predicted using bagged decision trees.
Unfortunately Python does not provide a tree-based approach for imputation. This would have to be done manually. However, KNN & tree-based imputation tends to perform similarly.
ames_recipe %>%
step_bagimpute(all_predictors())
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 80
##
## Operations:
##
## Log transformation on all_outcomes()
## Bagged tree imputation for all_predictors()
The plot below illustrates the differences between mean, KNN, and tree-based imputation on the raw Ames housing data. It is apparent how descriptive statistic methods (e.g., using the mean and median) are inferior to the KNN and tree-based imputation methods.
When possible use KNN or tree-based imputation. When data sets become extremely large tree-based imputation can be more efficient.
In many data analyses and modeling projects we end up with hundreds or even thousands of collected features. From a practical perspective, a model with more features often becomes harder to interpret and is costly to compute. Some models are more resistant to non-informative predictors (e.g., the Lasso and tree-based methods) than others as illustrated below.5
Although the performance of some of our models are not significantly affected by non-informative predictors, the time to train these models can be negatively impacted as more features are added. The plot below shows the increase in time to perform 10-fold CV on the exemplar data, which consists of 10,000 observations. We see that many algorithms (e.g., elastic nets, random forests, and gradient boosting machines) become extremely time intensive the more predictors we add. Consequently, filtering or reducing features prior to modeling may significantly speed up training time.
Zero and near-zero variance variables are low-hanging fruit to eliminate. Zero variance variables, meaning the feature only contains a single unique value, provides no useful information to a model. Some algorithms are unaffected by zero variance features. However, features that have near-zero variance also offer very little, if any, information to a model. Furthermore, they can cause problems during resampling as there is a high probability that a given sample will only contain a single unique value (the dominant value) for that feature.
A rule of thumb for detecting near-zero variance features are identifying and removing features with \(\leq 5-10\)% variance.
For the Ames data, we do not have any zero variance predictors but there are many features that meet the near-zero threshold. The following shows all features where there is less than 5% variance in distinct values.
## rowname freqRatio percentUnique zeroVar nzv
## 1 Street 226.66667 0.09760859 FALSE TRUE
## 2 Alley 24.25316 0.14641288 FALSE TRUE
## 3 Utilities 1023.00000 0.14641288 FALSE TRUE
## 4 Land_Slope 22.15909 0.14641288 FALSE TRUE
## 5 Land_Contour 19.50000 0.19521718 FALSE TRUE
## 6 Kitchen_AbvGr 21.23913 0.19521718 FALSE TRUE
## 7 Pool_QC 509.75000 0.24402147 FALSE TRUE
## 8 Misc_Feature 34.18966 0.24402147 FALSE TRUE
## 9 Bsmt_Cond 20.24444 0.29282577 FALSE TRUE
## 10 Heating 106.00000 0.29282577 FALSE TRUE
## 11 Condition_2 202.60000 0.34163006 FALSE TRUE
## 12 BsmtFin_Type_2 25.85294 0.34163006 FALSE TRUE
## 13 Roof_Matl 144.35714 0.39043436 FALSE TRUE
## 14 Functional 38.89796 0.39043436 FALSE TRUE
## 15 Pool_Area 2039.00000 0.53684724 FALSE TRUE
## 16 Three_season_porch 673.66667 1.12249878 FALSE TRUE
## 17 Low_Qual_Fin_SF 1010.50000 1.31771596 FALSE TRUE
## 18 Misc_Val 180.54545 1.56173743 FALSE TRUE
## 19 Screen_Porch 169.90909 4.63640800 FALSE TRUE
Other feature filtering methods exist; see Saeys, Inza, and Larrañaga (2007) for a thorough review. Furthermore, several wrapper methods exist that evaluate multiple models using procedures that add or remove predictors to find the optimal combination of features that maximizes model performance (see, for example, Kursa, Rudnicki, and others (2010), Granitto et al. (2006), Maldonado and Weber (2009)). However, this topic is beyond the scope of this course.
See other feature filtering functions here.
nzv = VarianceThreshold(threshold=0.1)
See other feature filtering functions here.
nzv <- ames_recipe %>%
step_nzv(all_predictors(), unique_cut = 10) # threshold as % (10%)
Numeric features can create a host of problems for certain models when their distributions are skewed, contain outliers, or have a wide range in magnitudes. Tree-based models are quite immune to these types of problems in the feature space, but many other models (e.g., GLMs, regularized regression, KNN, support vector machines, neural networks) can be greatly hampered by these issues. Normalizing and standardizing heavily skewed features can help minimize these concerns.
Similar to the process discussed to normalize target variables, parametric models that have distributional assumptions (e.g., GLMs, and regularized models) can benefit from minimizing the skewness of numeric features. When normalizing many variables, it’s best to use the Box-Cox (when feature values are strictly positive) or Yeo-Johnson (when feature values are not strictly positive) procedures as these methods will identify if a transformation is required and what the optimal transformation will be.
Non-parametric models are rarely affected by skewed features; however, normalizing features will not have a negative effect on these models’ performance. For example, normalizing features will only shift the optimal split points in tree-based algorithms. Consequently, when in doubt, normalize.
# Normalizing approach
yj = PowerTransformer(method="yeo-johnson")
# Normalize all numeric features
X_norm = ColumnTransformer([("norm", yj, selector(dtype_include="number"))])
# Normalize all numeric features
X_norm <- ames_recipe %>%
step_YeoJohnson(all_predictors(), all_numeric())
We must also consider the scale on which the individual features are measured. What are the largest and smallest values across all features and do they span several orders of magnitude? Models that incorporate smooth functions of input features are sensitive to the scale of the inputs. For example, \(5X+2\) is a simple linear function of the input X, and the scale of its output depends directly on the scale of the input. Many algorithms use linear functions within their algorithms, some more obvious (e.g., GLMs and regularized regression) than others (e.g., neural networks, support vector machines, and principal components analysis). Other examples include algorithms that use distance measures such as the Euclidean distance (e.g., k nearest neighbor, k-means clustering, and hierarchical clustering).
For these models and modeling components, it is often a good idea to standardize the features. Standardizing features includes centering and scaling so that numeric variables have zero mean and unit variance, which provides a common comparable unit of measure across all the variables.
# Normalizing approach
scaler = StandardScaler()
# standardize all numeric features
std = ColumnTransformer([("norm", scaler, selector(dtype_include="number"))])
# standardize all numeric features
std <- ames_recipe %>%
step_center(predictors(), all_numeric()) %>%
step_scale(predictors(), all_numeric())
Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from pre-processing categorical features. The following sections will discuss a few of the more common approaches to engineer categorical features.
There are many ways to recode categorical variables as numeric. The most common is referred to as one-hot encoding, where we transpose our categorical variables so that each level of the feature is represented as a boolean value. For example, one-hot encoding the left data frame in the below figure results in X
being converted into three columns, one for each level. This is called less than full rank encoding . However, this creates perfect collinearity which causes problems with some predictive modeling algorithms (e.g., ordinary linear regression and neural networks). Alternatively, we can create a full-rank encoding by dropping one of the levels (level c
has been dropped). This is referred to as dummy encoding.
# one-hot encoder
encoder = OneHotEncoder()
# apply to all categorical features
ohe = ColumnTransformer([("one-hot", encoder, selector(dtype_include="object"))])
# dummy encode
encoder = OneHotEncoder(drop='first')
# apply to all categorical features
de = ColumnTransformer([("dummy", encoder, selector(dtype_include="object"))])
# one-hot encode
ohe <- ames_recipe %>%
step_dummy(all_nominal(), one_hot = TRUE)
# dummy encode
de <- ames_recipe %>%
step_dummy(all_nominal(), one_hot = FALSE)
Label encoding is a pure numeric conversion of the levels of a categorical variable. For example, the MS_SubClass
variable has 16 levels, which if recoded with label encoding would convert them to integers based on alphabetical order of the feature values.
## # A tibble: 15 x 2
## original_levels label_encoded_levels
## <chr> <dbl>
## 1 Duplex_All_Styles_and_Ages 1
## 2 One_and_Half_Story_Finished_All_Ages 2
## 3 One_and_Half_Story_Unfinished_All_Ages 3
## 4 One_Story_1945_and_Older 4
## 5 One_Story_1946_and_Newer_All_Styles 5
## 6 One_Story_PUD_1946_and_Newer 6
## 7 One_Story_with_Finished_Attic_All_Ages 7
## 8 PUD_Multilevel_Split_Level_Foyer 8
## 9 Split_Foyer 9
## 10 Split_or_Multilevel 10
## 11 Two_and_Half_Story_All_Ages 11
## 12 Two_Family_conversion_All_Styles_and_Ages 12
## 13 Two_Story_1945_and_Older 13
## 14 Two_Story_1946_and_Newer 14
## 15 Two_Story_PUD_1946_and_Newer 15
# create encoder
encoder = LabelEncoder()
# Label encode a single column
lbl = ColumnTransformer([("label", encoder, "MS_SubClass")])
# Label encode a single column
lbl <- ames_recipe %>%
step_integer(MS_SubClass)
We should be careful with label encoding unordered categorical features because most models will treat them as ordered numeric features. If a categorical feature is naturally ordered then label encoding is a natural choice (most commonly referred to as ordinal encoding). For example, the various quality features in the Ames housing data are ordinal in nature (ranging from Very_Poor
to Very_Excellent
).
## # A tibble: 2,049 x 5
## Overall_Qual Exter_Qual Bsmt_Qual Kitchen_Qual Garage_Qual
## <chr> <chr> <chr> <chr> <chr>
## 1 Average Typical Typical Typical Typical
## 2 Above_Average Typical Typical Typical Typical
## 3 Good Typical Good Typical Typical
## 4 Above_Average Typical Good Typical Typical
## 5 Average Typical Typical Typical Typical
## 6 Above_Average Typical No_Basement Typical Typical
## 7 Average Typical Typical Typical Typical
## 8 Average Typical Typical Typical Typical
## 9 Average Typical Typical Typical No_Garage
## 10 Average Typical Typical Typical Typical
## # … with 2,039 more rows
Ordinal encoding these features provides a natural and intuitive interpretation and can logically be applied to all models.
Scikit-learn has an OrdinalEncoder
; however, it is not very flexible in allowing you to easily specify levels across many features. However, the Category Encoders package provides several scikit-learn style transformers including an OrdinalEncoder
that is more flexible.
# ID all quality features to ordinal encode
cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
# specify levels in order
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical",
"Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
# create a level to integer mapping
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in cols]
# example of first two mappings
category_mapping[0:2]
## [{'col': 'Overall_Qual', 'mapping': {'Very_Poor': 0, 'Poor': 1, 'Fair': 2, 'Below_Average': 3, 'Average': 4, 'Typical': 5, 'Above_Average': 6, 'Good': 7, 'Very_Good': 8, 'Excellent': 9, 'Very_Excellent': 10}}, {'col': 'Overall_Cond', 'mapping': {'Very_Poor': 0, 'Poor': 1, 'Fair': 2, 'Below_Average': 3, 'Average': 4, 'Typical': 5, 'Above_Average': 6, 'Good': 7, 'Very_Good': 8, 'Excellent': 9, 'Very_Excellent': 10}}]
# Apply ordinal encoder
cat_encoder = OrdinalEncoder(cols=cols, mapping=category_mapping)
OrdinalEncoder
will automatically convert any levels not in the specified levels to a -1.
# specify levels in order
lvls <- c("Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical",
"Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent")
# apply ordinal encoding to quality features
ord_lbl <- ames_recipe %>%
# 1. convert quality features to factors
step_string2factor(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
levels = lvls
) %>%
# 2. convert any missed levels (i.e. no_pool, no_garage,) to "None"
step_unknown(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
new_level = "None"
) %>%
# 3. move "None" level to front ('None', 'Very_Poor', ..., 'Very_Excellent')
step_relevel(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
ref_level = "None"
) %>%
step_integer(ends_with("Qual"), ends_with("QC"), ends_with("_Cond"))
Sometimes features will contain levels that have very few observations. For example, there are 28 unique neighborhoods represented in the Ames housing data but several of them only have a few observations.
## # A tibble: 28 x 2
## Neighborhood n
## <chr> <int>
## 1 Landmark 1
## 2 Green_Hills 2
## 3 Greens 3
## 4 Blueste 8
## 5 Veenker 15
## 6 Northpark_Villa 17
## 7 Bloomington_Heights 18
## 8 Meadow_Village 22
## 9 Briardale 23
## 10 Clear_Creek 26
## # … with 18 more rows
Sometimes we can benefit from collapsing, or “lumping” these into a lesser number of categories. In the above examples, we may want to collapse all levels that are observed in less than 10% of the training sample into an “other” category. We can use step_other()
to do so. However, lumping should be used sparingly as there is often a loss in model performance (Kuhn and Johnson 2013).
Tree-based models often perform exceptionally well with high cardinality features and are not as impacted by levels with small representation.
The following code snippets show how to lump all neighborhoods that represent less than 1% of observations into an “other” category.
# create rare label encoder
rare_encoder = RareLabelEncoder(tol=0.01, replace_with="other")
# demonstrate how some neighborhoods are now represented by "other"
rare_encoder.fit_transform(X_train)["Neighborhood"].unique()
## array(['Mitchell', 'Sawyer', 'Edwards', 'College_Creek', 'Meadow_Village',
## 'Northridge_Heights', 'Northwest_Ames', 'Northridge', 'Brookside',
## 'Old_Town', 'Sawyer_West', 'Gilbert',
## 'South_and_West_of_Iowa_State_University', 'Somerset', 'Crawford',
## 'North_Ames', 'Clear_Creek', 'Iowa_DOT_and_Rail_Road',
## 'Timberland', 'other', 'Stone_Brook', 'Bloomington_Heights'],
## dtype=object)
# Lump levels for two features
rare_encoder <- ames_recipe %>%
step_other(Neighborhood, threshold = 0.01, other = "other")
There are several alternative categorical encodings that are implemented in both R and Python that are worth exploring. For example, target encoding is the process of replacing a categorical value with the mean (regression) or proportion (classification) of the target variable. Count/frequency encoding replaces categories with the number of observations or percentage of observations per category. Hash encoding converts strings to arbitrary digits (similar to label encoding but more robust).
Dimension reduction is an alternative approach to filter out non-informative features without manually removing them. This course does not focus indepth on dimension reduction; however, we wanted to highlight that it is very common to include these types of dimension reduction approaches during the feature engineering process. For example, we may wish to reduce the dimension of our features with principal components analysis and retain the number of components required to explain, say, 95% of the variance and use these components as features in downstream modeling.
# PCA object - keep 25 components
pca = PCA(n_components=25)
# apply PCA to all numeric features
pca_encoder = ColumnTransformer([("pca", pca, selector(dtype_include="number"))])
pca_encoder <- ames_recipe %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
step_pca(all_numeric(), threshold = .95)
We stated at the beginning of this module that we should think of feature engineering as creating a blueprint rather than manually performing each task individually. This helps us in two ways: (1) thinking sequentially and (2) to apply appropriately within the resampling process.
Thinking of feature engineering as a blueprint forces us to think of the ordering of our pre-processing steps. Although each particular problem requires you to think of the effects of sequential pre-processing, there are some general suggestions that you should consider:
While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:
Data leakage is when information from outside the training data set is used to create the model. Data leakage often occurs during the data pre-processing period. To minimize this, feature engineering should be done in isolation of each resampling iteration. Recall that resampling allows us to estimate the generalizable prediction error. Therefore, we should apply our feature engineering blueprint to each resample independently as illustrated below. That way we are not leaking information from one data set to another (each resample is designed to act as isolated training and test data).
For example, when standardizing numeric features, each resampled training data should use its own mean and variance estimates and these specific values should be applied to the same resampled test set. This imitates how real-life prediction occurs where we only know our current data’s mean and variance estimates; therefore, on new data that comes in where we need to predict we assume the feature values follow the same distribution of what we’ve seen in the past.
To illustrate how this process works together via R code, let’s do a simple re-assessment on the ames
data set that we did at the end of the last module. However, this time we will start with all features and apply a sequence of simple feature engineering steps. Specifically, our feature engineering steps will include:
First, let’s create our
These items are nothing new and similar to our approach in the last module.
# create train/test split
train, test = train_test_split(ames, train_size=0.7, random_state=123)
# separate features from labels and only use numeric features
X_train = train.drop("Sale_Price", axis=1)
y_train = train["Sale_Price"]
# create KNN model object
knn = KNeighborsRegressor()
# define loss function
loss = 'neg_root_mean_squared_error'
# create 10 fold CV object
kfold = KFold(n_splits=10, random_state=123, shuffle=True)
# Create grid of hyperparameter values
hyper_grid = {'knn__n_neighbors': range(2, 26)}
Next, we create our feature engineering recipe with the 6 steps mentioned above. In Python we can use the ColumnTransformer
to specify the steps in order. ColumnTransformer
takes a list of tuples (<step name>, <encoder>, <columns to apply encoder to>)
in the order that you want to apply them. The remainder
parameter allows you to keep any columns not mentioned in the encoding steps otherwise by default they will be dropped.
# 1. Remove near-zero variance features that are categorical
nzv_encoder = VarianceThreshold(threshold=0.1)
# 2. Ordinal encode our quality-based features
ord_cols = list(X_train.filter(regex=("Qual$|QC$|Cond$")).columns)
lvs = ["Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical",
"Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"]
val = range(0, len(lvs))
lvl_map = dict(zip(lvs, val))
category_mapping = [{'col': col, 'mapping': lvl_map} for col in ord_cols]
ord_encoder = OrdinalEncoder(cols=ord_cols, mapping=category_mapping)
# 3. Center and scale (i.e., standardize) all numeric features
scaler = StandardScaler()
# 4. Perform dimension reduction by applying PCA to all numeric features
pca = PCA(n_components=30)
# 5. One-hot encode remaining categorical features.
encoder = OneHotEncoder(handle_unknown="ignore")
# combine all steps into a pre-processing pipeline
preprocessor = ColumnTransformer(
remainder="passthrough",
transformers=[
("nzv_encode", nzv_encoder, selector(dtype_include="number")),
("ord_encode", ord_encoder, ord_cols),
("std_encode", scaler, selector(dtype_include="number")),
("pca_encode", pca, selector(dtype_include="number")),
("one-hot", encoder, selector(dtype_include="object")),
])
We can now create a modeling pipeline that includes the pre-processing step and then the modeling step. Similar to ColumnTransformer
, Pipeline
takes tuples with the name and object for each step in the order you want them applied.
model_pipeline = Pipeline(steps=[
("preprocessor", preprocessor),
("knn", knn),
])
Now we can build our GridSearchCV
object with our model process objects. This is where the feature engineering actually happens as grid_search.fit
will apply the feature engineering steps individually within each of the kfold resamples so there is no data leakage.
# Tune a knn model using grid search
grid_search = GridSearchCV(model_pipeline, hyper_grid, cv=kfold, scoring=loss)
results = grid_search.fit(X_train, y_train)
# Best model's cross validated RMSE
abs(results.best_score_)
## 41915.58187389292
# Best model's k value
results.best_estimator_.get_params().get('knn__n_neighbors')
## 5
# Plot all RMSE results
all_rmse = pd.DataFrame({'k': range(2, 26),
'RMSE': np.abs(results.cv_results_['mean_test_score'])})
(ggplot(all_rmse, aes(x='k', y='RMSE'))
+ geom_line()
+ geom_point()
+ ggtitle("Cross validated grid search results"))
## <ggplot: (320319648)>
First, let’s create our
These items are nothing new and similar to our approach in the last module.
# create train/test split
set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7)
train <- training(split)
test <- testing(split)
# create resampling procedure
kfold <- vfold_cv(train, v = 10)
# model object
knn <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
# Create grid of hyperparameter values
hyper_grid <- expand.grid(neighbors = seq(2, 25, by = 1))
Next, we create our model and feature engineering recipe with the 6 steps mentioned above.
# model recipe with feature engineering steps
model_form <- recipe(Sale_Price ~ ., data = train) %>%
# 1. Remove near-zero variance features that are categorical
step_nzv(all_nominal()) %>%
# 2. Ordinal encode our quality-based features
step_string2factor(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
levels = lvls
) %>%
step_unknown(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
new_level = "None"
) %>%
step_relevel(
ends_with("Qual"), ends_with("QC"), ends_with("_Cond"),
ref_level = "None"
) %>%
step_integer(ends_with("Qual"), ends_with("QC"), ends_with("_Cond")) %>%
# 3. Center and scale (i.e., standardize) all numeric features
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
# 4. Perform dimension reduction by applying PCA to all numeric features.
step_pca(all_numeric(), -all_outcomes(), num_comp = 30) %>%
# 5. Collapse categorical levels with small representation
step_other(all_nominal(), threshold = 0.01, other = "other") %>%
# 6. One-hot encode the remaining categorical features.
step_dummy(all_nominal(), one_hot = TRUE)
Now we can run tune_grid
and supply it all our objects. This is where the feature engineering actually happens as tune_grid
will apply the feature engineering steps individually within each of the kfold resamples so there is no data leakage.
# Tune a knn model using grid search
results <- tune_grid(knn, model_form, resamples = kfold, grid = hyper_grid)
# best model
show_best(results, metric = "rmse")
## # A tibble: 5 x 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 14 rmse standard 39234. 9 1977. Preprocessor1_Model13
## 2 15 rmse standard 39235. 9 1981. Preprocessor1_Model14
## 3 16 rmse standard 39246. 9 1985. Preprocessor1_Model15
## 4 13 rmse standard 39252. 9 1972. Preprocessor1_Model12
## 5 17 rmse standard 39264. 9 1990. Preprocessor1_Model16
# plot results
results %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
ggplot(aes(neighbors, mean)) +
geom_line() +
geom_point() +
labs(x = "k", y = "RMSE", title = "Cross validated grid search results")
Using the Ames dataset and the same approach shown in the last section:
Rather than use a 70% stratified training split, try an 80% unstratified training split. How does your cross-validated results compare?
Rather than numerically encode the quality and condition features (i.e. step_integer(matches("Qual|Cond|QC"))
), one-hot encode these features. What is the difference in the number of features in your training set? Apply the same cross-validated KNN model to this new feature set. How does the performance change? How does the training time change?
Identify three new feature engineering steps that are provided by recipes, scikit-learn or some other open source R/Python package:
Compare the Python and R pre-processing steps in the last section. Can you spot any discrepencies? If so, resolve them so the pipelines are performing the same steps.
Using the Attrition data set, assess the characteristics of the target and features.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques↩︎
Little and Rubin (2014) discuss two different kinds of missingness at random; however, we combine them for simplicity as their nuanced differences are distinguished between the two in practice.↩︎
If your data set is large, deleting missing observations that have missing values at random rarely impacts predictive performance. However, as your data sets get smaller, preserving observations is critical and alternative solutions should be explored.↩︎
For example, standardizing numeric features will include the imputed numeric values in the calculation and one-hot encoding will include the imputed categorical value.↩︎
See Kuhn and Johnson (2013) Section 19.1 for data set generation.↩︎