Machine learning (ML) continues to grow in importance for many organizations across nearly all domains. Some example applications of machine learning in practice include:

  • Predicting the likelihood of a patient returning to the hospital (readmission) within 30 days of discharge.
  • Segmenting customers based on common attributes or purchasing behavior for targeted marketing.
  • Predicting coupon redemption rates for a given marketing campaign.
  • Predicting customer churn so an organization can perform preventative intervention.
  • And many more!

In essence, these tasks all seek to learn from data. To address each scenario, we can use a given set of features to train an algorithm and extract insights. These algorithms, or learners, can be classified according to the amount and type of supervision needed during training.

Learning objectives

This module will introduce you to some fundamental concepts around ML and this class. By the end of this module you will:

  1. Be able to explain the difference between supervised and unsupervised learning.
  2. Know when a problem is considered a regression or classification problem.
  3. Understand the objective and structure of this course and the type of exercises involved.
  4. Be able to import and explore the data sets we’ll use through various examples.

Supervised learning

A predictive model is used for tasks that involve the prediction of a given output (or target) using other variables (or features) in the data set. The learning algorithm in a predictive model attempts to discover and model the relationships among the target variable (the variable being predicted) and the other features (aka predictor variables). Examples of predictive modeling include:

  • using customer attributes to predict the probability of the customer churning in the next 6 weeks;
  • using home attributes to predict the sales price;
  • using employee attributes to predict the likelihood of attrition;
  • using patient attributes and symptoms to predict the risk of readmission;
  • using production attributes to predict time to market.

Each of these examples has a defined learning task; they each intend to use attributes (\(X\)) to predict an outcome measurement (\(Y\)).

Throughout this course we’ll use various terms interchangeably for:

  • \(X\): “predictor variable,” “independent variable,” “attribute,” “feature,” “predictor”
  • \(Y\): “target variable,” “dependent variable,” “response,” “outcome measurement”

The predictive modeling examples above describe what is known as supervised learning. The supervision refers to the fact that the target values provide a supervisory role, which indicates to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the algorithmic steps) to find the combination of feature values that results in a predicted value that is as close to the actual target output as possible.

In supervised learning, the training data you feed the algorithm includes the target values. Consequently, the solutions can be used to help supervise the training process to find the optimal algorithm parameters.
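As a rough illustration of this idea, the sketch below fits a linear model with scikit-learn. The data are synthetic and made up for this example (two features and a target that depends on them linearly); the point is only that the known target values `y` are passed to `fit()` alongside the features `X`, which is what makes the training supervised.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not one of the course data sets):
# the target depends linearly on two features, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                              # features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)   # target

# The target values supervise the fit: the learner chooses coefficients
# that make its predictions as close as possible to y.
model = LinearRegression().fit(X, y)

coefs = model.coef_            # estimated coefficients, close to 3 and -2
preds = model.predict(X[:5])   # predicted target values for five observations
```

Because the true relationship is known here, the estimated coefficients can be compared with the values used to generate the data.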

Most supervised learning problems can be bucketed into one of two categories, regression or classification, which we discuss next.

Regression problems

When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling). Regression problems revolve around predicting output that falls on a continuum. In the examples above, predicting home sales prices and time to market are regression problems because the output is numeric and continuous. This means that, given the combination of predictor values, the response value could fall anywhere along some continuous spectrum (e.g., the predicted sales price of a particular home could be between $80,000 and $755,000). The figure below illustrates average home sales prices as a function of two home features: year built and total square footage. Depending on the combination of these two features, the expected home sales price could fall anywhere along a plane.

Fig 1: Average home sales price as a function of year built and total square footage.

Classification problems

When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem. Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:

  • Did a customer redeem a coupon (coded as yes/no or 1/0)?
  • Did a customer churn (coded as yes/no or 1/0)?
  • Did a customer click on our online ad (coded as yes/no or 1/0)?
  • Classifying customer reviews:
    • Binary: positive vs. negative.
    • Multinomial: extremely negative to extremely positive on a 0–5 Likert scale.

Fig 2: Classification problem modeling ‘Yes’/‘No’ response based on three features.

However, when we apply machine learning models to classification problems, rather than predict a particular class (e.g., “yes” or “no”), we often want to predict the probability of each class (e.g., yes: 0.65, no: 0.35). By default, the class with the highest predicted probability becomes the predicted class. Consequently, even though we are solving a classification problem, we are still predicting a numeric output (a probability). The essence of the problem, predicting a categorical outcome, still makes it a classification problem.
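The relationship between predicted probabilities and predicted classes can be sketched with scikit-learn. The example below uses a synthetic binary data set with three features (an assumption for illustration) and a logistic regression classifier; `predict_proba()` returns one probability per class, and `predict()` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with three features
# (an assumption for illustration, not one of the course data sets).
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

clf = LogisticRegression().fit(X, y)

# One column per class; each row sums to 1.
proba = clf.predict_proba(X[:5])

# The predicted class label for the same observations.
labels = clf.predict(X[:5])

# Picking the most probable class by hand gives the same labels.
highest = clf.classes_[np.argmax(proba, axis=1)]
```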

Although there are machine learning algorithms that can be applied to regression problems but not classification and vice versa, most of the supervised learning algorithms we cover in this module can be applied to both. These algorithms have become the most popular machine learning applications in recent years.

Unsupervised learning

Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data, but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.

The goal of clustering is to segment observations into similar groups based on the observed variables; for example, to divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used to reduce the feature set to a potentially smaller set of uncorrelated variables. Such a reduced feature set is often used as input to downstream supervised learning models (e.g., principal component regression).
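Both ideas can be sketched in a few lines of scikit-learn. The data below are synthetic (two well-separated groups of observations measured on four variables, an assumption for illustration); k-means is used here as one example of a clustering method and principal components analysis as one example of dimension reduction. Neither call receives a target variable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): two well-separated
# groups of 50 observations each, measured on four variables.
rng = np.random.default_rng(1)
group_a = rng.normal(0, 1, size=(50, 4))
group_b = rng.normal(8, 1, size=(50, 4))
X = np.vstack([group_a, group_b])

# Clustering groups the rows: each observation is assigned to a cluster.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Dimension reduction works on the columns: four variables are
# replaced by two uncorrelated principal components.
components = PCA(n_components=2).fit_transform(X)
```

The `components` array could then serve as the feature set for a downstream supervised model, as in principal component regression.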

Unsupervised learning is often performed as part of an exploratory data analysis (EDA). However, the exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. Furthermore, it can be hard to assess the quality of results obtained from unsupervised learning methods. The reason for this is simple. If we fit a predictive model using a supervised learning technique (e.g., linear regression), then it is possible to check our work by seeing how well our model predicts the response Y on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work because we don’t know the true answer: the problem is unsupervised!
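The supervised "check our work" step can be sketched as follows. This is a minimal example on synthetic regression data (an assumption for illustration): part of the data is held out, the model is fit on the rest, and the held-out responses are compared with the predictions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 30% of the observations; they are not used in fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R-squared on the held-out observations: how well the model
# predicts responses it has never seen.
r2_test = model.score(X_test, y_test)
```

No equivalent score exists for a clustering result, because there is no held-out "true" grouping to compare against.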

Despite its subjectivity, the importance of unsupervised learning should not be overlooked, and such techniques are often used in organizations to:

  • Divide consumers into different homogeneous groups so that tailored marketing strategies can be developed and deployed for each segment.
  • Identify groups of online shoppers with similar browsing and purchase histories, as well as items that are of particular interest to the shoppers within each group. Then an individual shopper can be preferentially shown the items in which he or she is particularly likely to be interested, based on the purchase histories of similar shoppers.
  • Identify products that have similar purchasing behavior so that managers can manage them as product groups.

These questions, and many more, can be addressed with unsupervised learning. Moreover, the outputs of unsupervised learning models can be used as inputs to downstream supervised learning models.

Objective

The goal of this course is to provide effective tools for uncovering relevant and useful patterns in your data by using the R and Python ML ecosystems with a focus on supervised learning. The progression of this course is designed around 4 themes:

  • Lessons to teach you how the sub-tasks of ML fit together.
  • Code recipes to illustrate how to apply ML tasks and workflows with R & Python.
  • Exercises to get you applying what you learned and deepening your understanding.
  • Portfolio builders to force you to bring together your knowledge and create end-to-end solutions.

Lessons

The lessons are designed to help you understand the individual sub-tasks of an ML project. The focus is to have an intuitive understanding of each discrete sub-task. Once you understand when, where, and why these sub-tasks are performed, you will be able to transfer this knowledge to other projects. The lessons will:

  1. Provide an overview of the ML modeling process:
    • feature engineering
    • data splitting
    • model fitting
    • model validation and tuning
    • performance measurement
  2. Cover common supervised learners:
    • linear regression
    • regularized regression
    • K-nearest neighbors
    • decision trees
    • bagging & random forests
    • gradient boosting
  3. Illustrate how to maximize predictive performance with:
    • hyperparameter tuning
    • stacking models

Code Recipes

To help your understanding we provide code recipes in both R and Python so that you can start implementing these ML sub-tasks in both languages. These recipes will include:

  • Small recipes that illustrate discrete tasks such as normalizing features or tuning a random forest model.
  • Large recipes that demonstrate how to put several sub-tasks together for a larger ML workflow. This may include (a) creating a training sample, (b) applying feature engineering sub-tasks, (c) performing a grid search for a k-nearest neighbor model, (d) assessing model performance.
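To give a sense of what a large recipe might look like, here is a minimal sketch of steps (a) through (d) in scikit-learn. It uses the copy of the iris data bundled with scikit-learn rather than the course's .csv file, and the candidate values for the number of neighbors are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled iris data, used here as a stand-in for the course's .csv file.
X, y = load_iris(return_X_y=True)

# (a) create a training sample (70% train, 30% test, stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# (b) feature engineering (standardize features) chained with the learner
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# (c) grid search over the number of neighbors with 5-fold cross-validation
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9]},  # illustrative values
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# (d) assess performance of the best model on the held-out test set
best_k = grid.best_params_["knn__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)
```

Placing the scaler inside the pipeline means it is re-fit on each cross-validation training fold, so the validation folds do not leak into the feature engineering step.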

Exercises

At the end of each module we provide additional exercises for you to perform. These exercises force you to apply your new knowledge on different data sets. For the exercises we use small well-understood data sets because:

  • They are small, which means they can easily be run on your local machine in a reasonable time.
  • They are well behaved, meaning you often don’t need to do a lot of feature engineering to get a good result and there is often a small range of options for good results.
  • They are benchmarks, meaning that many people have used them before and you can get ideas of good approaches to maximize performance.

Portfolio builders

Newcomers to the world of ML can have a difficult time extrapolating what they have learned and applying the ML process to new, more complex data sets that don’t have benchmark examples. The portfolio builder exercises are designed to get you working through this challenge by identifying new data sets and applying the ML process to uncover patterns in the data. Whether patterns, strong or weak, exist is not the most important goal; rather, getting comfortable working through the ML process on new data sets where examples are not prevalent is!

Python & R

You will always find a debate about which language is “best” for machine learning: Python or R. Unfortunately, this is a poor way to think about ML and these two languages.

First, your objective should be to understand the fundamental machine learning concepts. Second, you should have a solid understanding of how to apply these concepts in either language because which one you use can largely be driven by the culture of the organization you work for.

Consequently, this course will illustrate how to apply machine learning in both languages. Code recipes will be supplied in Python and R tabs as illustrated here:

Scikit-learn is the predominant Python package for machine learning. Unlike R, where these components are spread across several packages, scikit-learn provides nearly all components required for the modeling process (e.g., sampling, feature engineering, modeling, evaluation). Scikit-learn is part of the SciPy ecosystem, a group of Python libraries for mathematics, science, and engineering. Other packages that you will commonly use for ML in Python include:

  • Numpy: Provides foundational data structures and computations to efficiently work with data arrays.
  • Matplotlib: Provides data visualization capabilities.
  • Pandas: Tools and data structures to organize and analyze your data.

The following Python packages are used throughout this module. You may want to use a virtual environment; however, most code recipes should run regardless of small deviations in package versions.

# data management
pip install -U pandas
pip install -U numpy

# data visualization
pip install -U matplotlib
pip install -U plotnine

# modeling
pip install -U scikit-learn

# check the installed version (run in Python)
import sklearn

sklearn.__version__
## '0.24.2'

Historically, the R ecosystem has provided a wide variety of ML algorithm implementations. This has its benefits; however, it also has drawbacks, as it requires users to learn many different formula interfaces and syntax nuances. More recently, development on a group of packages called tidymodels has helped to make implementation easier.

Whereas in Python you can perform most, if not all, of the ML sub-tasks with scikit-learn, the tidymodels collection allows you to perform discrete parts of the ML workflow with discrete packages:

  • rsample for data splitting and resampling
  • recipes for data pre-processing and feature engineering
  • parsnip for applying algorithms
  • tune for hyperparameter tuning
  • yardstick for measuring model performance

The following R packages are used throughout this module. You may want to use a virtual environment; however, most code recipes should run regardless of small deviations in package versions. Note that when you install tidymodels you are actually installing several packages that exist in the tidymodels framework as discussed above.

# common data wrangling and visualization
install.packages("tidyverse")
install.packages("vip")

# modeling
install.packages("tidymodels")
packageVersion("tidymodels")
## [1] '0.1.3'

library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
## ✓ broom        0.7.7      ✓ rsample      0.1.0 
## ✓ dials        0.0.9      ✓ tibble       3.1.2 
## ✓ dplyr        1.0.7      ✓ tidyr        1.1.3 
## ✓ infer        0.5.4      ✓ tune         0.1.5 
## ✓ modeldata    0.1.0      ✓ workflows    0.2.2 
## ✓ parsnip      0.1.6      ✓ workflowsets 0.0.2 
## ✓ purrr        0.3.4      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.16
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks plotly::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

The data sets

The data sets chosen for this course allow us to illustrate the different features of the presented machine learning algorithms. Since the goal of this course is to demonstrate how to implement ML workflows, we make the assumption that you have already spent significant time cleaning and getting to know your data via EDA. This would allow you to perform many necessary tasks prior to the ML tasks outlined in this course such as:

  • Feature selection (i.e., removing unnecessary variables and retaining only those variables you wish to include in your modeling process).
  • Recoding variable names and values so that they are meaningful and more interpretable.
  • Recoding, removing, or some other approach to handling missing values.
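For readers who have not done this kind of preparation before, the three tasks above might look like the following in pandas. The data frame, column names, and the choice of median imputation are all hypothetical and chosen only for illustration; they are not taken from the course data sets.

```python
import numpy as np
import pandas as pd

# A small hypothetical raw data set (made up for illustration).
raw = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "sq_ft": [1500.0, np.nan, 2100.0, 1800.0],
    "nbhd": ["N", "S", "N", "S"],
    "price": [200000, 185000, 310000, 240000],
})

# Feature selection: drop a variable we do not want in the model.
clean = raw.drop(columns=["id"])

# Recode variable names so they are more interpretable.
clean = clean.rename(columns={"sq_ft": "square_feet", "nbhd": "neighborhood"})

# Recode values so they are more meaningful.
clean["neighborhood"] = clean["neighborhood"].map({"N": "North", "S": "South"})

# Handle missing values; median imputation is just one possible approach.
clean["square_feet"] = clean["square_feet"].fillna(clean["square_feet"].median())
```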

Consequently, the exemplar data sets we use throughout this course have, for the most part, gone through the necessary cleaning processes. As mentioned above, these are fairly common data sets that provide good benchmarks to compare and illustrate ML workflows. Although some of these data sets are available in R and/or Python, we will import them from a .csv file to ensure commonality regardless of language.

Boston housing

The Boston Housing data set is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. Originally published in Harrison Jr and Rubinfeld (1978), it contains 13 attributes to predict the median property value.

  • problem type: supervised regression
  • response variable: cmedv, the median value of owner-occupied homes in USD 1000’s (e.g., 21.8, 24.5)
  • features: 15 in the file used here (the 13 original attributes plus lon and lat)
  • observations: 506
  • objective: use property attributes to predict the median value of owner-occupied homes

# import pandas (first use in this module)
import pandas as pd

# access data
boston = pd.read_csv("data/boston.csv")

# initial dimensions
boston.shape
## (506, 16)
# features
boston.drop("cmedv", axis=1).head()
##       lon      lat     crim    zn  indus  ...  rad  tax  ptratio       b  lstat
## 0 -70.955  42.2550  0.00632  18.0   2.31  ...    1  296     15.3  396.90   4.98
## 1 -70.950  42.2875  0.02731   0.0   7.07  ...    2  242     17.8  396.90   9.14
## 2 -70.936  42.2830  0.02729   0.0   7.07  ...    2  242     17.8  392.83   4.03
## 3 -70.928  42.2930  0.03237   0.0   2.18  ...    3  222     18.7  394.63   2.94
## 4 -70.922  42.2980  0.06905   0.0   2.18  ...    3  222     18.7  396.90   5.33
## 
## [5 rows x 15 columns]
# response variable
boston["cmedv"].head()
## 0    24.0
## 1    21.6
## 2    34.7
## 3    33.4
## 4    36.2
## Name: cmedv, dtype: float64

# access data
boston <- readr::read_csv("data/boston.csv") 

# initial dimension
dim(boston)
## [1] 506  16

# features
dplyr::select(boston, -cmedv)
## # A tibble: 506 x 15
##      lon   lat    crim    zn indus  chas   nox    rm   age   dis   rad   tax
##    <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 -71.0  42.3 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296
##  2 -71.0  42.3 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242
##  3 -70.9  42.3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242
##  4 -70.9  42.3 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222
##  5 -70.9  42.3 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222
##  6 -70.9  42.3 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222
##  7 -70.9  42.3 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311
##  8 -70.9  42.3 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311
##  9 -70.9  42.3 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311
## 10 -70.9  42.3 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311
## # … with 496 more rows, and 3 more variables: ptratio <dbl>, b <dbl>,
## #   lstat <dbl>

# response variable
head(boston$cmedv)
## [1] 24.0 21.6 34.7 33.4 36.2 28.7

Pima Indians Diabetes

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and published in Smith et al. (1988). The data set contains 8 attributes to predict the presence of diabetes.

  • problem type: supervised binary classification
  • response variable: diabetes positive or negative response (i.e. “pos,” “neg”)
  • features: 8
  • observations: 768
  • objective: use biological attributes to predict the presence of diabetes

# Pandas has already been imported
# import pandas as pd

# access data
pima = pd.read_csv("data/pima.csv")

# initial dimensions
pima.shape
## (768, 9)
# features
pima.drop("diabetes", axis=1).head()
##    pregnant  glucose  pressure  triceps  insulin  mass  pedigree  age
## 0         6      148        72       35        0  33.6     0.627   50
## 1         1       85        66       29        0  26.6     0.351   31
## 2         8      183        64        0        0  23.3     0.672   32
## 3         1       89        66       23       94  28.1     0.167   21
## 4         0      137        40       35      168  43.1     2.288   33
# response variable
pima["diabetes"].head()
## 0    pos
## 1    neg
## 2    pos
## 3    neg
## 4    pos
## Name: diabetes, dtype: object

# access data
pima <- readr::read_csv("data/pima.csv") 

# initial dimension
dim(pima)
## [1] 768   9

# features
dplyr::select(pima, -diabetes)
## # A tibble: 768 x 8
##    pregnant glucose pressure triceps insulin  mass pedigree   age
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl>
##  1        6     148       72      35       0  33.6    0.627    50
##  2        1      85       66      29       0  26.6    0.351    31
##  3        8     183       64       0       0  23.3    0.672    32
##  4        1      89       66      23      94  28.1    0.167    21
##  5        0     137       40      35     168  43.1    2.29     33
##  6        5     116       74       0       0  25.6    0.201    30
##  7        3      78       50      32      88  31      0.248    26
##  8       10     115        0       0       0  35.3    0.134    29
##  9        2     197       70      45     543  30.5    0.158    53
## 10        8     125       96       0       0   0      0.232    54
## # … with 758 more rows

# response variable
head(pima$diabetes)
## [1] "pos" "neg" "pos" "neg" "pos" "neg"

Iris flowers

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (Fisher 1936). It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

  • problem type: supervised multinomial classification
  • response variable: species (i.e. “setosa,” “virginica,” “versicolor”)
  • features: 4
  • observations: 150
  • objective: use sepal and petal measurements to predict the species of flower

# Pandas has already been imported
# import pandas as pd

# access data
iris = pd.read_csv("data/iris.csv")

# initial dimensions
iris.shape
## (150, 5)
# features
iris.drop("Species", axis=1).head()
##    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
## 0           5.1          3.5           1.4          0.2
## 1           4.9          3.0           1.4          0.2
## 2           4.7          3.2           1.3          0.2
## 3           4.6          3.1           1.5          0.2
## 4           5.0          3.6           1.4          0.2
# response variable
iris["Species"].head()
## 0    setosa
## 1    setosa
## 2    setosa
## 3    setosa
## 4    setosa
## Name: Species, dtype: object

# access data
iris <- readr::read_csv("data/iris.csv") 

# initial dimension
dim(iris)
## [1] 150   5

# features
dplyr::select(iris, -Species)
## # A tibble: 150 x 4
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
##           <dbl>       <dbl>        <dbl>       <dbl>
##  1          5.1         3.5          1.4         0.2
##  2          4.9         3            1.4         0.2
##  3          4.7         3.2          1.3         0.2
##  4          4.6         3.1          1.5         0.2
##  5          5           3.6          1.4         0.2
##  6          5.4         3.9          1.7         0.4
##  7          4.6         3.4          1.4         0.3
##  8          5           3.4          1.5         0.2
##  9          4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## # … with 140 more rows

# response variable
head(iris$Species)
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

Ames housing

The Ames housing data set is an alternative to the Boston housing data set and provides a more comprehensive set of home features to predict sales price. More information can be found in De Cock (2011).

  • problem type: supervised regression
  • response variable: Sale_Price (e.g., $195,000, $215,000)
  • features: 80
  • observations: 2,930
  • objective: use property attributes to predict the sale price of a home

# Pandas has already been imported
# import pandas as pd

# access data
ames = pd.read_csv("data/ames.csv")

# initial dimensions
ames.shape
## (2930, 81)
# features
ames.drop("Sale_Price", axis=1).head()
##                            MS_SubClass  ...   Latitude
## 0  One_Story_1946_and_Newer_All_Styles  ...  42.054035
## 1  One_Story_1946_and_Newer_All_Styles  ...  42.053014
## 2  One_Story_1946_and_Newer_All_Styles  ...  42.052659
## 3  One_Story_1946_and_Newer_All_Styles  ...  42.051245
## 4             Two_Story_1946_and_Newer  ...  42.060899
## 
## [5 rows x 80 columns]
# response variable
ames["Sale_Price"].head()
## 0    215000
## 1    105000
## 2    172000
## 3    244000
## 4    189900
## Name: Sale_Price, dtype: int64

# access data
ames <- readr::read_csv("data/ames.csv") 

# initial dimension
dim(ames)
## [1] 2930   81

# features
dplyr::select(ames, -Sale_Price)
## # A tibble: 2,930 x 80
##    MS_SubClass      MS_Zoning    Lot_Frontage Lot_Area Street Alley   Lot_Shape 
##    <chr>            <chr>               <dbl>    <dbl> <chr>  <chr>   <chr>     
##  1 One_Story_1946_… Residential…          141    31770 Pave   No_All… Slightly_…
##  2 One_Story_1946_… Residential…           80    11622 Pave   No_All… Regular   
##  3 One_Story_1946_… Residential…           81    14267 Pave   No_All… Slightly_…
##  4 One_Story_1946_… Residential…           93    11160 Pave   No_All… Regular   
##  5 Two_Story_1946_… Residential…           74    13830 Pave   No_All… Slightly_…
##  6 Two_Story_1946_… Residential…           78     9978 Pave   No_All… Slightly_…
##  7 One_Story_PUD_1… Residential…           41     4920 Pave   No_All… Regular   
##  8 One_Story_PUD_1… Residential…           43     5005 Pave   No_All… Slightly_…
##  9 One_Story_PUD_1… Residential…           39     5389 Pave   No_All… Slightly_…
## 10 Two_Story_1946_… Residential…           60     7500 Pave   No_All… Regular   
## # … with 2,920 more rows, and 73 more variables: Land_Contour <chr>,
## #   Utilities <chr>, Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>,
## #   Condition_1 <chr>, Condition_2 <chr>, Bldg_Type <chr>, House_Style <chr>,
## #   Overall_Qual <chr>, Overall_Cond <chr>, Year_Built <dbl>,
## #   Year_Remod_Add <dbl>, Roof_Style <chr>, Roof_Matl <chr>,
## #   Exterior_1st <chr>, Exterior_2nd <chr>, Mas_Vnr_Type <chr>,
## #   Mas_Vnr_Area <dbl>, Exter_Qual <chr>, Exter_Cond <chr>, Foundation <chr>,
## #   Bsmt_Qual <chr>, Bsmt_Cond <chr>, Bsmt_Exposure <chr>,
## #   BsmtFin_Type_1 <chr>, BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <chr>,
## #   BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <chr>,
## #   Heating_QC <chr>, Central_Air <chr>, Electrical <chr>, First_Flr_SF <dbl>,
## #   Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, Gr_Liv_Area <dbl>,
## #   Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <dbl>,
## #   Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## #   Kitchen_Qual <chr>, TotRms_AbvGrd <dbl>, Functional <chr>,
## #   Fireplaces <dbl>, Fireplace_Qu <chr>, Garage_Type <chr>,
## #   Garage_Finish <chr>, Garage_Cars <dbl>, Garage_Area <dbl>,
## #   Garage_Qual <chr>, Garage_Cond <chr>, Paved_Drive <chr>,
## #   Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>,
## #   Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>,
## #   Pool_QC <chr>, Fence <chr>, Misc_Feature <chr>, Misc_Val <dbl>,
## #   Mo_Sold <dbl>, Year_Sold <dbl>, Sale_Type <chr>, Sale_Condition <chr>,
## #   Longitude <dbl>, Latitude <dbl>

# response variable
head(ames$Sale_Price)
## [1] 215000 105000 172000 244000 189900 195500

Attrition

The employee attrition data set was originally provided by IBM Watson Analytics Lab and is a fictional data set created by IBM data scientists to explore what employee attributes influence attrition.

  • problem type: supervised binomial classification
  • response variable: Attrition (i.e., “Yes,” “No”)
  • features: 30
  • observations: 1,470
  • objective: use employee attributes to predict if they will attrit (leave the company)

# Pandas has already been imported
# import pandas as pd

# access data
attrition = pd.read_csv("data/attrition.csv")

# initial dimensions
attrition.shape
## (1470, 31)
# features
attrition.drop("Attrition", axis=1).head()
##    Age     BusinessTravel  ...  YearsSinceLastPromotion YearsWithCurrManager
## 0   41      Travel_Rarely  ...                        0                    5
## 1   49  Travel_Frequently  ...                        1                    7
## 2   37      Travel_Rarely  ...                        0                    0
## 3   33  Travel_Frequently  ...                        3                    0
## 4   27      Travel_Rarely  ...                        2                    2
## 
## [5 rows x 30 columns]
# response variable
attrition["Attrition"].head()
## 0    Yes
## 1     No
## 2    Yes
## 3     No
## 4     No
## Name: Attrition, dtype: object

# access data
attrition <- readr::read_csv("data/attrition.csv") 

# initial dimension
dim(attrition)
## [1] 1470   31

# features
dplyr::select(attrition, -Attrition)
## # A tibble: 1,470 x 30
##      Age BusinessTravel   DailyRate Department      DistanceFromHome Education  
##    <dbl> <chr>                <dbl> <chr>                      <dbl> <chr>      
##  1    41 Travel_Rarely         1102 Sales                          1 College    
##  2    49 Travel_Frequent…       279 Research_Devel…                8 Below_Coll…
##  3    37 Travel_Rarely         1373 Research_Devel…                2 College    
##  4    33 Travel_Frequent…      1392 Research_Devel…                3 Master     
##  5    27 Travel_Rarely          591 Research_Devel…                2 Below_Coll…
##  6    32 Travel_Frequent…      1005 Research_Devel…                2 College    
##  7    59 Travel_Rarely         1324 Research_Devel…                3 Bachelor   
##  8    30 Travel_Rarely         1358 Research_Devel…               24 Below_Coll…
##  9    38 Travel_Frequent…       216 Research_Devel…               23 Bachelor   
## 10    36 Travel_Rarely         1299 Research_Devel…               27 Bachelor   
## # … with 1,460 more rows, and 24 more variables: EducationField <chr>,
## #   EnvironmentSatisfaction <chr>, Gender <chr>, HourlyRate <dbl>,
## #   JobInvolvement <chr>, JobLevel <dbl>, JobRole <chr>, JobSatisfaction <chr>,
## #   MaritalStatus <chr>, MonthlyIncome <dbl>, MonthlyRate <dbl>,
## #   NumCompaniesWorked <dbl>, OverTime <chr>, PercentSalaryHike <dbl>,
## #   PerformanceRating <chr>, RelationshipSatisfaction <chr>,
## #   StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## #   TrainingTimesLastYear <dbl>, WorkLifeBalance <chr>, YearsAtCompany <dbl>,
## #   YearsInCurrentRole <dbl>, YearsSinceLastPromotion <dbl>,
## #   YearsWithCurrManager <dbl>

# response variable
head(attrition$Attrition)
## [1] "Yes" "No"  "Yes" "No"  "No"  "No"
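As a quick check on the response, the sketch below rebuilds the six Attrition values printed above as a small pandas Series and tallies the classes. The Series is a stand-in for the full column rather than the real file, and the names `response`, `n_classes`, `counts`, and `is_yes` are illustrative assumptions.

```python
import pandas as pd

# First six Attrition values as printed above; a stand-in for the full column.
response = pd.Series(["Yes", "No", "Yes", "No", "No", "No"], name="Attrition")

# A text response with two distinct levels points to binary classification.
n_classes = response.nunique()

# Class frequencies show how balanced the two outcomes are in this small sample.
counts = response.value_counts()

# Many learners expect a numeric target, so one common choice is a 0/1 encoding.
is_yes = (response == "Yes").astype(int)

print(n_classes)
print(counts.to_dict())
print(is_yes.tolist())
```

The same three checks apply unchanged to the full `attrition["Attrition"]` column once the file is loaded.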

Hitters

This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University. The idea was to illustrate if and how major league baseball players’ batting performance could predict their salaries. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York. Note that the data contain each player’s name, but this column is not a valid feature and should be removed before analysis.

  • problem type: supervised regression
  • response variable: Salary
  • features: 19
  • observations: 322
  • objective: use baseball players’ batting attributes to predict their salaries.

# access data
hitters = pd.read_csv("data/hitters.csv")

# initial dimensions
hitters.shape
## (322, 21)

# features
hitters.drop(["Salary", "Player"], axis=1).head()
##    AtBat  Hits  HmRun  Runs  RBI  ...  Division  PutOuts  Assists  Errors  NewLeague
## 0    293    66      1    30   29  ...         E      446       33      20          A
## 1    315    81      7    24   38  ...         W      632       43      10          N
## 2    479   130     18    66   72  ...         W      880       82      14          A
## 3    496   141     20    65   78  ...         E      200       11       3          N
## 4    321    87     10    39   42  ...         E      805       40       4          N
## 
## [5 rows x 19 columns]

# response variable
hitters["Salary"].head()
## 0      NaN
## 1    475.0
## 2    480.0
## 3    500.0
## 4     91.5
## Name: Salary, dtype: float64
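Two details in the output above matter before modeling: `Player` is an identifier rather than a feature, and `Salary` is missing for at least the first row. The sketch below rebuilds a few columns of the first five rows printed above and separates features from the response after dropping rows with a missing salary. The player names are hypothetical placeholders, since the real names are not shown, and dropping incomplete rows is only one option, used here as an assumption rather than as this course's prescribed approach.

```python
import numpy as np
import pandas as pd

# First five rows of selected columns as printed above.
# Player names are hypothetical placeholders; the real names are not shown.
hitters_sample = pd.DataFrame({
    "Player": ["player_0", "player_1", "player_2", "player_3", "player_4"],
    "AtBat": [293, 315, 479, 496, 321],
    "Hits": [66, 81, 130, 141, 87],
    "HmRun": [1, 7, 18, 20, 10],
    "Salary": [np.nan, 475.0, 480.0, 500.0, 91.5],
})

# Count observations with no recorded salary.
n_missing = int(hitters_sample["Salary"].isna().sum())

# A supervised learner needs a known response, so keep only labeled rows.
labeled = hitters_sample.dropna(subset=["Salary"])

# Features exclude the response and the player identifier.
X = labeled.drop(columns=["Salary", "Player"])
y = labeled["Salary"]

print(n_missing)
print(X.shape)
print(y.tolist())
```

With the full file, the same steps would apply to `hitters` in place of `hitters_sample`.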

# access data
hitters <- readr::read_csv("data/hitters.csv") 

# initial dimension
dim(hitters)
## [1] 322  21

# features
dplyr::select(hitters, -Salary, -Player)
## # A tibble: 322 x 19
##    AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns  CRBI
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1   293    66     1    30    29    14     1    293    66      1    30    29
##  2   315    81     7    24    38    39    14   3449   835     69   321   414
##  3   479   130    18    66    72    76     3   1624   457     63   224   266
##  4   496   141    20    65    78    37    11   5628  1575    225   828   838
##  5   321    87    10    39    42    30     2    396   101     12    48    46
##  6   594   169     4    74    51    35    11   4408  1133     19   501   336
##  7   185    37     1    23     8    21     2    214    42      1    30     9
##  8   298    73     0    24    24     7     3    509   108      0    41    37
##  9   323    81     6    26    32     8     2    341    86      6    32    34
## 10   401    92    17    49    66    65    13   5206  1332    253   784   890
## # … with 312 more rows, and 7 more variables: CWalks <dbl>, League <chr>,
## #   Division <chr>, PutOuts <dbl>, Assists <dbl>, Errors <dbl>, NewLeague <chr>

# response variable
head(hitters$Salary)
## [1]    NA 475.0 480.0 500.0  91.5 750.0

Exercises

  1. Identify four real-life applications of supervised and unsupervised learning.
    • Explain what makes these problems supervised versus unsupervised.
    • For each problem identify the target variable (if applicable) and potential features.
  2. Identify and contrast a regression problem with a classification problem.
    • What is the target variable in each problem and why would being able to accurately predict this target be beneficial to society?
    • What are potential features and where could you collect this information?
    • What determines whether the problem is a regression or a classification problem?
  3. Identify three open source data sets suitable for machine learning (e.g., https://bit.ly/35wKu5c).
    • Explain the type of machine learning models that could be constructed from the data (e.g., supervised versus unsupervised and regression versus classification).
    • What are the dimensions of the data?
    • Is there a code book that explains who collected the data, why it was originally collected, and what each variable represents?
    • If the data set is suitable for supervised learning, which variable(s) could be considered as a useful target? Which variable(s) could be considered as features?
  4. Identify examples of misuse of machine learning in society. What was the ethical concern?
