Machine learning (ML) continues to grow in importance for many organizations across nearly all domains. Some example applications of machine learning in practice include:
In essence, these tasks all seek to learn from data. To address each scenario, we can use a given set of features to train an algorithm and extract insights. These algorithms, or learners, can be classified according to the amount and type of supervision needed during training.
This module will introduce you to some fundamental concepts around ML and this class. By the end of this module you will:
A predictive model is used for tasks that involve the prediction of a given output (or target) using other variables (or features) in the data set. The learning algorithm in a predictive model attempts to discover and model the relationships among the target variable (the variable being predicted) and the other features (aka predictor variables). Examples of predictive modeling include:
Each of these examples has a defined learning task; they each intend to use attributes (\(X\)) to predict an outcome measurement (\(Y\)).
Throughout this course we’ll use various terms interchangeably for:
The predictive modeling examples above describe what is known as supervised learning. The supervision refers to the fact that the target values provide a supervisory role, which indicates to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the algorithmic steps) to find the combination of feature values that results in a predicted value that is as close to the actual target output as possible.
In supervised learning, the training data you feed the algorithm includes the target values. Consequently, the solutions can be used to help supervise the training process to find the optimal algorithm parameters.
Most supervised learning problems can be bucketed into one of two categories, regression or classification, which we discuss next.
When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling). Regression problems revolve around predicting output that falls on a continuum. In the examples above, predicting home sales prices and time to market are regression problems because the output is numeric and continuous. This means, given the combination of predictor values, the response value could fall anywhere along some continuous spectrum (e.g., the predicted sales price of a particular home could be between $80,000 and $755,000). The figure below illustrates average home sales prices as a function of two home features: year built and total square footage. Depending on the combination of these two features, the expected home sales price could fall anywhere along a plane.
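To make this concrete, here is a minimal sketch of a regression problem in Python with scikit-learn. The feature values and sale prices below are invented purely for illustration; the point is that the predicted output is a number that can fall anywhere on a continuum.
# A regression sketch: predict a continuous sale price from two numeric
# features (total square footage and year built). Data are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [1200, 1965],   # [total square footage, year built]
    [1850, 1990],
    [2400, 2005],
    [3100, 2012],
])
y = np.array([145_000, 240_000, 330_000, 455_000])  # made-up sale prices

reg = LinearRegression().fit(X, y)
reg.predict(np.array([[2000, 2000]]))  # a numeric prediction on a continuum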
When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem. Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:
However, when we apply machine learning models for classification problems, rather than predict a particular class (i.e., “yes” or “no”), we often want to predict the probability of a particular class (i.e., yes: 0.65, no: 0.35). By default, the class with the highest predicted probability becomes the predicted class. Consequently, even though we are performing a classification problem, we are still predicting a numeric output (probability). However, the essence of the problem still makes it a classification problem.
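As a small illustration, the sketch below fits a classifier to made-up data and shows both the predicted class probabilities and the resulting class label (the class with the highest predicted probability).
# A classification sketch: the model returns class probabilities, and the
# class with the highest probability becomes the predicted label.
# Data are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 1], [47, 3], [33, 0], [58, 5], [29, 2], [51, 4]])
y = np.array(["no", "yes", "no", "yes", "no", "yes"])   # known class labels

clf = LogisticRegression(max_iter=1000).fit(X, y)

clf.predict_proba(np.array([[55, 4]]))  # columns ordered as [P("no"), P("yes")]
clf.predict(np.array([[55, 4]]))        # the highest-probability class, e.g. "yes"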
Although there are machine learning algorithms that can be applied to regression problems but not classification and vice versa, most of the supervised learning algorithms we cover in this module can be applied to both. These algorithms have become the most popular machine learning applications in recent years.
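For example, scikit-learn exposes many algorithm families, such as random forests, in both a regressor and a classifier variant built on the same underlying approach. A small sketch with simulated data:
# The same algorithm family applied to both problem types (simulated data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(123)
X = rng.uniform(size=(100, 4))
y_numeric = X @ np.array([2.0, -1.0, 0.5, 3.0])                 # continuous target
y_class = np.where(y_numeric > y_numeric.mean(), "yes", "no")   # categorical target

RandomForestRegressor(n_estimators=100).fit(X, y_numeric)   # regression problem
RandomForestClassifier(n_estimators=100).fit(X, y_class)    # classification problem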
Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data, but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.
The goal of clustering is to segment observations into similar groups based on the observed variables; for example, to divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used to reduce the feature set to a potentially smaller set of uncorrelated variables. Such a reduced feature set is often used as input to downstream supervised learning models (e.g., principal component regression).
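As a small illustration of the clustering case, the sketch below segments made-up consumer data into two groups with k-means; the groups are discovered from the observed variables alone, with no target variable.
# A clustering sketch: k-means segments observations into similar groups
# using only the observed variables; no target is involved. Data are simulated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
segment_a = rng.normal(loc=[20, 500], scale=[5, 50], size=(50, 2))  # made-up consumer group
segment_b = rng.normal(loc=[60, 150], scale=[5, 50], size=(50, 2))  # another made-up group
X = np.vstack([segment_a, segment_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
kmeans.labels_[:5]   # cluster assignment for the first five observations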
Unsupervised learning is often performed as part of an exploratory data analysis (EDA). However, the exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. Furthermore, it can be hard to assess the quality of results obtained from unsupervised learning methods. The reason for this is simple. If we fit a predictive model using a supervised learning technique (e.g., linear regression), then it is possible to check our work by seeing how well our model predicts the response Y on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work because we don’t know the true answer; the problem is unsupervised!
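The sketch below illustrates that check for a supervised model: hold out a portion of the observations, fit on the remainder, and score the predictions against the known responses of the held-out set (simulated data).
# Checking our work in supervised learning: evaluate predictions on
# observations that were not used to fit the model. Simulated data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mean_squared_error(y_test, model.predict(X_test))   # error on unseen observations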
Despite its subjectivity, the importance of unsupervised learning should not be overlooked and such techniques are often used in organizations to:
These questions, and many more, can be addressed with unsupervised learning. Moreover, the outputs of unsupervised learning models can be used as inputs to downstream supervised learning models.
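For instance, the principal components produced by an unsupervised dimension reduction step can serve as inputs to a downstream regression model (principal component regression). A minimal sketch with simulated, highly correlated features:
# Unsupervised output feeding a supervised model: reduce correlated features
# with PCA, then regress the response on the components. Simulated data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
z = rng.normal(size=(150, 1))
X = np.hstack([z + rng.normal(scale=0.1, size=(150, 1)) for _ in range(5)])  # correlated features
y = 3 * z.ravel() + rng.normal(scale=0.2, size=150)

pcr = make_pipeline(PCA(n_components=2), LinearRegression()).fit(X, y)
pcr.score(X, y)   # R^2 of the downstream supervised model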
The goal of this course is to provide effective tools for uncovering relevant and useful patterns in your data by using the R and Python ML ecosystems with a focus on supervised learning. The progression of this course is designed around 4 themes:
The lessons are designed to help you understand the individual sub-tasks of an ML project. The focus is to have an intuitive understanding of each discrete sub-task. Once you understand when, where, and why these sub-tasks are performed you will be able to transfer this knowledge to other projects. The concepts you will learn include:
To help your understanding, we provide code recipes in both R and Python so that you can start implementing these ML sub-tasks in both languages. These recipes will include:
At the end of each module we provide additional exercises for you to perform. These exercises force you to apply your new knowledge to different data sets. For the exercises we use small, well-understood data sets because:
Newcomers to the world of ML can have a difficult time extrapolating what they have learned and applying the ML process to new, more complex data sets that don’t have benchmark examples. The portfolio builder exercises are designed to get you working through this challenge by identifying new data sets and applying the ML process to uncover patterns in the data. Whether the patterns you find are strong or weak is not the most important outcome; the real goal is to get comfortable working through the ML process on new data sets for which worked examples are not prevalent.
You will always find debate about which language is “best” for machine learning, Python or R. Unfortunately, this is a poor way to think about ML and these two languages.
First, your objective should be to understand the fundamental machine learning concepts. Second, you should have a solid understanding of how to apply these concepts in either language because which one you use can largely be driven by the culture of the organization you work for.
Consequently, this course will illustrate how to apply machine learning in both languages. Code recipes will be supplied in Python and R tabs as illustrated here:
Scikit-learn is the predominant Python package for machine learning. Unlike R, scikit-learn provides nearly all components required for the modeling process (e.g., sampling, feature engineering, modeling, evaluation). Scikit-learn is part of the SciPy ecosystem, which is a group of Python libraries for mathematics, science, and engineering. Other packages that you will commonly use for ML in Python include:
The following Python packages are used throughout this module. You may want to use a virtual environment; however, most code recipes should run regardless of small deviations in package versions.
# data management
pip install -U pandas
pip install -U numpy
# data visualization
pip install -U matplotlib
pip install -U plotnine
# modeling
pip install -U scikit-learn
import sklearn
sklearn.__version__
## '0.24.2'
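Because scikit-learn covers sampling, feature engineering, modeling, and evaluation, a short script can run an entire (simplified) modeling workflow. The sketch below uses synthetic data; the particular preprocessing step and estimator are illustrative choices, not requirements of this course.
# sampling, feature engineering, modeling, and evaluation with scikit-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split   # sampling
from sklearn.preprocessing import StandardScaler       # feature engineering
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression    # modeling
from sklearn.metrics import accuracy_score             # evaluation

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))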
Historically, the R ecosystem has provided a wide variety of ML algorithm implementations. This has its benefits; however, it also has drawbacks, as it requires users to learn many different formula interfaces and syntax nuances. More recently, development of a group of packages called tidymodels has helped to make implementation easier.
Whereas in Python you can perform most, if not all, of the ML sub-tasks with scikit-learn, the tidymodels collection allows you to perform discrete parts of the ML workflow with discrete packages:
The following R packages are used throughout this module. You may want to use a virtual environment; however, most code recipes should run regardless of small deviations in package versions. Note that when you install tidymodels you are actually installing several packages that exist in the tidymodels framework as discussed above.
# common data wrangling and visualization
install.packages("tidyverse")
install.packages("vip")
# modeling
install.packages("tidymodels")
packageVersion("tidymodels")
## [1] '0.1.3'
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
## ✓ broom 0.7.7 ✓ rsample 0.1.0
## ✓ dials 0.0.9 ✓ tibble 3.1.2
## ✓ dplyr 1.0.7 ✓ tidyr 1.1.3
## ✓ infer 0.5.4 ✓ tune 0.1.5
## ✓ modeldata 0.1.0 ✓ workflows 0.2.2
## ✓ parsnip 0.1.6 ✓ workflowsets 0.0.2
## ✓ purrr 0.3.4 ✓ yardstick 0.0.8
## ✓ recipes 0.1.16
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
The data sets chosen for this course allow us to illustrate the different features of the presented machine learning algorithms. Since the goal of this course is to demonstrate how to implement ML workflows, we assume that you have already spent significant time cleaning and getting to know your data via EDA. That upfront work allows you to perform many necessary tasks prior to the ML tasks outlined in this course, such as:
Consequently, the exemplar data sets we use throughout this course have, for the most part, gone through the necessary cleaning processes. As mentioned above, these data sets are fairly common data sets that provide good benchmarks to compare and illustrate ML workflows. Although some of these data sets are available in R and/or Python, we will import them from a .csv file to ensure commonality regardless of language.
The Boston Housing data set is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. Originally published in Harrison Jr and Rubinfeld (1978), it contains 13 attributes (plus longitude and latitude coordinates in the version used here) to predict the median property value.
cmedv
median value of owner-occupied homes in USD 1000’s (e.g., 21.8, 24.5)
import pandas as pd
# access data
boston = pd.read_csv("data/boston.csv")
# initial dimensions
boston.shape
## (506, 16)
# features
boston.drop("cmedv", axis=1).head()
## lon lat crim zn indus ... rad tax ptratio b lstat
## 0 -70.955 42.2550 0.00632 18.0 2.31 ... 1 296 15.3 396.90 4.98
## 1 -70.950 42.2875 0.02731 0.0 7.07 ... 2 242 17.8 396.90 9.14
## 2 -70.936 42.2830 0.02729 0.0 7.07 ... 2 242 17.8 392.83 4.03
## 3 -70.928 42.2930 0.03237 0.0 2.18 ... 3 222 18.7 394.63 2.94
## 4 -70.922 42.2980 0.06905 0.0 2.18 ... 3 222 18.7 396.90 5.33
##
## [5 rows x 15 columns]
# response variable
boston["cmedv"].head()
## 0 24.0
## 1 21.6
## 2 34.7
## 3 33.4
## 4 36.2
## Name: cmedv, dtype: float64
# access data
boston <- readr::read_csv("data/boston.csv")
# initial dimension
dim(boston)
## [1] 506 16
# features
dplyr::select(boston, -cmedv)
## # A tibble: 506 x 15
## lon lat crim zn indus chas nox rm age dis rad tax
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -71.0 42.3 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296
## 2 -71.0 42.3 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242
## 3 -70.9 42.3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242
## 4 -70.9 42.3 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222
## 5 -70.9 42.3 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222
## 6 -70.9 42.3 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222
## 7 -70.9 42.3 0.0883 12.5 7.87 0 0.524 6.01 66.6 5.56 5 311
## 8 -70.9 42.3 0.145 12.5 7.87 0 0.524 6.17 96.1 5.95 5 311
## 9 -70.9 42.3 0.211 12.5 7.87 0 0.524 5.63 100 6.08 5 311
## 10 -70.9 42.3 0.170 12.5 7.87 0 0.524 6.00 85.9 6.59 5 311
## # … with 496 more rows, and 3 more variables: ptratio <dbl>, b <dbl>,
## # lstat <dbl>
# response variable
head(boston$cmedv)
## [1] 24.0 21.6 34.7 33.4 36.2 28.7
A population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and published in Smith et al. (1988); the data set contains 8 attributes to predict the presence of diabetes.
diabetes
positive or negative response (i.e., “pos”, “neg”)
# Pandas has already been imported
# import pandas as pd
# access data
pima = pd.read_csv("data/pima.csv")
# initial dimensions
pima.shape
## (768, 9)
# features
pima.drop("diabetes", axis=1).head()
## pregnant glucose pressure triceps insulin mass pedigree age
## 0 6 148 72 35 0 33.6 0.627 50
## 1 1 85 66 29 0 26.6 0.351 31
## 2 8 183 64 0 0 23.3 0.672 32
## 3 1 89 66 23 94 28.1 0.167 21
## 4 0 137 40 35 168 43.1 2.288 33
# response variable
pima["diabetes"].head()
## 0 pos
## 1 neg
## 2 pos
## 3 neg
## 4 pos
## Name: diabetes, dtype: object
# access data
pima <- readr::read_csv("data/pima.csv")
# initial dimension
dim(pima)
## [1] 768 9
# features
dplyr::select(pima, -diabetes)
## # A tibble: 768 x 8
## pregnant glucose pressure triceps insulin mass pedigree age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6 0.627 50
## 2 1 85 66 29 0 26.6 0.351 31
## 3 8 183 64 0 0 23.3 0.672 32
## 4 1 89 66 23 94 28.1 0.167 21
## 5 0 137 40 35 168 43.1 2.29 33
## 6 5 116 74 0 0 25.6 0.201 30
## 7 3 78 50 32 88 31 0.248 26
## 8 10 115 0 0 0 35.3 0.134 29
## 9 2 197 70 45 543 30.5 0.158 53
## 10 8 125 96 0 0 0 0.232 54
## # … with 758 more rows
# response variable
head(pima$diabetes)
## [1] "pos" "neg" "pos" "neg" "pos" "neg"
The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (Fisher 1936). It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
species
Iris species (i.e., “setosa”, “virginica”, “versicolor”)
# Pandas has already been imported
# import pandas as pd
# access data
iris = pd.read_csv("data/iris.csv")
# initial dimensions
iris.shape
## (150, 5)
# features
iris.drop("Species", axis=1).head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
# response variable
iris["Species"].head()
## 0 setosa
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## Name: Species, dtype: object
# access data
iris <- readr::read_csv("data/iris.csv")
# initial dimension
dim(iris)
## [1] 150 5
# features
dplyr::select(iris, -Species)
## # A tibble: 150 x 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # … with 140 more rows
# response variable
head(iris$Species)
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
The Ames housing data set is an alternative to the Boston housing data set and provides a more comprehensive set of home features to predict sales price. More information can be found in De Cock (2011).
Sale_Price
sale price in USD (i.e., $195,000, $215,000)
# Pandas has already been imported
# import pandas as pd
# access data
ames = pd.read_csv("data/ames.csv")
# initial dimensions
ames.shape
## (2930, 81)
# features
ames.drop("Sale_Price", axis=1).head()
## MS_SubClass ... Latitude
## 0 One_Story_1946_and_Newer_All_Styles ... 42.054035
## 1 One_Story_1946_and_Newer_All_Styles ... 42.053014
## 2 One_Story_1946_and_Newer_All_Styles ... 42.052659
## 3 One_Story_1946_and_Newer_All_Styles ... 42.051245
## 4 Two_Story_1946_and_Newer ... 42.060899
##
## [5 rows x 80 columns]
# response variable
ames["Sale_Price"].head()
## 0 215000
## 1 105000
## 2 172000
## 3 244000
## 4 189900
## Name: Sale_Price, dtype: int64
# access data
ames <- readr::read_csv("data/ames.csv")
# initial dimension
dim(ames)
## [1] 2930 81
# features
dplyr::select(ames, -Sale_Price)
## # A tibble: 2,930 x 80
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 One_Story_1946_… Residential… 141 31770 Pave No_All… Slightly_…
## 2 One_Story_1946_… Residential… 80 11622 Pave No_All… Regular
## 3 One_Story_1946_… Residential… 81 14267 Pave No_All… Slightly_…
## 4 One_Story_1946_… Residential… 93 11160 Pave No_All… Regular
## 5 Two_Story_1946_… Residential… 74 13830 Pave No_All… Slightly_…
## 6 Two_Story_1946_… Residential… 78 9978 Pave No_All… Slightly_…
## 7 One_Story_PUD_1… Residential… 41 4920 Pave No_All… Regular
## 8 One_Story_PUD_1… Residential… 43 5005 Pave No_All… Slightly_…
## 9 One_Story_PUD_1… Residential… 39 5389 Pave No_All… Slightly_…
## 10 Two_Story_1946_… Residential… 60 7500 Pave No_All… Regular
## # … with 2,920 more rows, and 73 more variables: Land_Contour <chr>,
## # Utilities <chr>, Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>,
## # Condition_1 <chr>, Condition_2 <chr>, Bldg_Type <chr>, House_Style <chr>,
## # Overall_Qual <chr>, Overall_Cond <chr>, Year_Built <dbl>,
## # Year_Remod_Add <dbl>, Roof_Style <chr>, Roof_Matl <chr>,
## # Exterior_1st <chr>, Exterior_2nd <chr>, Mas_Vnr_Type <chr>,
## # Mas_Vnr_Area <dbl>, Exter_Qual <chr>, Exter_Cond <chr>, Foundation <chr>,
## # Bsmt_Qual <chr>, Bsmt_Cond <chr>, Bsmt_Exposure <chr>,
## # BsmtFin_Type_1 <chr>, BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <chr>,
## # BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <chr>,
## # Heating_QC <chr>, Central_Air <chr>, Electrical <chr>, First_Flr_SF <dbl>,
## # Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, Gr_Liv_Area <dbl>,
## # Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <dbl>,
## # Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## # Kitchen_Qual <chr>, TotRms_AbvGrd <dbl>, Functional <chr>,
## # Fireplaces <dbl>, Fireplace_Qu <chr>, Garage_Type <chr>,
## # Garage_Finish <chr>, Garage_Cars <dbl>, Garage_Area <dbl>,
## # Garage_Qual <chr>, Garage_Cond <chr>, Paved_Drive <chr>,
## # Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>,
## # Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>,
## # Pool_QC <chr>, Fence <chr>, Misc_Feature <chr>, Misc_Val <dbl>,
## # Mo_Sold <dbl>, Year_Sold <dbl>, Sale_Type <chr>, Sale_Condition <chr>,
## # Longitude <dbl>, Latitude <dbl>
# response variable
head(ames$Sale_Price)
## [1] 215000 105000 172000 244000 189900 195500
The employee attrition data set was originally provided by IBM Watson Analytics Lab and is a fictional data set created by IBM data scientists to explore what employee attributes influence attrition.
Attrition
whether the employee left the organization (i.e., “Yes”, “No”)
# Pandas has already been imported
# import pandas as pd
# access data
attrition = pd.read_csv("data/attrition.csv")
# initial dimensions
attrition.shape
## (1470, 31)
# features
attrition.drop("Attrition", axis=1).head()
## Age BusinessTravel ... YearsSinceLastPromotion YearsWithCurrManager
## 0 41 Travel_Rarely ... 0 5
## 1 49 Travel_Frequently ... 1 7
## 2 37 Travel_Rarely ... 0 0
## 3 33 Travel_Frequently ... 3 0
## 4 27 Travel_Rarely ... 2 2
##
## [5 rows x 30 columns]
# response variable
attrition["Attrition"].head()
## 0 Yes
## 1 No
## 2 Yes
## 3 No
## 4 No
## Name: Attrition, dtype: object
# access data
attrition <- readr::read_csv("data/attrition.csv")
# initial dimension
dim(attrition)
## [1] 1470 31
# features
dplyr::select(attrition, -Attrition)
## # A tibble: 1,470 x 30
## Age BusinessTravel DailyRate Department DistanceFromHome Education
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 41 Travel_Rarely 1102 Sales 1 College
## 2 49 Travel_Frequent… 279 Research_Devel… 8 Below_Coll…
## 3 37 Travel_Rarely 1373 Research_Devel… 2 College
## 4 33 Travel_Frequent… 1392 Research_Devel… 3 Master
## 5 27 Travel_Rarely 591 Research_Devel… 2 Below_Coll…
## 6 32 Travel_Frequent… 1005 Research_Devel… 2 College
## 7 59 Travel_Rarely 1324 Research_Devel… 3 Bachelor
## 8 30 Travel_Rarely 1358 Research_Devel… 24 Below_Coll…
## 9 38 Travel_Frequent… 216 Research_Devel… 23 Bachelor
## 10 36 Travel_Rarely 1299 Research_Devel… 27 Bachelor
## # … with 1,460 more rows, and 24 more variables: EducationField <chr>,
## # EnvironmentSatisfaction <chr>, Gender <chr>, HourlyRate <dbl>,
## # JobInvolvement <chr>, JobLevel <dbl>, JobRole <chr>, JobSatisfaction <chr>,
## # MaritalStatus <chr>, MonthlyIncome <dbl>, MonthlyRate <dbl>,
## # NumCompaniesWorked <dbl>, OverTime <chr>, PercentSalaryHike <dbl>,
## # PerformanceRating <chr>, RelationshipSatisfaction <chr>,
## # StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## # TrainingTimesLastYear <dbl>, WorkLifeBalance <chr>, YearsAtCompany <dbl>,
## # YearsInCurrentRole <dbl>, YearsSinceLastPromotion <dbl>,
## # YearsWithCurrManager <dbl>
# response variable
head(attrition$Attrition)
## [1] "Yes" "No" "Yes" "No" "No" "No"
This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University. The idea was to illustrate whether and how Major League Baseball players’ batting performance could predict their salary. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York. Note that the data contain the player names, but this variable is not a valid feature and should be removed during analysis.
Salary
1987 annual salary on opening day, in thousands of dollars (e.g., 475.0, 91.5)
# access data
hitters = pd.read_csv("data/hitters.csv")
# initial dimensions
hitters.shape
## (322, 21)
# features
hitters.drop(["Salary", "Player"], axis=1).head()
## AtBat Hits HmRun Runs RBI ... Division PutOuts Assists Errors NewLeague
## 0 293 66 1 30 29 ... E 446 33 20 A
## 1 315 81 7 24 38 ... W 632 43 10 N
## 2 479 130 18 66 72 ... W 880 82 14 A
## 3 496 141 20 65 78 ... E 200 11 3 N
## 4 321 87 10 39 42 ... E 805 40 4 N
##
## [5 rows x 19 columns]
# response variable
hitters["Salary"].head()
## 0 NaN
## 1 475.0
## 2 480.0
## 3 500.0
## 4 91.5
## Name: Salary, dtype: float64
# access data
hitters <- readr::read_csv("data/hitters.csv")
# initial dimension
dim(hitters)
## [1] 322 21
# features
dplyr::select(hitters, -Salary, -Player)
## # A tibble: 322 x 19
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 293 66 1 30 29 14 1 293 66 1 30 29
## 2 315 81 7 24 38 39 14 3449 835 69 321 414
## 3 479 130 18 66 72 76 3 1624 457 63 224 266
## 4 496 141 20 65 78 37 11 5628 1575 225 828 838
## 5 321 87 10 39 42 30 2 396 101 12 48 46
## 6 594 169 4 74 51 35 11 4408 1133 19 501 336
## 7 185 37 1 23 8 21 2 214 42 1 30 9
## 8 298 73 0 24 24 7 3 509 108 0 41 37
## 9 323 81 6 26 32 8 2 341 86 6 32 34
## 10 401 92 17 49 66 65 13 5206 1332 253 784 890
## # … with 312 more rows, and 7 more variables: CWalks <dbl>, League <chr>,
## # Division <chr>, PutOuts <dbl>, Assists <dbl>, Errors <dbl>, NewLeague <chr>
# response variable
head(hitters$Salary)
## [1] NA 475.0 480.0 500.0 91.5 750.0
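Before modeling these data, you would typically drop the Player column and remove the rows with a missing Salary response. A minimal pandas sketch, using the same CSV file imported above:
# prepare the hitters data: drop the non-predictive Player column and the
# rows where the Salary response is missing
import pandas as pd

hitters = pd.read_csv("data/hitters.csv")
hitters_clean = hitters.drop(columns=["Player"]).dropna(subset=["Salary"])
hitters_clean.shape   # fewer than the original 322 rows once missing salaries are removed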