Identify one of the open-source datasets you found during module 1’s exercise. Create a new GitHub repository that will focus on analyzing this dataset. Write up an initial report that answers the following:

  1. Assess the distribution of the target / response variable (see sketch 1 after this list).
    • Is the response skewed?
    • Does applying a transformation normalize the distribution?
  2. Assess the dataset for missingness (see sketch 2 below).
    • How many observations have missing values?
    • Plot the missing values. Do any patterns appear in where the values are missing?
    • How do you think the different imputation approaches would impact modeling results?
  3. Assess the variance across the features (see sketch 3 below).
    • Do any features have zero variance?
    • Do any features have near-zero variance?
  4. Assess the numeric features (see sketch 4 below).
    • Do some features have significant skewness?
    • Do features have a wide range of values that would benefit from standardization?
  5. Assess the categorical features (see sketch 5 below).
    • Are categorical levels spread evenly across the features, or do a few levels dominate so that “lumping” rare levels together may be warranted?
    • Which features do you think should be one-hot or dummy encoded versus label encoded, and why?
  6. Execute a basic feature engineering process (see sketch 6 below).
    • First, apply a KNN model to your data without applying any feature engineering processes.
    • Create and apply a blueprint of feature engineering processes that you think will help your model improve.
    • Now reapply the KNN model to the feature-engineered data.
    • Did your model performance improve?
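
Sketch 1 (target distribution). A minimal sketch, assuming a tabular CSV and a numeric response; your_dataset.csv and the target column are placeholder names to swap for your own data, and the code is Python (pandas/SciPy/matplotlib), so adapt it to whatever tooling your course uses.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("your_dataset.csv")   # placeholder path -- use your own dataset
y = df["target"].dropna()              # placeholder response column name

# Skewness near 0 suggests rough symmetry; |skew| > 1 is usually read as heavy skew.
print("skewness (raw):  ", stats.skew(y))

# A log transform often pulls a right-skewed response toward normal;
# log1p keeps zero values safe. Box-Cox / Yeo-Johnson are common alternatives.
y_log = np.log1p(y)
print("skewness (log1p):", stats.skew(y_log))

# Side-by-side histograms to judge whether the transform normalized the shape.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(y, bins=30)
axes[0].set_title("raw response")
axes[1].hist(y_log, bins=30)
axes[1].set_title("log1p response")
plt.show()
```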
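
Sketch 2 (missingness). A rough count and plot of missing values using pandas and matplotlib (the missingno package offers nicer plots if you prefer); the path is again a placeholder. On the imputation question, it helps to recall that mean/median imputation shrinks variance and can flatten relationships between features, while KNN or model-based imputation preserves more structure at higher computational cost.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("your_dataset.csv")   # placeholder path

# How many observations (rows) contain at least one missing value?
n_missing_rows = df.isna().any(axis=1).sum()
print(f"{n_missing_rows} of {len(df)} rows have at least one missing value")
print(df.isna().sum().sort_values(ascending=False).head(10))  # worst columns

# Crude missingness "heatmap": dark cells are missing entries. Dark bands across
# the same rows or columns hint that the missingness is not completely random.
plt.imshow(df.isna().to_numpy(), aspect="auto", interpolation="none", cmap="gray_r")
plt.xlabel("feature index")
plt.ylabel("row index")
plt.title("missing-value pattern")
plt.show()
```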
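
Sketch 3 (variance). One way to flag zero- and near-zero-variance features; the near-zero heuristic below mimics caret-style defaults (frequency ratio above 95/5 and under 10% unique values), and those thresholds are assumptions you can tune.

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")   # placeholder path

# Zero variance: a column with a single unique (non-null) value carries no information.
zero_var = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

def near_zero_var(s: pd.Series, freq_ratio: float = 95 / 5, unique_pct: float = 10.0) -> bool:
    """Flag columns whose most common value dominates and that have few unique values."""
    counts = s.value_counts(dropna=True)
    if len(counts) <= 1:
        return True
    dominance = counts.iloc[0] / counts.iloc[1]          # most common vs. second most common
    pct_unique = 100 * s.nunique(dropna=True) / len(s)   # share of distinct values
    return dominance > freq_ratio and pct_unique < unique_pct

nzv = [c for c in df.columns if c not in zero_var and near_zero_var(df[c])]
print("zero variance:     ", zero_var)
print("near-zero variance:", nzv)
```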
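
Sketch 4 (numeric features). A quick per-column summary of skewness and scale; large absolute skew points at candidates for a log or Yeo-Johnson transform, and standard deviations or ranges that differ by orders of magnitude point at standardization, which matters a great deal for a distance-based model such as KNN.

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")        # placeholder path
num = df.select_dtypes(include="number")

# Sort by absolute skewness so the most problematic columns float to the top.
summary = pd.DataFrame({
    "skew": num.skew(),
    "min":  num.min(),
    "max":  num.max(),
    "std":  num.std(),
}).sort_values("skew", key=lambda s: s.abs(), ascending=False)

print(summary)
```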
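
Sketch 5 (categorical features). Level frequencies show whether categories are spread evenly or dominated by a few levels; the last lines show one way to lump rare levels (here, any level covering under 1% of rows, an arbitrary cutoff) into an "other" bucket, and some_categorical_feature is a placeholder name. As a rule of thumb, unordered nominal features are safer one-hot/dummy encoded, while genuinely ordered levels (e.g. quality ratings) can be label/ordinal encoded without imposing a false order.

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")                     # placeholder path
cat = df.select_dtypes(include=["object", "category"])

# How concentrated is each categorical feature?
for col in cat.columns:
    freqs = df[col].value_counts(normalize=True)
    print(f"{col}: {df[col].nunique()} levels, "
          f"top level covers {freqs.iloc[0]:.0%} of rows")

# Lump levels that each cover < 1% of rows into a single 'other' bucket.
col = "some_categorical_feature"                         # placeholder column name
rare = df[col].value_counts(normalize=True).loc[lambda s: s < 0.01].index
df[col] = df[col].where(~df[col].isin(rare), "other")
```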
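
Sketch 6 (KNN before and after feature engineering). A minimal sketch of the before/after comparison using scikit-learn pipelines, assuming a regression response (swap in KNeighborsClassifier and a classification metric otherwise); the ColumnTransformer here stands in for whatever preprocessing "blueprint" your course's tooling builds. The baseline still gets median imputation only because KNN cannot run with missing values at all.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder path and response name; rows with a missing response are dropped.
df = pd.read_csv("your_dataset.csv").dropna(subset=["target"])
y = df["target"]
X = df.drop(columns=["target"])

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Baseline: numeric columns only, no scaling or encoding (median imputation is the
# bare minimum added so KNN can run at all).
baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])
base_rmse = -cross_val_score(baseline, X[num_cols], y, cv=5,
                             scoring="neg_root_mean_squared_error").mean()

# "Blueprint": impute, standardize numeric features, one-hot encode categoricals.
blueprint = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
engineered = Pipeline([("prep", blueprint),
                       ("knn", KNeighborsRegressor(n_neighbors=10))])
fe_rmse = -cross_val_score(engineered, X, y, cv=5,
                           scoring="neg_root_mean_squared_error").mean()

print(f"baseline CV RMSE:   {base_rmse:.3f}")
print(f"engineered CV RMSE: {fe_rmse:.3f}")
```

Cross-validation is used here so both fits are compared on the same resampling scheme; a single train/test split would work just as well.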
