Identify one of the open source datasets you found during Module 1's exercise. Create a new GitHub repository focused on analyzing this dataset, and write an initial report that answers the following:
- Assess the distribution of the target / response variable.
  - Is the response skewed?
  - Does applying a transformation normalize the distribution?
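A minimal sketch of this check, using a synthetic right-skewed series as a stand-in for your target (the actual column name and distribution depend on the dataset you chose):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed target, standing in for your real response variable
rng = np.random.default_rng(42)
y = pd.Series(rng.lognormal(mean=10, sigma=0.6, size=1000), name="target")

# Compare skewness before and after a log transformation
skew_raw = stats.skew(y)
skew_log = stats.skew(np.log(y))
print(f"raw skew: {skew_raw:.2f}, log skew: {skew_log:.2f}")
```

A histogram of `y` and `np.log(y)` (e.g. via `y.hist()`) makes the same comparison visually; if your response contains zeros or negatives, a Yeo-Johnson transform is a safer choice than a plain log.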
- Assess the dataset for missingness.
  - How many observations have missing values?
  - Plot the missing values. Do there appear to be any patterns in the missing values?
  - How do you think the different imputation approaches would impact modeling results?
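One way to quantify missingness, sketched on a small synthetic frame where missingness in one column is deliberately tied to another (the columns `a` and `b` are illustrative, not from your dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
# Inject a non-random pattern: b is missing whenever a is well below average
df.loc[df["a"] < -1, "b"] = np.nan

# How many observations have at least one missing value?
n_missing_rows = int(df.isna().any(axis=1).sum())
print(f"rows with any missing value: {n_missing_rows}")
# Per-column missing counts often hint at where the pattern lives
print(df.isna().sum())
```

For the plot, the `missingno` package's matrix view (or a heatmap of `df.isna()`) makes structured missingness like this visible; structure matters because mean imputation would erase it while model-based imputation can exploit it.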
- Assess the variance across the features.
  - Do any features have zero variance?
  - Do any features have near-zero variance?
- Assess the numeric features.
  - Do some features have significant skewness?
  - Do features have a wide range of values that would benefit from standardization?
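Both questions can be addressed in one pass with a power transform. A sketch on two synthetic features, one skewed and one on a very large scale (stand-ins for whatever your dataset contains):

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(2)
X = np.column_stack([
    rng.lognormal(0, 1, 1000),          # heavily right-skewed feature
    rng.normal(50_000, 10_000, 1000),   # wide-ranged feature
])

# Yeo-Johnson tolerates zeros/negatives; standardize=True also rescales
pt = PowerTransformer(method="yeo-johnson", standardize=True)
Xt = pt.fit_transform(X)

print("skew before:", round(float(stats.skew(X[:, 0])), 2),
      "after:", round(float(stats.skew(Xt[:, 0])), 2))
print("std after:", Xt.std(axis=0).round(2))
```

In your report, a table of per-feature skewness (e.g. `df.skew()`) and ranges (`df.describe()`) is usually enough to justify which transforms you pick.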
- Assess the categorical features.
  - Are categorical levels equally spread out across the features, or is “lumping” occurring?
  - Which features do you think should be one-hot or dummy encoded versus label encoded? Why?
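A sketch of checking level frequencies, lumping rare levels, and then one-hot encoding. The level names and the 5% rarity cutoff are illustrative assumptions:

```python
import pandas as pd

# Hypothetical categorical feature with one dominant level and several rare ones
s = pd.Series(["a"] * 900 + ["b"] * 60 + ["c"] * 25 + ["d"] * 10 + ["e"] * 5)
freq = s.value_counts(normalize=True)
print(freq)

# Lump levels that cover less than 5% of observations into "other"
rare = freq[freq < 0.05].index
lumped = s.where(~s.isin(rare), "other")
print(lumped.value_counts())

# One-hot encode: appropriate for nominal, low-cardinality features
dummies = pd.get_dummies(lumped, prefix="cat")
print(dummies.columns.tolist())
```

Label encoding imposes an ordering, so it is usually reserved for genuinely ordinal features (e.g. small/medium/large); nominal features like this one are safer one-hot encoded, with lumping keeping the column count manageable.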
- Execute a basic feature engineering process.
  - First, apply a KNN model to your data without pre-applying feature engineering processes.
  - Create and apply a blueprint of feature engineering processes that you think will help your model improve.
  - Now reapply the KNN model to your feature-engineered data.
  - Did your model performance improve?
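The before/after comparison above can be sketched with a scikit-learn pipeline as the "blueprint". Synthetic regression data stands in for your dataset, with some feature scales deliberately distorted so the effect of standardization on a distance-based model like KNN is visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your dataset; a few features put on huge scales
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X = X * np.array([1, 1000, 1, 1, 500, 1, 1, 2000])

# KNN without any feature engineering: distances dominated by large-scale features
raw_knn = KNeighborsRegressor(n_neighbors=5)
# Blueprint: standardize first, then fit KNN (swap in your own preprocessing steps)
engineered = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

score_raw = cross_val_score(raw_knn, X, y, cv=5, scoring="r2").mean()
score_fe = cross_val_score(engineered, X, y, cv=5, scoring="r2").mean()
print(f"R² raw: {score_raw:.3f}, with feature engineering: {score_fe:.3f}")
```

Putting the preprocessing inside the pipeline keeps the cross-validation honest: the scaler is refit on each training fold, so no information leaks from the held-out fold into the "blueprint".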