class: misk-title-slide

<br><br><br><br><br>
# .font140[K-nearest Neighbor]

---
# Prerequisites

.pull-left[

```r
# Helper packages
library(dplyr)      # for data wrangling
library(ggplot2)    # for awesome graphics
library(rsample)    # for creating validation splits
library(recipes)    # for feature engineering

# Modeling packages
library(caret)      # for fitting KNN models
```

]

.pull-right[

```r
# Ames housing data
ames <- AmesHousing::make_ames()
set.seed(123)
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)

# create training (70%) set for the rsample::attrition data
attrit <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)
set.seed(123)
churn_split <- initial_split(attrit, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)

# import MNIST training data
mnist <- dslabs::read_mnist()
names(mnist)
## [1] "train" "test"
```

]

---
# Measuring similarity

.pull-left[

* The KNN algorithm identifies `\(k\)` observations that are "similar" or nearest to the new record being predicted

* Real estate analogy --> realtors determine a home's listing (or market) price based on "comps" (comparable homes)

* Comps are homes with very similar attributes to the one being sold (e.g., square footage, number of rooms, style of the home, neighborhood, and school district)

]

.pull-right[
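To make the analogy concrete, here is a minimal base-R sketch (not the `caret` workflow used later) that prices a "new" home by averaging the `Sale_Price` of its five nearest comps; the two features and `k = 5` are arbitrary choices for illustration:

```r
feats <- c("Gr_Liv_Area", "Year_Built")
train_std <- scale(ames_train[, feats])   # put both features on a common scale
new_home  <- train_std[1, ]               # pretend the first row is a new listing

# Euclidean distance from the new home to every training home
d <- sqrt(rowSums(sweep(train_std, 2, new_home)^2))

# average Sale_Price of the 5 nearest comps (skip the home itself)
k  <- 5
nn <- order(d)[2:(k + 1)]
mean(ames_train$Sale_Price[nn])
```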
]

---
# Distance measures

How do we determine the similarity between observations (or homes as in the previous example)?

.pull-left[

`\begin{equation}
\text{Euclidean: }\sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}
\end{equation}`

]

.pull-right[

`\begin{equation}
\text{Manhattan: }\sum^P_{j=1} | x_{aj} - x_{bj} |
\end{equation}`

]

To illustrate, let's look at two homes:

```r
(two_houses <- ames_train[1:2, c("Gr_Liv_Area", "Year_Built")])
## # A tibble: 2 x 2
##   Gr_Liv_Area Year_Built
##         <int>      <int>
## 1        1656       1960
## 2         896       1961
```

---
# Distance measures

How do we determine the similarity between observations (or homes as in the previous example)?

.pull-left[

`\begin{equation}
\text{Euclidean: }\sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}
\end{equation}`

```r
dist(two_houses, method = "euclidean")
##        1
## 2 760.0007
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

]

.pull-right[

`\begin{equation}
\text{Manhattan: }\sum^P_{j=1} | x_{aj} - x_{bj} |
\end{equation}`

```r
dist(two_houses, method = "manhattan")
##     1
## 2 761
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

]

---
# Pre-processing

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.

* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

]

.pull-right[

```r
home1
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home1             4       2008   423
home2
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home2             2       2008   424
home3
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home3             3       1998     6
```

]

---
# Pre-processing

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.

* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

]

.pull-right[

```r
features <- c("Bedroom_AbvGr", "Year_Built")

# distance between home 1 and 2
dist(rbind(home1[,features], home2[,features]))
##   1
## 2 2

# distance between home 1 and 3
dist(rbind(home1[,features], home3[,features]))
##          1
## 2 10.04988
```

]

<br>

.center.bold[The Euclidean distance between `home1` and `home3` is larger than that between `home1` and `home2` because the 10-year gap in `Year_Built` dominates the unscaled distance, even though `home1` and `home2` differ more in bedrooms.]

---
# Pre-processing

.scrollable90[

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.
* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

* .bold[Standardizing eliminates these magnitude differences.]

]

.pull-right[

```r
home1_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home1          1.38       1.21   423
home2_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home2         -1.03       1.21   424
home3_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home3         0.176      0.881     6

# distance between home 1 and 2
dist(rbind(home1_std[,features], home2_std[,features]))
##          1
## 2 2.416244

# distance between home 1 and 3
dist(rbind(home1_std[,features], home3_std[,features]))
##          1
## 2 1.252547
```

]]

---
# Choosing K

.scrollable90[

.pull-left[

* `\(k\)` is our one hyperparameter!

* When `\(k = 1\)`, we base our prediction on a single observation that has the closest distance measure.

* When `\(k = n\)`, we are simply using the average (regression) or most common class (classification) across all training samples as our predicted value (see the sketch below these bullets).

* No general rule about the best `\(k\)` as it depends greatly on the nature of the data.

* For high-signal data with very few noisy (irrelevant) features, smaller values of `\(k\)` tend to work best. As more irrelevant features are involved, larger values of `\(k\)` are required to smooth out the noise.

* .bold[Pro Tip]: When using KNN for classification, it is best to assess odd numbers for `\(k\)` to avoid ties in the event there is an equal proportion of response levels.
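A quick sketch of these two extremes using `caret::knnreg()` (assumes `ames_train` from the Prerequisites slide; the two features and the large `k` are arbitrary illustration choices):

```r
new_home <- ames_train[1, ]

# k = 1: the prediction is simply the nearest home's Sale_Price
fit_1 <- caret::knnreg(Sale_Price ~ Gr_Liv_Area + Year_Built,
                       data = ames_train, k = 1)
predict(fit_1, new_home)

# with a very large k the prediction approaches the training mean
fit_big <- caret::knnreg(Sale_Price ~ Gr_Liv_Area + Year_Built,
                         data = ames_train, k = 500)
predict(fit_big, new_home)
mean(ames_train$Sale_Price)   # the k = n limit
```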
]

.pull-right[

```r
# Create blueprint
blueprint <- recipe(Attrition ~ ., data = churn_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(contains("Satisfaction")) %>%
  step_integer(WorkLifeBalance) %>%
  step_integer(JobInvolvement) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

# Create a resampling method
cv <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Create a hyperparameter grid search
hyper_grid <- expand.grid(
  k = floor(seq(1, nrow(churn_train)/3, length.out = 20))
)

# Fit knn model and perform grid search
knn_grid <- train(
  blueprint,
  data = churn_train,
  method = "knn",
  trControl = cv,
  tuneGrid = hyper_grid,
  metric = "ROC"
)

ggplot(knn_grid)
```

<img src="08-knn-slides_files/figure-html/range-k-values-1.png" style="display: block; margin: auto;" />

]]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

]

.pull-right[

```r
set.seed(123)
index <- sample(nrow(mnist$train$images), size = 10000)
mnist_x <- mnist$train$images[index, ]
mnist_y <- factor(mnist$train$labels[index])
```

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

]

.pull-right[

```r
mnist_x %>%
  as.data.frame() %>%
  purrr::map_df(sd) %>%
  tidyr::gather(feature, sd) %>%
  ggplot(aes(sd)) +
  geom_histogram(binwidth = 1)
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/mnist-plot-nzv-feature-image-1.png" alt="Example images (A)-(C) from our data set and (D) highlights near-zero variance features around the edges of our images." />
<p class="caption">Example images (A)-(C) from our data set and (D) highlights near-zero variance features around the edges of our images.</p>
</div>

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

* By removing these zero (or near-zero) variance features, we end up keeping 249 of the original 784 predictors, which can cause dramatic improvements to both the accuracy and speed of our algorithm

]

.pull-right[

```r
# Rename features
colnames(mnist_x) <- paste0("V", 1:ncol(mnist_x))

# Remove near zero variance features manually
nzv <- nearZeroVar(mnist_x)
index <- setdiff(1:ncol(mnist_x), nzv)
mnist_x <- mnist_x[, index]
```

]

---
# MNIST

.scrollable90[

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

* By removing these zero (or near-zero) variance features, we end up keeping 249 of the original 784 predictors, which can cause dramatic improvements to both the accuracy and speed of our algorithm

* .bold.red[Warning]: Our hyperparameter grid search assesses 12 k values ranging from 3 to 25 and takes approximately 3 minutes.
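Once the grid search on the right finishes, a quick way to pull out the winning `k` and the resampled accuracy for every value assessed (a small sketch; `knn_mnist` is the `caret::train()` object created in the code to the right):

```r
knn_mnist$bestTune                       # k selected by the grid search
knn_mnist$results[, c("k", "Accuracy")]  # validation accuracy for each k
```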
]

.pull-right[

```r
# Use train/validate resampling method
cv <- trainControl(
  method = "LGOCV",
  p = 0.7,
  number = 1,
  savePredictions = TRUE
)

# Create a hyperparameter grid search
hyper_grid <- expand.grid(k = seq(3, 25, by = 2))

# Execute grid search
knn_mnist <- train(
  mnist_x,
  mnist_y,
  method = "knn",
  tuneGrid = hyper_grid,
  preProc = c("center", "scale"),
  trControl = cv
)

ggplot(knn_mnist)
```

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/mnist-initial-model-1.png" alt="KNN search grid results for the MNIST data" />
<p class="caption">KNN search grid results for the MNIST data</p>
</div>

]]

---
# Results

.pull-left[

* 94% accuracy rate

* hardest to detect (lowest sensitivity)

   - 8s
   - 4s
   - 5s
   - 2s
   - 3s

]

.pull-right[

```r
# Create confusion matrix
cm <- confusionMatrix(knn_mnist$pred$pred, knn_mnist$pred$obs)
cm$byClass[, c(1:2, 11)]  # sensitivity, specificity, & balanced accuracy
##          Sensitivity Specificity Balanced Accuracy
## Class: 0   0.9641638   0.9962374         0.9802006
## Class: 1   0.9916667   0.9841210         0.9878938
## Class: 2   0.9155666   0.9955114         0.9555390
## Class: 3   0.9163952   0.9920325         0.9542139
## Class: 4   0.8698630   0.9960538         0.9329584
## Class: 5   0.9151404   0.9914891         0.9533148
## Class: 6   0.9795322   0.9888684         0.9842003
## Class: 7   0.9326520   0.9896962         0.9611741
## Class: 8   0.8224382   0.9978798         0.9101590
## Class: 9   0.9329897   0.9852687         0.9591292
```

]

---
# Visualizing correct & incorrect predictions

.scrollable90[

```r
# Get a few accurate predictions
set.seed(9)
good <- knn_mnist$pred %>%
  filter(pred == obs) %>%
  sample_n(4)

# Get a few inaccurate predictions
set.seed(9)
bad <- knn_mnist$pred %>%
  filter(pred != obs) %>%
  sample_n(4)

combine <- bind_rows(good, bad)

# Get original feature set with all pixel features
set.seed(123)
index <- sample(nrow(mnist$train$images), 10000)
X <- mnist$train$images[index,]

# Plot results
par(mfrow = c(4, 2), mar = c(1, 1, 1, 1))
layout(matrix(seq_len(nrow(combine)), 4, 2, byrow = FALSE))
for(i in seq_len(nrow(combine))) {
  image(matrix(X[combine$rowIndex[i],], 28, 28)[, 28:1],
        col = gray(seq(0, 1, 0.05)),
        main = paste("Actual:", combine$obs[i], " ",
                     "Predicted:", combine$pred[i]),
        xaxt = "n", yaxt = "n")
}
```

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/correct-vs-incorrect-1.png" alt="Actual images from the MNIST data set along with our KNN model's predictions. Left column illustrates a few accurate predictions and the right column illustrates a few inaccurate predictions." />
<p class="caption">Actual images from the MNIST data set along with our KNN model's predictions. Left column illustrates a few accurate predictions and the right column illustrates a few inaccurate predictions.</p>
</div>

]

---
# Summary

- KNNs are a simple and intuitive algorithm that can provide average to decent predictive power, especially when the response depends on the local structure of the features

- A major drawback of KNNs is their computation time, which increases by `\(n \times p\)` for each observation

- Furthermore, since KNNs are lazy learners, they require the model to be run at prediction time, which limits their use for real-time modeling

- Although KNNs rarely provide the best predictive performance, they have many benefits, for example, in feature engineering and in data cleaning and preprocessing

---
class: clear, center, middle, hide-logo

background-image: url(images/any-questions.jpg)
background-position: center
background-size: cover

---
# Back home

<br><br><br><br>
[.center[
<i class="fas fa-home fa-10x"></i>
]](https://github.com/misk-data-science/misk-homl) .center[https://github.com/misk-data-science/misk-homl]