class: misk-title-slide

<br><br><br><br><br>
# .font140[K-nearest Neighbor]

---
# Prerequisites

.pull-left[

```r
# Helper packages
library(dplyr)      # for data wrangling
library(ggplot2)    # for awesome graphics
library(rsample)    # for creating validation splits
library(recipes)    # for feature engineering

# Modeling packages
library(caret)      # for fitting KNN models
```

]

.pull-right[

```r
# Ames housing data
ames <- AmesHousing::make_ames()
set.seed(123)
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)

# create training (70%) set for the rsample::attrition data
attrit <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)
set.seed(123)
churn_split <- initial_split(attrit, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)

# import MNIST training data
mnist <- dslabs::read_mnist()
names(mnist)
## [1] "train" "test"
```

]

---
# Measuring similarity

.pull-left[

* The KNN algorithm identifies `\(k\)` observations that are "similar" or nearest to the new record being predicted

* Real estate analogy --> realtors determine a home's listing (or market) price based on "comps" (comparable homes)

* Comps are homes with very similar attributes to the one being sold (e.g., square footage, number of rooms, style of the home, neighborhood, and school district)

]

.pull-right[
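To make the analogy concrete, here is a minimal base-R sketch (not the `caret` workflow used later) that prices a "new" home by averaging the `Sale_Price` of its five nearest comps; the two features and `k = 5` are arbitrary choices for illustration:

```r
feats <- c("Gr_Liv_Area", "Year_Built")
train_std <- scale(ames_train[, feats])   # put both features on a common scale
new_home  <- train_std[1, ]               # pretend the first row is a new listing

# Euclidean distance from the new home to every training home
d <- sqrt(rowSums(sweep(train_std, 2, new_home)^2))

# average Sale_Price of the 5 nearest comps (skip the home itself)
k  <- 5
nn <- order(d)[2:(k + 1)]
mean(ames_train$Sale_Price[nn])
```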
]

---
# Distance measures

How do we determine the similarity between observations (or homes as in the previous example)?

.pull-left[

`\begin{equation}
\text{Euclidean: }\sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}
\end{equation}`

]

.pull-right[

`\begin{equation}
\text{Manhattan: }\sum^P_{j=1} | x_{aj} - x_{bj} |
\end{equation}`

]

To illustrate, let's look at two homes:

```r
(two_houses <- ames_train[1:2, c("Gr_Liv_Area", "Year_Built")])
## # A tibble: 2 x 2
##   Gr_Liv_Area Year_Built
##         <int>      <int>
## 1        1656       1960
## 2         896       1961
```

---
# Distance measures

How do we determine the similarity between observations (or homes as in the previous example)?

.pull-left[

`\begin{equation}
\text{Euclidean: }\sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}
\end{equation}`

```r
dist(two_houses, method = "euclidean")
##        1
## 2 760.0007
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

]

.pull-right[

`\begin{equation}
\text{Manhattan: }\sum^P_{j=1} | x_{aj} - x_{bj} |
\end{equation}`

```r
dist(two_houses, method = "manhattan")
##     1
## 2 761
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

]

---
# Pre-processing

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.

* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

]

.pull-right[

```r
home1
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home1             4       2008   423
home2
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home2             2       2008   424
home3
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <int>      <int> <int>
## 1 home3             3       1998     6
```

]

---
# Pre-processing

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.

* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

]

.pull-right[

```r
features <- c("Bedroom_AbvGr", "Year_Built")

# distance between home 1 and 2
dist(rbind(home1[,features], home2[,features]))
##   1
## 2 2

# distance between home 1 and 3
dist(rbind(home1[,features], home3[,features]))
##          1
## 2 10.04988
```

]

<br>

.center.bold[The Euclidean distance between `home1` and `home3` is larger than that between `home1` and `home2` because the 10-year gap in `Year_Built` dominates the unscaled distance, even though `home1` and `home2` differ more in bedrooms.]

---
# Pre-processing

.scrollable90[

.pull-left[

* Due to the squaring in the Euclidean distance function, the Euclidean distance is more sensitive to outliers.

* Furthermore, most distance measures are sensitive to the scale of the features.
* Data with features that have different scales will bias the distance measures as those predictors with the largest values will contribute most to the distance between two samples.

* For example, consider the three homes below: `home1` is a four-bedroom home built in 2008, `home2` is a two-bedroom home built the same year, and `home3` is a three-bedroom home built a decade earlier.

* .bold[Standardizing eliminates these magnitude differences.]

]

.pull-right[

```r
home1_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home1          1.38       1.21   423
home2_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home2         -1.03       1.21   424
home3_std
## # A tibble: 1 x 4
##   home  Bedroom_AbvGr Year_Built    id
##   <chr>         <dbl>      <dbl> <int>
## 1 home3         0.176      0.881     6

# distance between home 1 and 2
dist(rbind(home1_std[,features], home2_std[,features]))
##          1
## 2 2.416244

# distance between home 1 and 3
dist(rbind(home1_std[,features], home3_std[,features]))
##          1
## 2 1.252547
```

]]

---
# Choosing K

.scrollable90[

.pull-left[

* `\(k\)` is our one hyperparameter!

* When `\(k = 1\)`, we base our prediction on a single observation that has the closest distance measure.

* When `\(k = n\)`, we are simply using the average (regression) or most common class (classification) across all training samples as our predicted value (see the sketch below these bullets).

* No general rule about the best `\(k\)` as it depends greatly on the nature of the data.

* For high-signal data with very few noisy (irrelevant) features, smaller values of `\(k\)` tend to work best. As more irrelevant features are involved, larger values of `\(k\)` are required to smooth out the noise.

* .bold[Pro Tip]: When using KNN for classification, it is best to assess odd numbers for `\(k\)` to avoid ties in the event there is an equal proportion of response levels.
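A quick sketch of these two extremes using `caret::knnreg()` (assumes `ames_train` from the Prerequisites slide; the two features and the large `k` are arbitrary illustration choices):

```r
new_home <- ames_train[1, ]

# k = 1: the prediction is simply the nearest home's Sale_Price
fit_1 <- caret::knnreg(Sale_Price ~ Gr_Liv_Area + Year_Built,
                       data = ames_train, k = 1)
predict(fit_1, new_home)

# with a very large k the prediction approaches the training mean
fit_big <- caret::knnreg(Sale_Price ~ Gr_Liv_Area + Year_Built,
                         data = ames_train, k = 500)
predict(fit_big, new_home)
mean(ames_train$Sale_Price)   # the k = n limit
```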
]

.pull-right[

```r
# Create blueprint
blueprint <- recipe(Attrition ~ ., data = churn_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(contains("Satisfaction")) %>%
  step_integer(WorkLifeBalance) %>%
  step_integer(JobInvolvement) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

# Create a resampling method
cv <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Create a hyperparameter grid search
hyper_grid <- expand.grid(
  k = floor(seq(1, nrow(churn_train)/3, length.out = 20))
)

# Fit knn model and perform grid search
knn_grid <- train(
  blueprint,
  data = churn_train,
  method = "knn",
  trControl = cv,
  tuneGrid = hyper_grid,
  metric = "ROC"
)

ggplot(knn_grid)
```

<img src="08-knn-slides_files/figure-html/range-k-values-1.png" style="display: block; margin: auto;" />

]]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

]

.pull-right[

```r
set.seed(123)
index <- sample(nrow(mnist$train$images), size = 10000)
mnist_x <- mnist$train$images[index, ]
mnist_y <- factor(mnist$train$labels[index])
```

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

]

.pull-right[

```r
mnist_x %>%
  as.data.frame() %>%
  purrr::map_df(sd) %>%
  tidyr::gather(feature, sd) %>%
  ggplot(aes(sd)) +
  geom_histogram(binwidth = 1)
```

<img src="08-knn-slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/mnist-plot-nzv-feature-image-1.png" alt="Example images (A)-(C) from our data set and (D) highlights near-zero variance features around the edges of our images." />
<p class="caption">Example images (A)-(C) from our data set and (D) highlights near-zero variance features around the edges of our images.</p>
</div>

]

---
# MNIST

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

* By removing these zero (or near-zero) variance features, we end up keeping 249 of the original 784 predictors, which can cause dramatic improvements to both the accuracy and speed of our algorithm

]

.pull-right[

```r
# Rename features
colnames(mnist_x) <- paste0("V", 1:ncol(mnist_x))

# Remove near zero variance features manually
nzv <- nearZeroVar(mnist_x)
index <- setdiff(1:ncol(mnist_x), nzv)
mnist_x <- mnist_x[, index]
```

]

---
# MNIST

.scrollable90[

.pull-left[

* Due to size, let's just take a subset

* Lots of near-zero variance features

* By removing these zero (or near-zero) variance features, we end up keeping 249 of the original 784 predictors, which can cause dramatic improvements to both the accuracy and speed of our algorithm

* .bold.red[Warning]: Our hyperparameter grid search assesses 12 k values ranging from 3 to 25 and takes approximately 3 minutes.
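Once the grid search on the right finishes, a quick way to pull out the winning `k` and the resampled accuracy for every value assessed (a small sketch; `knn_mnist` is the `caret::train()` object created in the code to the right):

```r
knn_mnist$bestTune                       # k selected by the grid search
knn_mnist$results[, c("k", "Accuracy")]  # validation accuracy for each k
```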
]

.pull-right[

```r
# Use train/validate resampling method
cv <- trainControl(
  method = "LGOCV",
  p = 0.7,
  number = 1,
  savePredictions = TRUE
)

# Create a hyperparameter grid search
hyper_grid <- expand.grid(k = seq(3, 25, by = 2))

# Execute grid search
knn_mnist <- train(
  mnist_x,
  mnist_y,
  method = "knn",
  tuneGrid = hyper_grid,
  preProc = c("center", "scale"),
  trControl = cv
)

ggplot(knn_mnist)
```

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/mnist-initial-model-1.png" alt="KNN search grid results for the MNIST data" />
<p class="caption">KNN search grid results for the MNIST data</p>
</div>

]]

---
# Results

.pull-left[

* 94% accuracy rate

* hardest to detect (lowest sensitivity)

   - 8s
   - 4s
   - 5s
   - 2s
   - 3s

]

.pull-right[

```r
# Create confusion matrix
cm <- confusionMatrix(knn_mnist$pred$pred, knn_mnist$pred$obs)
cm$byClass[, c(1:2, 11)]  # sensitivity, specificity, & balanced accuracy
##          Sensitivity Specificity Balanced Accuracy
## Class: 0   0.9641638   0.9962374         0.9802006
## Class: 1   0.9916667   0.9841210         0.9878938
## Class: 2   0.9155666   0.9955114         0.9555390
## Class: 3   0.9163952   0.9920325         0.9542139
## Class: 4   0.8698630   0.9960538         0.9329584
## Class: 5   0.9151404   0.9914891         0.9533148
## Class: 6   0.9795322   0.9888684         0.9842003
## Class: 7   0.9326520   0.9896962         0.9611741
## Class: 8   0.8224382   0.9978798         0.9101590
## Class: 9   0.9329897   0.9852687         0.9591292
```

]

---
# Visualizing correct & incorrect predictions

.scrollable90[

```r
# Get a few accurate predictions
set.seed(9)
good <- knn_mnist$pred %>%
  filter(pred == obs) %>%
  sample_n(4)

# Get a few inaccurate predictions
set.seed(9)
bad <- knn_mnist$pred %>%
  filter(pred != obs) %>%
  sample_n(4)

combine <- bind_rows(good, bad)

# Get original feature set with all pixel features
set.seed(123)
index <- sample(nrow(mnist$train$images), 10000)
X <- mnist$train$images[index,]

# Plot results
par(mfrow = c(4, 2), mar = c(1, 1, 1, 1))
layout(matrix(seq_len(nrow(combine)), 4, 2, byrow = FALSE))
for(i in seq_len(nrow(combine))) {
  image(matrix(X[combine$rowIndex[i],], 28, 28)[, 28:1],
        col = gray(seq(0, 1, 0.05)),
        main = paste("Actual:", combine$obs[i], " ",
                     "Predicted:", combine$pred[i]),
        xaxt = "n", yaxt = "n")
}
```

<div class="figure" style="text-align: center">
<img src="08-knn-slides_files/figure-html/correct-vs-incorrect-1.png" alt="Actual images from the MNIST data set along with our KNN model's predictions. Left column illustrates a few accurate predictions and the right column illustrates a few inaccurate predictions." />
<p class="caption">Actual images from the MNIST data set along with our KNN model's predictions. Left column illustrates a few accurate predictions and the right column illustrates a few inaccurate predictions.</p>
</div>

]

---
# Summary

- KNNs are a simple and intuitive algorithm that can provide average to decent predictive power, especially when the response depends on the local structure of the features

- A major drawback of KNNs is their computation time, which increases by `\(n \times p\)` for each observation

- Furthermore, since KNNs are lazy learners, they require the model to be run at prediction time, which limits their use for real-time modeling

- Although KNNs rarely provide the best predictive performance, they have many benefits, for example, in feature engineering and in data cleaning and preprocessing

---
class: clear, center, middle, hide-logo

background-image: url(images/any-questions.jpg)
background-position: center
background-size: cover

---
# Back home

<br><br><br><br>
[.center[
<i class="fas fa-home fa-10x"></i>
]](https://github.com/misk-data-science/misk-homl) .center[https://github.com/misk-data-science/misk-homl]