Lesson 1.9: Supervised Learning: K-Nearest Neighbours
KNN is a simple yet powerful instance-based (or lazy learning) algorithm used for classification and regression. For classification, it assigns a class label to an unlabeled example based on the majority class among its closest neighbors in the training data.
Nearest Neighbor Classification
- Classifies an unlabeled example by assigning it the class of the most similar labeled examples.
- 1-NN: The simplest case where k=1. The label is assigned based on the single closest neighbor.
K-Nearest Neighbors (KNN) Classification
- Generalizes the idea by considering k closest neighbors.
- The class label is determined by majority voting among the k neighbors.
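To make the voting idea concrete, here is a minimal sketch using scikit-learn (assuming scikit-learn is installed; the toy data is invented purely for illustration):

```python
# Minimal KNN classification sketch (assumes scikit-learn is installed).
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per example, two classes (0 and 1).
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5], [7.0, 7.5]]
y_train = [0, 0, 1, 1, 0, 1]

# k=3: each prediction is the majority class among the 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # "training" just stores the data (lazy learning)

print(knn.predict([[1.1, 1.0], [6.5, 8.2]]))  # -> [0 1]
```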
Calculating Distance
KNN relies on distance metrics to determine similarity. Common distance measures include:
Manhattan Distance (L1 Norm)
- Sum of absolute differences between coordinates: d(x, y) = Σ_i |x_i - y_i|.
Euclidean Distance (L2 Norm)
- Straight-line distance between points in space: d(x, y) = √(Σ_i (x_i - y_i)²).
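For concreteness, a small sketch of both metrics in plain Python (the function names are illustrative, not taken from any particular library):

```python
# Sketch of the two distance metrics; point_a and point_b are equal-length
# sequences of numeric feature values.

def manhattan_distance(point_a, point_b):
    # L1 norm: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(point_a, point_b))

def euclidean_distance(point_a, point_b):
    # L2 norm: straight-line distance.
    return sum((a - b) ** 2 for a, b in zip(point_a, point_b)) ** 0.5

print(manhattan_distance([1, 2], [4, 6]))  # 3 + 4 = 7
print(euclidean_distance([1, 2], [4, 6]))  # sqrt(9 + 16) = 5.0
```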
Choosing the Right k
Small k (e.g., k = 1)
- Highly sensitive to noise (overfitting).
- Decision boundary is more complex.
Large k
- Smoother decision boundaries (underfitting).
- Computationally expensive.
Common Practices for Selecting k
- Typically set between 3 and 10.
- A common heuristic: k ≈ √n, where n is the number of training samples.
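A quick sketch of the square-root heuristic (rounding to an odd value is an extra convention assumed here to reduce voting ties; it is not part of the heuristic itself):

```python
import math

def heuristic_k(n_samples):
    # k ≈ sqrt(n); round, then force odd to reduce the chance of voting ties.
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

print(heuristic_k(100))  # sqrt(100) = 10 -> 11 (made odd)
print(heuristic_k(25))   # sqrt(25)  = 5  -> 5
```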
Min-Max Normalization
Since KNN is distance-based, features should be normalized so that features with large numeric ranges do not dominate the distance calculation.
- Scales all features to [0, 1]: x' = (x - min) / (max - min).
- Ensures equal contribution from all features in distance calculation.
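A minimal sketch of column-wise min-max normalization, assuming NumPy is available (the divide-by-zero guard for constant features is my own addition):

```python
import numpy as np

def min_max_normalize(X):
    # Scale each feature (column) to [0, 1]: x' = (x - min) / (max - min).
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max - col_min == 0, 1.0, col_max - col_min)  # guard constant columns
    return (X - col_min) / span

X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
print(min_max_normalize(X))
# [[0.   0. ]
#  [0.5  0.5]
#  [1.   1. ]]
```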
Lazy Learning (Instance-Based Learning)
- No explicit training phase: Simply stores the training data ("rote learning").
- Prediction phase is slower: Requires computing distances to all training points for each query.
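To make the lazy-learning behaviour concrete, here is an illustrative from-scratch sketch (my own code, not a library API): fit() only memorizes the data, and every distance computation happens at prediction time:

```python
from collections import Counter

class LazyKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just memorizing the examples (rote learning).
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, query):
        # All the real work happens at query time: compute a distance to every
        # stored point, then majority-vote over the labels of the k nearest.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label)
            for x, label in zip(self.X, self.y)
        )
        top_k = [label for _, label in dists[:self.k]]
        return Counter(top_k).most_common(1)[0][0]

clf = LazyKNN(k=3).fit([[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"])
print(clf.predict_one([0.2, 0.4]))  # "a" -- 2 of the 3 nearest points are class "a"
```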
Advantages:
- Adapts immediately to new data.
- No assumptions about data distribution.
Disadvantages:
- Computationally expensive for large datasets.
- Sensitive to irrelevant/noisy features.