Lesson 1.9: Supervised Learning: K-Nearest Neighbours


KNN is a simple yet powerful instance-based (or lazy learning) algorithm used for classification and regression. For classification, it assigns a class label to an unlabeled example based on the majority class among its k closest neighbors in the training data.

Nearest Neighbor Classification

  • Classifies an unlabeled example by assigning it the class of the most similar labeled examples.
  • 1-NN: The simplest case where k=1. The label is assigned based on the single closest neighbor.

K-Nearest Neighbors (KNN) Classification

  • Generalizes the idea by considering k closest neighbors.
  • The class label is determined by majority voting among the k neighbors (see the sketch below).
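
To make the voting procedure concrete, here is a minimal from-scratch sketch (the function and variable names are illustrative, not from the lesson): it measures the Euclidean distance from a query point to every training example, keeps the k closest, and returns the majority class among them.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training example (math.dist is the L2 distance).
    distances = [(math.dist(query, x), label) for x, label in zip(X_train, y_train)]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote over their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes.
X_train = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, query=(1.1, 0.9), k=3))  # -> "A"
```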

Calculating Distance

KNN relies on distance metrics to determine similarity. Common distance measures include:

Manhattan Distance (L1 Norm)

d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|

  • Sum of absolute differences between coordinates.

Euclidean Distance (L2 Norm)

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  • Straight-line distance between points in space.
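
As a quick illustration (the helper names below are my own, not part of the lesson), both metrics can be computed directly from their definitions:

```python
def manhattan(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    """L2 norm: straight-line distance between points."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

a, b = (1, 2, 3), (4, 6, 3)
print(manhattan(a, b))  # |1-4| + |2-6| + |3-3| = 7
print(euclidean(a, b))  # sqrt(9 + 16 + 0) = 5.0
```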

Choosing the Right k

Small k (e.g., k = 1)

  • Highly sensitive to noise (overfitting).
  • Decision boundary is more complex.

Large k

  • Smoother decision boundaries (underfitting).
  • Slightly more computation per prediction, since more neighbors must be found and counted in the vote.

Common Practices for Selecting k:

  • Typically set between 3 and 10.
  • A heuristic: k ≈ √n, where n is the number of training samples (see the sketch below).
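
A short sketch of both practices, assuming scikit-learn and its bundled iris dataset are available (the particular k values tried are illustrative):

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Heuristic starting point: k ≈ sqrt(n).
k_heuristic = round(math.sqrt(len(X)))
print("sqrt(n) heuristic:", k_heuristic)  # ~12 for n = 150

# In practice, compare a few values of k with cross-validation.
for k in (3, 5, 7, 9):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k}: mean accuracy {score:.3f}")
```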

Min-Max Normalization

Since KNN is distance-based, features should be normalized to prevent bias from large-valued features:

X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

  • Scales all features to [0,1].
  • Ensures equal contribution from all features in the distance calculation (see the sketch below).
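
A minimal NumPy sketch of the formula above (the array values are made up; scikit-learn's MinMaxScaler performs the same transformation):

```python
import numpy as np

# e.g. age and income: two features on very different scales (made-up values)
X = np.array([[30.0, 40000.0],
              [45.0, 90000.0],
              [25.0, 60000.0]])

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)  # each column now lies in [0, 1]
print(X_norm)
```

In practice, X_min and X_max should be computed on the training data only and then reused to scale new queries.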

Lazy Learning (Instance-Based Learning)

  • No explicit training phase: Simply stores the training data ("rote learning").
  • Prediction phase is slower: Requires computing distances to all training points for each query (see the sketch after this list).
  • Advantages:

    • Adapts immediately to new data.
    • No assumptions about data distribution.
  • Disadvantages:

    • Computationally expensive for large datasets.
    • Sensitive to irrelevant/noisy features.
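
To see this behaviour in practice, here is a hedged sketch assuming scikit-learn: fit() essentially just stores the (scaled) training data, and the distance computation and voting happen at prediction time.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalize using statistics from the training split only.
scaler = MinMaxScaler().fit(X_train)

# "Training" is just storing the scaled examples ...
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)

# ... the real work (distances + majority vote) happens at prediction time.
print(f"test accuracy: {knn.score(scaler.transform(X_test), y_test):.3f}")
```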
