Lesson 7.2: Vision Transformers (ViTs)


Vision Transformers (ViTs) adapt the Transformer architecture (originally designed for NLP) to process images by treating them as sequences of patches. Here’s a step-by-step breakdown, followed by a minimal code sketch after the list:

  • 1. Image Patch Embedding
    • An input image is split into fixed-size, non-overlapping patches (e.g., 16×16 pixels); a 224×224 image therefore yields 14×14 = 196 patches.
    • Each patch is flattened into a 1D vector and linearly projected into a patch embedding (like word embeddings in NLP).
  • 2. Positional Embeddings
    • Since Transformers lack inherent spatial awareness, learned positional embeddings are added to patch embeddings to retain location information.
  • 3. Class Token (Optional)
    • A special [CLS] token (borrowed from BERT) is prepended to the sequence. Its final state serves as the global image representation for classification.
  • 4. Transformer Encoder
    • The sequence (patch embeddings + [CLS]) is fed into a standard Transformer encoder:
      • Multi-Head Self-Attention: Patches "attend" to each other to capture global relationships.
      • MLP Layers: Non-linear transformations for feature refinement.
      • Layer Normalization & Residual Connections: Stabilize training.
  • 5. Classification Head
    • The [CLS] token’s output is passed through an MLP for tasks like image classification.
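
The sketch below ties the five steps together. It is a minimal PyTorch sketch, assuming a 224×224 RGB input, 16×16 patches, and ViT-Base-like hyperparameters (768-dim embeddings, 12 heads, 12 layers); the class name `SimpleViT` and the use of `nn.TransformerEncoder` are illustrative choices, not the reference implementation.

```python
# Minimal Vision Transformer sketch (illustrative, not the original ViT code).
import torch
import torch.nn as nn


class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

        # 1. Patch embedding: a strided convolution splits the image into
        #    non-overlapping patches and linearly projects each one.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # 2. Learned positional embeddings: one per patch, plus one slot for
        #    the [CLS] token added in step 3.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # 3. Learnable [CLS] token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

        # 4. Standard Transformer encoder: multi-head self-attention + MLP,
        #    with layer normalization and residual connections in each layer.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # 5. Classification head on top of the [CLS] token's final state.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.patch_embed(x)              # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS] -> (B, 197, 768)
        x = x + self.pos_embed               # add positional embeddings
        x = self.encoder(x)                  # global self-attention over patches
        return self.head(x[:, 0])            # classify from the [CLS] token


# Quick shape check with random data (shallow depth just for the demo).
model = SimpleViT(depth=2)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)                          # torch.Size([2, 1000])
```

The strided convolution in step 1 is a common shortcut: it is mathematically equivalent to flattening each 16×16×3 patch and applying a shared linear projection, as described in step 1 above.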
