Lesson 7.2: Vision Transformers (ViTs)
Vision Transformers (ViTs) adapt the Transformer architecture (originally designed for NLP) to process images by treating them as sequences of patches. Here’s a step-by-step breakdown:
- 1. Image Patch Embedding
- An input image is split into fixed-size non-overlapping patches (e.g., 16×16 pixels).
- Each patch is flattened into a 1D vector and linearly projected into a patch embedding (like word embeddings in NLP).
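A minimal PyTorch sketch of patch embedding, assuming a ViT-Base-like configuration (224×224 images, 16×16 patches, 768-dim embeddings); the class name and defaults are illustrative, not a specific library API:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768)
        return x
```

With 224×224 inputs and 16×16 patches this yields (224/16)² = 196 patch tokens per image.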
- 2. Positional Embeddings
- Since Transformers lack inherent spatial awareness, learned positional embeddings are added to patch embeddings to retain location information.
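A small sketch of learned positional embeddings, assuming the same 196 patches and 768-dim embeddings as above (an extra slot is reserved for the [CLS] token in the next step); tensor names are illustrative:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learned positional embeddings: one trainable vector per patch position.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)

patch_embeddings = torch.randn(8, num_patches, embed_dim)  # dummy batch of 8
tokens = patch_embeddings + pos_embed                      # broadcast add over the batch
```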
- 3. Class Token (Optional)
- A special [CLS] token (borrowed from BERT) is prepended to the sequence. Its final state serves as the global image representation for classification.
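A sketch of prepending the [CLS] token, under the same assumed dimensions (batch of 8, 196 patches, 768-dim embeddings):

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim = 8, 196, 768

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))       # learned [CLS] embedding
patch_tokens = torch.randn(batch_size, num_patches, embed_dim)

# Prepend the [CLS] token to every sequence in the batch: 196 -> 197 tokens.
cls_tokens = cls_token.expand(batch_size, -1, -1)
sequence = torch.cat([cls_tokens, patch_tokens], dim=1)      # (8, 197, 768)

# Positional embeddings then cover all 197 positions, [CLS] included.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
sequence = sequence + pos_embed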
- 4. Transformer Encoder
- The sequence (patch embeddings + [CLS]) is fed into a standard Transformer encoder:
- Multi-Head Self-Attention: Patches "attend" to each other to capture global relationships.
- MLP Layers: Non-linear transformations for feature refinement.
- Layer Normalization & Residual Connections: Stabilize training.
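A sketch of one encoder block using PyTorch's built-in multi-head attention, assuming a pre-norm layout with 12 heads and a 4× MLP expansion (typical ViT-Base settings); the class name is illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):                        # x: (B, 197, 768)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)         # every token attends to every other token
        x = x + attn_out                         # residual connection
        x = x + self.mlp(self.norm2(x))          # residual connection
        return x
```

A full ViT encoder stacks several of these blocks (12 in ViT-Base) over the token sequence from the previous steps.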
- 5. Classification Head
- The [CLS] token’s output is passed through an MLP for tasks like image classification.
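A sketch of the classification head, assuming 1000 output classes and the 197-token, 768-dim encoder output from the steps above; the head shown here (LayerNorm followed by a single linear layer) is one common choice:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000

# After the encoder, only the [CLS] token's output is used as the image representation.
head = nn.Sequential(
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, num_classes),
)

encoded = torch.randn(8, 197, embed_dim)    # dummy encoder output: batch of 8, 197 tokens
cls_output = encoded[:, 0]                  # (8, 768) -- the [CLS] token sits at position 0
logits = head(cls_output)                   # (8, 1000) class scores
```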