Lesson 7.2: Vision Transformers (ViTs)
Vision Transformers (ViTs) adapt the Transformer architecture (originally designed for NLP) to process images by treating them as sequences of patches. Here’s a step-by-step breakdown:
- 1. Image Patch Embedding
- An input image is split into fixed-size non-overlapping patches (e.g., 16×16 pixels).
- Each patch is flattened into a 1D vector and linearly projected into a patch embedding (like word embeddings in NLP).
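A minimal PyTorch sketch of patch embedding, assuming a ViT-Base-like configuration (224×224 images, 16×16 patches, 768-dim embeddings); the class name and defaults are illustrative, not a specific library API:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768)
        return x
```

With 224×224 inputs and 16×16 patches this yields (224/16)² = 196 patch tokens per image.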
- 2. Positional Embeddings
- Since Transformers lack inherent spatial awareness, learned positional embeddings are added to patch embeddings to retain location information.
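A small sketch of learned positional embeddings, assuming the same 196 patches and 768-dim embeddings as above (an extra slot is reserved for the [CLS] token in the next step); tensor names are illustrative:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learned positional embeddings: one trainable vector per patch position.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)

patch_embeddings = torch.randn(8, num_patches, embed_dim)  # dummy batch of 8
tokens = patch_embeddings + pos_embed                      # broadcast add over the batch
```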
- 3. Class Token (Optional)
- A special [CLS] token (borrowed from BERT) is prepended to the sequence. Its final state serves as the global image representation for classification.
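A sketch of prepending the [CLS] token, under the same assumed dimensions (batch of 8, 196 patches, 768-dim embeddings):

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim = 8, 196, 768

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))       # learned [CLS] embedding
patch_tokens = torch.randn(batch_size, num_patches, embed_dim)

# Prepend the [CLS] token to every sequence in the batch: 196 -> 197 tokens.
cls_tokens = cls_token.expand(batch_size, -1, -1)
sequence = torch.cat([cls_tokens, patch_tokens], dim=1)      # (8, 197, 768)

# Positional embeddings then cover all 197 positions, [CLS] included.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
sequence = sequence + pos_embed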
- 4. Transformer Encoder
- The sequence (patch embeddings + [CLS]) is fed into a standard Transformer encoder:
- Multi-Head Self-Attention: Patches "attend" to each other to capture global relationships.
- MLP Layers: Non-linear transformations for feature refinement.
- Layer Normalization & Residual Connections: Stabilize training.
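A sketch of one encoder block using PyTorch's built-in multi-head attention, assuming a pre-norm layout with 12 heads and a 4× MLP expansion (typical ViT-Base settings); the class name is illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):                        # x: (B, 197, 768)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)         # every token attends to every other token
        x = x + attn_out                         # residual connection
        x = x + self.mlp(self.norm2(x))          # residual connection
        return x
```

A full ViT encoder stacks several of these blocks (12 in ViT-Base) over the token sequence from the previous steps.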
- 5. Classification Head
- The [CLS] token’s output is passed through an MLP for tasks like image classification.
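A sketch of the classification head, assuming 1000 output classes and the 197-token, 768-dim encoder output from the steps above; the head shown here (LayerNorm followed by a single linear layer) is one common choice:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000

# After the encoder, only the [CLS] token's output is used as the image representation.
head = nn.Sequential(
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, num_classes),
)

encoded = torch.randn(8, 197, embed_dim)    # dummy encoder output: batch of 8, 197 tokens
cls_output = encoded[:, 0]                  # (8, 768) -- the [CLS] token sits at position 0
logits = head(cls_output)                   # (8, 1000) class scores
```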