Lesson 3.6: Attention


Key Issues with Recurrent Models (RNNs/LSTMs)

  • Linear Interaction Distance

    • For distant word pairs (e.g., "The cat [...] sat"), RNNs require O(sequence length) steps to interact.
    • Problem:
      • Vanishing/exploding gradients make it hard to learn long-range dependencies.
      • Linear word order is artificially enforced, while language is hierarchical (e.g., syntax trees).
  • Sequential Computation

    • RNNs process tokens one at a time, creating a bottleneck:
      • Forward/backward passes require O(sequence length) unparallelizable steps.
      • GPUs thrive on parallel computation, but RNNs force sequential dependency.
  • Information Bottleneck

    • The final hidden state of an encoder RNN must compress all source information into a fixed-size vector.
    • Result: Loss of fine-grained details, especially for long sequences.

Sequence-to-Sequence (Seq2Seq) with Attention

Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

  1. Input Processing (Encoder)

    • The input sequence (e.g., a sentence like "How are you?") is fed into the encoder (usually a BiLSTM or Transformer).
    • Each word is converted into a vector (embedding), and the encoder processes them sequentially.
    • Key steps:
      1. Embedding
        • Words → Vectors:
        • "How" → [0.2, -0.5, ...], "are" → [0.7, 0.1, ...], ...
      2. Hidden States:
        • The encoder generates a hidden state h_i for each word, capturing its context:
        • h₁ ("How"), h₂ ("are"), h₃ ("you"), h₄ ("?")
        • Output: A sequence of encoder states h_1, h_2, h_3, h_4
  2. Decoder Initialization

    • The decoder (usually an LSTM) starts with:
    • Initial hidden state: Set to the encoder’s last state s_0 = h_4
    • First input: A special <SOS> (Start-of-Sequence) token (some implementations reuse <EOS> as the start symbol).
    • Purpose: Prepares the decoder to generate the first word of the output (e.g., a translation).
  3. Decoder Step with Attention (Iterative Process)

    • For each output word (e.g., generating Spanish "¿Cómo estás?"):
    • Step 3.1: Compute Attention Weights
      • The decoder calculates how much to "focus" on each encoder state h_i for the current step.
      • Uses the decoder’s previous hidden state s_{t-1} and all the encoder states h_i.
      • Scores: For each encoder state h_i:
        • e_{ti} = \text{score}(s_{t-1}, h_i) = v^T \tanh(W[s_{t-1}; h_i] + b)
      • Weights: Softmax turns scores into probabilities
        • \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})}
        • Example: For the decoder step generating "Cómo", α_{t1} (the weight on "How") might be 0.9.
    • Step 3.2: Create Context Vector
      • The encoder states are weighted by α_{ti} and summed:
      • c_t = \sum_i \alpha_{ti} h_i
      • Example: If generating "Cómo", c_t ≈ h_1 (since "How" → "Cómo").
    • Step 3.3: Update Decoder State
      • Inputs:
        • Previous hidden state s_{t-1}
        • Context vector c_t
        • Previously predicted word y_{t-1} (or <SOS> for the first step)
      • s_t = \text{LSTM}([y_{t-1}; c_t], s_{t-1})
    • Step 3.4: Predict Next Word
      • The decoder predicts a probability distribution over the output vocabulary:
      • p(y_t) = \text{softmax}(W_o s_t + b_o)
    • Repeat: Steps 3.1 - 3.4 until <EOS> (End-of-Sequence) is generated (a minimal sketch of one decoder step follows below).
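The following is a minimal PyTorch sketch of a single decoder step (3.1 - 3.4) with additive attention. The dimensions, tensor contents, and variable names are illustrative assumptions rather than any particular library's API; batching and training details are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
d_h, d_emb, vocab = 256, 128, 10000
enc_states = torch.randn(4, d_h)           # h_1..h_4 from the encoder ("How are you ?")
s_prev = enc_states[-1].clone()            # s_0 = h_4
cell_prev = torch.zeros(d_h)
y_prev = torch.randn(d_emb)                # embedding of <SOS> (or of the previous output word)

# Additive (Bahdanau-style) attention parameters and decoder cell
W = nn.Linear(2 * d_h, d_h)                # acts on [s_{t-1}; h_i]
v = nn.Linear(d_h, 1, bias=False)
decoder = nn.LSTMCell(d_emb + d_h, d_h)
W_o = nn.Linear(d_h, vocab)

# 3.1 scores e_ti and weights alpha_ti
concat = torch.cat([s_prev.expand(4, -1), enc_states], dim=1)
e = v(torch.tanh(W(concat))).squeeze(-1)   # shape (4,)
alpha = torch.softmax(e, dim=0)

# 3.2 context vector c_t = sum_i alpha_ti * h_i
c = (alpha.unsqueeze(1) * enc_states).sum(dim=0)

# 3.3 update decoder state: s_t = LSTM([y_{t-1}; c_t], s_{t-1})
s_t, cell_t = decoder(torch.cat([y_prev, c]).unsqueeze(0),
                      (s_prev.unsqueeze(0), cell_prev.unsqueeze(0)))

# 3.4 predict a distribution over the output vocabulary
p_y = torch.softmax(W_o(s_t), dim=-1)      # shape (1, vocab)
```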

Attention is Great

  • Attention significantly improves NMT performance
    • It’s very useful to allow the decoder to focus on certain parts of the source
  • Attention provides a more “human-like” model of the MT process
    • You can look back at the source sentence while translating, rather than needing to remember it all
  • Attention solves the bottleneck problem
    • Attention allows the decoder to look directly at the source, bypassing the bottleneck
  • Attention helps with the vanishing gradient problem
    • Provides shortcut to faraway states
  • Attention provides some interpretability
    • By inspecting attention distribution, we see what the decoder was focusing on
    • We get (soft) alignment for free!
    • This is cool because we never explicitly trained an alignment system
    • The network just learned alignment by itself

Attention in equations

Why Attention Solves Parallelization & Bottleneck Problems

  • Parallel Computation of Attention Scores
    • All attention scores can be computed simultaneously for the entire sequence:
    • Scores: e_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}} \quad \forall i, j
    • (Where q_i = query for position i, k_j = key for position j)
  • Eliminating Sequential Bottlenecks
    • Unlike RNNs which require O(N) sequential operations, attention computes all positions at once:
    • \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
      • Q, K, V are matrices containing all queries/keys/values
      • A single matrix multiplication replaces sequential processing (see the sketch after this list)
  • Constant Interaction Distance
    • Any two words can interact directly in one layer:
    • Interaction distance = O(1) (not O(N) as in RNNs)
  • Self-Attention Within a Sentence
    • The same mechanism works for encoder self-attention:
    • h_i' = \sum_{j=1}^N \alpha_{ij} W_v h_j \quad \text{where } \alpha_{ij} = \text{softmax}(e_{ij})
      • Each position i attends to all positions j in parallel
      • No information bottleneck - all positions remain accessible
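A minimal PyTorch sketch of the matrix form above, with illustrative shapes: all pairwise scores and all outputs come from a few matrix multiplications instead of N sequential RNN steps.

```python
import math
import torch

N, d_k, d_v = 6, 64, 64                  # sequence length and dimensions (illustrative)
Q = torch.randn(N, d_k)                   # all queries
K = torch.randn(N, d_k)                   # all keys
V = torch.randn(N, d_v)                   # all values

scores = Q @ K.T / math.sqrt(d_k)         # e_ij for all i, j at once -> (N, N)
alpha = torch.softmax(scores, dim=-1)     # one attention distribution per query
out = alpha @ V                           # (N, d_v): every position updated in parallel
```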

Why This Matters

  • Training Speed: Attention layers utilize GPU parallelism fully
  • Model Quality: Direct access to all positions helps learn complex dependencies
  • Scalability: Constant path length regardless of sequence length

Basic Attention Framework

For a query q, values V = \{v_1, ..., v_N\}, and keys K = \{k_1, ..., k_N\}:

  1. Attention Scores: e_i = \text{score}(q, k_i)
  2. Attention Distribution (softmax): \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^N \exp(e_j)}
  3. Context Vector (weighted sum): c = \sum_{i=1}^N \alpha_i v_i
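A tiny worked example with made-up numbers: suppose N = 2 and the scores are e_1 = 2, e_2 = 0. Then α_1 = exp(2) / (exp(2) + exp(0)) ≈ 0.88 and α_2 ≈ 0.12, so the context vector is c ≈ 0.88 v_1 + 0.12 v_2, i.e., mostly v_1 with a small contribution from v_2.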

Common Attention Variants

  1. Dot-Product Attention
    • e_i = q^T k_i
    • Pros: Computationally efficient.
    • Cons: Scores can grow large in magnitude (→ unstable gradients).
  2. Scaled Dot-Product (Transformer Attention)
    • e_i = \frac{q^T k_i}{\sqrt{d_k}}
    • Why scale? Prevents gradient saturation when d_k (key dimension) is large.
  3. Additive Attention
    • e_i = v^T \tanh(W_q q + W_k k_i + b)
    • Pros: More expressive.
    • Cons: Slower (extra parameters W_q, W_k, v).
  4. Multi-Head Attention
    • \text{head}_i = \text{Attention}(W_q^i q, W_k^i K, W_v^i V), \quad \text{MultiHead}(q, K, V) = W_o [\text{head}_1; ...; \text{head}_h]
    • Key idea: Parallel attention heads capture different relationships (a sketch follows this list).
  5. Self-Attention (Special Case)
    • When queries, keys, and values come from the same sequence (q_i, k_i, v_i are all linear transforms of the same input x_i), each token attends to the other tokens of its own sequence.
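To make the multi-head variant concrete, here is a minimal PyTorch sketch of multi-head self-attention. The sizes and weight names are illustrative assumptions, not any specific library's API.

```python
import math
import torch
import torch.nn as nn

d_model, n_heads, N = 512, 8, 10                         # illustrative sizes
d_head = d_model // n_heads

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
W_o = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(N, d_model)                              # self-attention: Q, K, V all come from X

# Project, then split the model dimension into n_heads independent heads
def split_heads(t):
    return t.view(N, n_heads, d_head).transpose(0, 1)    # (n_heads, N, d_head)

Q, K, V = split_heads(W_q(X)), split_heads(W_k(X)), split_heads(W_v(X))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)     # (n_heads, N, N)
alpha = torch.softmax(scores, dim=-1)
heads = alpha @ V                                        # (n_heads, N, d_head)

# Concatenate the heads and mix them with W_o
out = W_o(heads.transpose(0, 1).reshape(N, d_model))     # (N, d_model)
```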

More general definition of attention:

  • Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
  • Intuition:
    • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
    • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
  • Upshot:
    • Attention has become a powerful, flexible, general-purpose way to do pointer and memory manipulation in deep learning models.

Barriers and Solutions for Self-Attention as a Building Block

Barrier 1: Lack of Inherent Order

  • Problem:
    • Self-attention treats inputs as a bag of words—it has no built-in notion of position.
    • For example: "Dog bites man" vs. "Man bites dog" would have identical representations without positional cues.
  • Solution: Positional encodings inject position information into the input embeddings. Two main approaches:
  1. Sinusoidal Positional Encodings
  • Idea: Use fixed, periodic sinusoidal functions to encode positions.
  • Equation:
    • For position pos and dimension i:
    • PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
    • (Where d_{model} = embedding dimension)
    • Intuition:
      • Periodicity: Different frequencies capture varying scales of positional relationships.
      • Extrapolation: Theoretically generalizes to longer sequences (though rarely works in practice).
    • Pros:
      • No learned parameters (fixed).
      • Periodicity may help generalization.
    • Cons:
      • Not adaptive to data.
      • Fails to extrapolate reliably.
  2. Learned Positional Embeddings
    • Idea: Treat positions as learnable vectors (like word embeddings).
    • Equation:
      • PE = \begin{bmatrix} p_1 & p_2 & \dots & p_N \end{bmatrix}^T \quad \text{(learnable matrix)}
      • Each p_i ∈ \mathbb{R}^{d_{model}} is trained via backpropagation.
    • Pros:
      • Adapts to data (e.g., learns syntax-aware positions).
      • Simpler implementation.
    • Cons:
      • Cannot extrapolate beyond the trained sequence length N.
      • Requires more memory.
    • Used By: Most modern Transformers (e.g., BERT, GPT).

How Positional Encodings Are Integrated

  • Step: Add (or concatenate) positional embeddings to input word embeddings:
    • x_i' = x_i + p_i \quad \text{(most common)} \quad \text{or} \quad x_i' = [x_i; p_i]
  • Why Addition?
    • Empirically works better than concatenation (fewer parameters, similar performance).
    • The Transformer’s self-attention can "disentangle" position and content implicitly.
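A minimal PyTorch sketch of both approaches and of the additive integration step. The embedding dimension, vocabulary size, and sentence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, max_len = 512, 100                        # illustrative sizes

# Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), cosine on odd dims
pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
two_i = torch.arange(0, d_model, 2, dtype=torch.float)        # 0, 2, 4, ... = 2i
div = torch.pow(10000.0, two_i / d_model)
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos / div)
pe[:, 1::2] = torch.cos(pos / div)

# Learned alternative: positions are just another embedding table
learned_pe = nn.Embedding(max_len, d_model)

# Integration: add positional information to the token embeddings
tokens = torch.randint(0, 30000, (10,))            # hypothetical token ids for a 10-word sentence
tok_emb = nn.Embedding(30000, d_model)(tokens)     # (10, d_model)
x = tok_emb + pe[:10]                              # or: tok_emb + learned_pe(torch.arange(10))
```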

Barrier 2: Lack of Nonlinearity in Self-Attention

  • Problem: Self-attention is fundamentally a weighted average of value vectors. Without nonlinearities, stacking multiple self-attention layers is equivalent to a single averaged representation:
  • \text{SelfAttention}(X) = \text{softmax}\left(\frac{XW_Q (XW_K)^T}{\sqrt{d_k}}\right) XW_V
  • No "deep learning magic": Pure linear transformations + softmax (which is monotonic) cannot learn complex hierarchical features.
  • Why: Stacking N self-attention layers without intervening nonlinearities just re-averages the value vectors:
    • Each layer applies another attention-weighted average followed by the linear map W_V, so the stack amounts to repeated averaging and linear projection.
    • (No new representational power is gained beyond weighted averaging!)
  • Solution: Add Feedforward Networks (FFNs)
    • Fix: Apply a position-wise FFN after each self-attention layer to introduce nonlinearity.
    • Equation: For each output vector h_i from self-attention:
      • h_i' = \text{ReLU}(h_i W_1 + b_1) W_2 + b_2
      • (Where W_1 ∈ \mathbb{R}^{d \times d_{ff}} and W_2 ∈ \mathbb{R}^{d_{ff} \times d})
    • Key Properties:
      • Nonlinearity: ReLU breaks linearity, enabling hierarchical feature learning.
      • Position-Wise: Applied independently to each token (parallelizable).
    • Why This Works
      • Self-Attention: Mixes information across positions (weighted average).
      • FFN: Processes each position independently with nonlinear transformations.
        • Think of it as a "per-token MLP" that refines features.
      • Analogy:
        • Self-attention = "global communication" (tokens talk to each other).
        • FFN = "local computation" (each token thinks independently).

Barrier 3: Preventing Future Peeking in Decoders

  • Problem: In tasks like language modeling or machine translation, the model must predict the next word without seeing future tokens.

  • Example: When predicting "estás" in "¿Cómo ___?", the model shouldn’t see "estás" or "?" in the input.

  • Challenge: Standard self-attention attends to all tokens in parallel → leaks future info.

  • Solution: Masked Self-Attention

    • Artificially set attention weights to 0 for future positions.
  • Step-by-Step Implementation

    1. Compute Attention Scores (as usual):
      • e_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}}
    2. Apply Mask (for decoder at position i):
      • \tilde{e}_{ij} = \begin{cases} e_{ij} & \text{if } j \leq i \quad \text{(past/present)} \\ -\infty & \text{if } j > i \quad \text{(future)} \end{cases}
      • Effect: Future tokens get a weight of exp(-∞) = 0 after the softmax.
    3. Compute Attention Output:
      • \alpha_{ij} = \text{softmax}(\tilde{e}_{ij}), \quad c_i = \sum_{j=1}^i \alpha_{ij} v_j
  • Why This Works

    • Parallelization: All positions computed simultaneously (unlike RNNs), but future tokens are masked.
    • Causality: Each token only attends to itself and prior tokens.
    • Efficiency: Implemented via a lower-triangular mask (see code below).
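A minimal PyTorch sketch of the lower-triangular masking referred to above; the shapes are illustrative assumptions.

```python
import math
import torch

N, d_k = 5, 64                                # illustrative sizes
Q = torch.randn(N, d_k)
K = torch.randn(N, d_k)
V = torch.randn(N, d_k)

scores = Q @ K.T / math.sqrt(d_k)             # (N, N) raw scores e_ij

# Lower-triangular mask: position i may attend to positions j <= i only
mask = torch.tril(torch.ones(N, N, dtype=torch.bool))
scores = scores.masked_fill(~mask, float('-inf'))

alpha = torch.softmax(scores, dim=-1)         # future positions receive weight 0
out = alpha @ V
```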

The 4 Essential Components of a Self-Attention Building Block

  1. Self-Attention (Core Mechanism)
    • Allows each token to dynamically focus on relevant parts of the sequence.
    • Computes relationships between all pairs of tokens.
    • \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
      • Queries (Q): What each token is looking for.
      • Keys (K): What each token contains.
      • Values (V): Actual content being retrieved.
    • Without self-attention, the model has no way to learn contextual relationships between tokens.
  2. Position Representations
    • Injects information about token order since self-attention is permutation-equivariant.
    • Sinusoidal Encoding (fixed, periodic functions of position; see Barrier 1 above)
    • Learned Embeddings (trainable position vectors, like word embeddings)
    • Implementation: x = token_embeddings + position_embeddings # Additive (most common)
  3. Nonlinearities (Feed-Forward Network)
    • Introduces nonlinear transformations to enable hierarchical feature learning.
    • \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
    • Why It's Needed: Pure self-attention is just weighted averaging - FFNs add representational power.
  4. Masking (for Autoregressive Tasks)
    • Prevents the model from "cheating" by looking at future tokens during training.
    • Why It's Needed: Ensures predictions at position i depend only on tokens 1 to i-1 (crucial for language modeling).

How These Components Work Together

  • Input: Token embeddings + positional encodings.
  • Self-Attention: Computes contextual relationships.
  • Masking: Optional - applied in decoder for autoregressive tasks.
  • FFN: Further processes each token's representation.
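Putting the four components together, here is a minimal single-head sketch of one self-attention building block. Class and parameter names are illustrative assumptions; residual connections and layer normalization, which real Transformer blocks also use, are omitted for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniSelfAttentionBlock(nn.Module):
    """Single-head sketch: positions + self-attention + causal mask + position-wise FFN."""

    def __init__(self, d_model=256, d_ff=1024, max_len=512):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.ffn1 = nn.Linear(d_model, d_ff)
        self.ffn2 = nn.Linear(d_ff, d_model)
        self.pos = nn.Embedding(max_len, d_model)        # learned positional embeddings

    def forward(self, x, causal=True):
        N, d = x.shape
        x = x + self.pos(torch.arange(N))                # 2. position representations
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)  # 1. self-attention
        scores = Q @ K.T / math.sqrt(d)
        if causal:                                       # 4. masking (autoregressive tasks)
            mask = torch.tril(torch.ones(N, N, dtype=torch.bool))
            scores = scores.masked_fill(~mask, float('-inf'))
        h = torch.softmax(scores, dim=-1) @ V
        return self.ffn2(F.relu(self.ffn1(h)))           # 3. nonlinearity (FFN)

block = MiniSelfAttentionBlock()
out = block(torch.randn(10, 256))                        # (10, 256)
```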