Lesson 6.5: Knowledge Distillation (KD)


Knowledge Distillation (KD) is a model compression technique where a smaller, more efficient model (the student) is trained to replicate the behavior of a larger, more powerful model (the teacher). The goal is to retain the teacher’s performance while reducing computational costs, memory usage, and latency. Below, we break down how KD works in LLMs, its key components, and different approaches.

Core Components of Knowledge Distillation

  • (1) Teacher Model

    • A large, high-performance LLM (e.g., GPT-4, LLaMA-2-70B, PaLM).
    • Acts as the "expert" whose knowledge is transferred to the student.
  • (2) Student Model

    • A smaller, lightweight model (e.g., DistilBERT, TinyLLaMA, GPT-Neo).
    • Designed for efficiency (faster inference, lower memory footprint).
  • (3) Knowledge Transfer Mechanisms

    • Soft Targets: Probabilistic outputs from the teacher (not just hard labels).
    • Temperature Scaling: Smooths probability distributions so the student can learn from them more easily (see the sketch after this list).
    • Loss Functions: Combines distillation loss with standard supervised loss.
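A minimal PyTorch sketch of how temperature scaling turns a teacher’s sharp output distribution into soft targets; the logits and the temperature value below are made-up for illustration, not the output of any real model.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for a 3-class problem (made-up values).
teacher_logits = torch.tensor([[4.0, 1.5, 0.5]])

# Standard softmax (T = 1): a sharp distribution, close to a hard label.
sharp = F.softmax(teacher_logits, dim=-1)              # ~[0.90, 0.07, 0.03]

# Temperature-scaled softmax (T > 1): smoother soft targets that also
# reveal how plausible the non-top classes are ("dark knowledge").
T = 4.0
soft_targets = F.softmax(teacher_logits / T, dim=-1)   # ~[0.51, 0.27, 0.21]

print(sharp, soft_targets)
```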

How Knowledge Distillation Works

  • Step 1: Generate Soft Targets
    • Instead of training the student on hard labels (e.g., "the correct answer is A"), the teacher provides soft targets—a probability distribution over possible outputs.
    • Example: Text Classification
      • Teacher Output (Soft Targets): [0.7 (class A), 0.2 (class B), 0.1 (class C)]
      • Student Learns:
        • Not just that "A is correct," but also that B and C are plausible alternatives.
        • Preserves generalization better than hard labels.
  • Step 2: Train the Student Model
    • Distillation Loss (KL Divergence):
      • Minimizes the difference between the teacher’s and the student’s soft targets.
      • Measures how far the student’s probabilities are from the teacher’s.
      • Goal: Make the student’s outputs match the teacher’s nuanced probabilities.
    • Supervised Loss (Cross-Entropy):
      • Ensures the student also learns from the ground-truth labels.
      • Goal: Don’t ignore the original training labels.
      • A combined loss with both terms is sketched after this list.
  • Step 3: Temperature Scaling
    • Adjusting the softmax temperature smooths the probability distributions, enhancing the student’s ability to learn implicit knowledge.
    • Without temperature scaling, the teacher’s probabilities might be too "sharp" (e.g., [0.99, 0.01]).
    • The student then learns that "only class A matters" and ignores useful hints (e.g., "class B is slightly possible").
    • Low Temperature (T=1): Sharp probabilities (close to hard labels).
    • High Temperature (T>1): Smoother probabilities, revealing "dark knowledge" (e.g., "B is more likely than C").
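Putting Steps 1–3 together, the sketch below combines the temperature-scaled KL distillation term with the standard cross-entropy term. The temperature T, the weighting alpha, and the toy logits/labels are assumed values for illustration, not settings from any particular paper or model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the teacher and log-probabilities from the student,
    # both computed at the same temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # Distillation loss: KL divergence between student and teacher distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Supervised loss: standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # alpha balances "imitate the teacher" vs. "fit the original labels".
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 4 examples, 3 classes, random logits.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()   # in real training this would drive updates to the student
```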

Black-Box vs. White-Box Knowledge Distillation (KD)

In Knowledge Distillation (KD), the method of transferring knowledge from a teacher model to a student model can be categorized based on how much internal information from the teacher is used.

  • Black-Box Distillation
    • Only the final outputs (predictions) of the teacher model are used.
    • The student never sees the teacher’s internal weights, attention patterns, or hidden states.
    • Advantages
      • Simple to implement (no access to the teacher’s internals is required).
      • Works with third-party APIs (e.g., distilling GPT-4 without access to its weights).
    • Limitations
      • Less efficient (ignores the teacher’s rich internal representations).
      • Student may miss subtle patterns (e.g., how the teacher handles rare words).
  • White-Box Distillation
    • Leverages the teacher’s internal mechanisms: hidden-layer activations, attention weights, and gradient signals (see the sketch after this list).
    • Advantages
      • More accurate (student learns how the teacher reasons).
      • Better for small students (e.g., compressing BERT → TinyBERT).
    • Limitations
      • Requires full access to the teacher’s architecture/weights.
      • Computationally heavier.
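To contrast the two approaches, here is a toy white-box sketch in plain PyTorch: besides matching the teacher’s output distribution, the student’s intermediate activations are projected up and aligned with the teacher’s. The model sizes, temperature, and equal loss weights are arbitrary assumptions; real white-box setups (e.g., TinyBERT) also align attention maps of full transformer models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the teacher and student; in practice these would be
# full transformer models (sizes here are arbitrary assumptions).
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))

# Projection from the student's hidden size (32) to the teacher's (64)
# so their internal representations can be compared directly.
proj = nn.Linear(32, 64)

x = torch.randn(8, 16)                 # a toy batch of inputs
labels = torch.randint(0, 3, (8,))
T = 2.0

# White-box access: read the teacher's hidden activations, not just its logits.
with torch.no_grad():
    t_hidden = teacher[1](teacher[0](x))
    t_logits = teacher[2](t_hidden)

s_hidden = student[1](student[0](x))
s_logits = student[2](s_hidden)

# 1) Match output distributions (this part alone would be black-box KD).
logit_loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                      F.softmax(t_logits / T, dim=-1),
                      reduction="batchmean") * (T ** 2)

# 2) Match internal representations (the white-box part).
hidden_loss = F.mse_loss(proj(s_hidden), t_hidden)

# 3) Keep learning from the ground-truth labels.
ce_loss = F.cross_entropy(s_logits, labels)

loss = logit_loss + hidden_loss + ce_loss   # equal weights chosen arbitrarily
loss.backward()
```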
