Lesson 6.4.4: DPO
Direct Preference Optimization (DPO) is a simpler, more efficient alternative to RLHF that eliminates the need for a separate reward model and for reinforcement learning. Instead, it optimizes the LLM directly on preference data, using a loss function derived from human rankings. The main idea is that the reward model becomes obsolete: the LLM learns directly to increase the probability of completions that humans preferred and to decrease the probability of the less preferred completions.
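Concretely, the objective introduced in the original DPO paper (Rafailov et al., 2023) is a binary cross-entropy loss over preference pairs. Writing $x$ for the prompt, $y_w$ and $y_l$ for the preferred and rejected completions, $\pi_\theta$ for the model being trained, $\pi_{\mathrm{ref}}$ for a frozen reference model, and $\beta$ for a scaling coefficient, the loss is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss pushes the model to assign relatively more probability to $y_w$ and relatively less to $y_l$ than the reference model does, which is exactly the behavior described above.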
How DPO Works
- Pretraining a Base Model:
  - As with RLHF, a base LLM is pretrained on a large text dataset.
- Collecting Human Preferences:
  - We let the LLM produce pairs of candidate completions, and human annotators judge which of the two is better (e.g., "Response A is preferred over Response B").
  - As with RLHF, the better completion is labelled as the preferred (chosen) response and the worse one as the rejected response.
- Direct Optimization:
  - Instead of training a reward model, DPO uses a cross-entropy-based loss function (see the sketch after this list) to:
    - Increase the probability of preferred responses.
    - Decrease the probability of rejected responses.
  - A reference model (usually the initial SFT model) ensures the LLM doesn’t drift too far from its original behavior.
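To make the "Direct Optimization" step concrete, here is a minimal PyTorch sketch of the DPO loss. The function name `dpo_loss` and the toy numbers are illustrative; it assumes you have already summed the per-token log-probabilities of each chosen and rejected completion under both the policy being trained and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities, each of shape (batch,)."""
    # Log-ratios between the policy and the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Binary cross-entropy on the scaled margin: raise the relative
    # probability of the chosen completion, lower that of the rejected one.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy example with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -13.5]),
    ref_chosen_logps=torch.tensor([-12.5, -15.5]),
    ref_rejected_logps=torch.tensor([-13.0, -14.0]),
)
print(loss.item())
```

Note that the reference model enters only through the log-ratios: if the policy exactly matches the reference, the logits are zero and the loss is $\log 2$, so the gradient comes from widening the chosen-vs-rejected margin relative to the reference.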
Advantages Over RLHF
- Simpler: No reward model or RL needed.
- More Stable: Uses standard gradient descent instead of RL.
- Faster Training: Fewer computational steps.
- Better Performance: Often outperforms RLHF in practice.
When to Use DPO?
- When you want efficient preference-based fine-tuning (a typical setup is sketched below).
- When RLHF is too complex or computationally expensive.
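In practice you rarely implement the loss yourself; libraries such as Hugging Face TRL ship a `DPOTrainer`. The following is a minimal, hedged sketch of such a run: the model name, toy dataset, and hyperparameters are purely illustrative, and the exact constructor arguments vary between TRL versions, so check the documentation of the version you install.

```python
# Hedged sketch of preference fine-tuning with TRL's DPOTrainer.
# Argument names differ across TRL versions (e.g. `processing_class` vs `tokenizer`).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "gpt2"  # any causal LM; a small model keeps the example cheap
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Tiny toy preference dataset: prompt, preferred ("chosen") and rejected completions.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain overfitting in one sentence."],
    "chosen": ["Overfitting is when a model memorizes the training data and fails to generalize."],
    "rejected": ["Overfitting is when a model is too small."],
})

# beta controls how strongly the policy is kept close to the reference model.
training_args = DPOConfig(output_dir="dpo-toy", beta=0.1, per_device_train_batch_size=1)

trainer = DPOTrainer(
    model=model,
    ref_model=None,               # TRL clones the model as the frozen reference
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,   # older TRL versions use `tokenizer=` instead
)
trainer.train()
```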
Comparison Summary
| Aspect | RLHF | DPO |
|---|---|---|
| Mechanism | Uses reinforcement learning (PPO) + a reward model | Direct optimization on preference data |
| Training Steps | Complex (reward model + RL fine-tuning) | Simple (single loss function) |
| Stability | Less stable (RL tuning challenges) | More stable (standard gradient-based optimization) |
| Compute Cost | High (requires a reward model + RL) | Lower (no RL or reward model) |
| Performance | Good, but sensitive to reward model quality | Often better and more consistent |
Limitations
- Data Quality Dependency: Requires clean, diverse preference pairs; noisy data degrades results.
- Reference Model Bias: Performance hinges on the quality of the initial reference model.
- Scalability: Less tested on models with >100B parameters.