Lesson 6.4.4: DPO
Direct Preference Optimization (DPO) is a simpler, more efficient alternative to RLHF that eliminates the need for a separate reward model and for reinforcement learning. Instead, it optimizes the LLM directly on preference data, using a loss function derived from human rankings. The main idea is that the reward model becomes obsolete: the LLM learns directly to increase the probability of completions that humans preferred and to decrease the probability of the less preferred completions.
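Concretely, the objective introduced in the original DPO paper (Rafailov et al., 2023) is a binary cross-entropy loss over preference pairs. Writing $x$ for the prompt, $y_w$ and $y_l$ for the preferred and rejected completions, $\pi_\theta$ for the model being trained, $\pi_{\mathrm{ref}}$ for a frozen reference model, and $\beta$ for a scaling coefficient, the loss is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss pushes the model to assign relatively more probability to $y_w$ and relatively less to $y_l$ than the reference model does, which is exactly the behavior described above.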
How DPO Works
- Pretraining a Base Model:
  - As with RLHF, a base LLM is pretrained on a large text dataset.
- Collecting Human Preferences:
  - We let the LLM produce pairs of candidate completions, and human annotators judge which of the two is better (e.g., "Response A is preferred over Response B").
  - As with RLHF, the better completion is labelled as the preferred (chosen) response and the worse one as the rejected response.
- Direct Optimization:
  - Instead of training a reward model, DPO uses a cross-entropy-based loss function (see the sketch after this list) to:
    - Increase the probability of preferred responses.
    - Decrease the probability of rejected responses.
  - A reference model (usually the initial SFT model) ensures the LLM doesn’t drift too far from its original behavior.
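To make the "Direct Optimization" step concrete, here is a minimal PyTorch sketch of the DPO loss. The function name `dpo_loss` and the toy numbers are illustrative; it assumes you have already summed the per-token log-probabilities of each chosen and rejected completion under both the policy being trained and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities, each of shape (batch,)."""
    # Log-ratios between the policy and the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Binary cross-entropy on the scaled margin: raise the relative
    # probability of the chosen completion, lower that of the rejected one.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy example with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -13.5]),
    ref_chosen_logps=torch.tensor([-12.5, -15.5]),
    ref_rejected_logps=torch.tensor([-13.0, -14.0]),
)
print(loss.item())
```

Note that the reference model enters only through the log-ratios: if the policy exactly matches the reference, the logits are zero and the loss is $\log 2$, so the gradient comes from widening the chosen-vs-rejected margin relative to the reference.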
Advantages Over RLHF
- Simpler: No reward model or RL needed.
- More Stable: Uses standard gradient descent instead of RL.
- Faster Training: Fewer computational steps.
- Better Performance: Often outperforms RLHF in practice.
When to Use DPO?
- When you want efficient preference-based fine-tuning (a typical setup is sketched below).
- When RLHF is too complex or computationally expensive.
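In practice you rarely implement the loss yourself; libraries such as Hugging Face TRL ship a `DPOTrainer`. The following is a minimal, hedged sketch of such a run: the model name, toy dataset, and hyperparameters are purely illustrative, and the exact constructor arguments vary between TRL versions, so check the documentation of the version you install.

```python
# Hedged sketch of preference fine-tuning with TRL's DPOTrainer.
# Argument names differ across TRL versions (e.g. `processing_class` vs `tokenizer`).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "gpt2"  # any causal LM; a small model keeps the example cheap
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Tiny toy preference dataset: prompt, preferred ("chosen") and rejected completions.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain overfitting in one sentence."],
    "chosen": ["Overfitting is when a model memorizes the training data and fails to generalize."],
    "rejected": ["Overfitting is when a model is too small."],
})

# beta controls how strongly the policy is kept close to the reference model.
training_args = DPOConfig(output_dir="dpo-toy", beta=0.1, per_device_train_batch_size=1)

trainer = DPOTrainer(
    model=model,
    ref_model=None,               # TRL clones the model as the frozen reference
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,   # older TRL versions use `tokenizer=` instead
)
trainer.train()
```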
Comparison Summary
| Aspect | RLHF | DPO |
|---|---|---|
| Mechanism | Uses reinforcement learning (PPO) + a reward model | Direct optimization on preference data |
| Training Steps | Complex (reward model + RL fine-tuning) | Simple (single loss function) |
| Stability | Less stable (RL tuning challenges) | More stable (standard gradient-based optimization) |
| Compute Cost | High (requires a reward model + RL) | Lower (no RL or reward model) |
| Performance | Good, but sensitive to reward model quality | Often better and more consistent |
Limitations
- Data Quality Dependency: Requires clean, diverse preference pairs; noisy data degrades results.
- Reference Model Bias: Performance hinges on the quality of the initial reference model.
- Scalability: Less tested on models with >100B parameters.