The classic RLHF pipeline is complex: SFT → train reward model → PPO fine-tunes the policy with the reward model + KL penalty. Many moving parts, hard to train stably, needs 4 models simultaneously (policy + reference + reward + critic).
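DPO (Rafailov et al., 2023): "RLHF without RL". Insight: the KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself and the policy can be trained directly on preference data, with no separate reward model and no PPO.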
Math intuition:
- Start with the Bradley-Terry preference model: P(y_w > y_l | x) = σ(r(x,y_w) - r(x,y_l)).
- In RLHF under a KL constraint, the optimal policy is: π(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β).
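- Rearrange: r(x,y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x), where Z(x) is the normalizer of the optimal policy. It depends only on x, so it cancels in the reward difference.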
- Substitute into Bradley-Terry → loss function uses only policy and reference, no separate reward model needed:
L_DPO = -E[ log σ( β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]
where:
- y_w: chosen response
- y_l: rejected response
- π_ref: SFT model (frozen)
- β: temperature (0.1–0.5)
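The substitution step can be written out in full. This is a sketch of the standard derivation; Z(x) denotes the normalizer of the optimal policy and is notation introduced here for illustration. The point is that Z(x) appears in both rewards and drops out of their difference, which is why the loss needs only π_θ and π_ref.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big),
\qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big) \\
r(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
r(x,y_w) - r(x,y_l) &= \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{align*}
```

Plugging the last line into σ(·) from the Bradley-Terry model and taking the negative log-likelihood gives L_DPO.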
Workflow:
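1. Start from an SFT model: a frozen copy serves as π_ref and the trainable policy π_θ is initialized from it.
2. Collect preference data: for each prompt, a (chosen, rejected) pair labeled by humans or by an AI judge.
3. Train with the DPO loss on the pairs → the policy raises the log-probability of chosen relative to rejected (both measured against π_ref), with β controlling how far it may move from π_ref.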
4. No reward model, no PPO, no rollout.
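Step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the TRL implementation: it assumes the sequence log-probabilities (sums of token log-probs) under the policy and the reference have already been computed, and the function and argument names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities, one entry per pair.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_logp_chosen, policy_logp_rejected,
                              ref_logp_chosen, ref_logp_rejected):
        # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio).
        margin = beta * ((pc - rc) - (pr - rr))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
start = dpo_loss([-5.0], [-7.0], [-5.0], [-7.0])
# Raising the chosen log-prob relative to the reference lowers the loss.
improved = dpo_loss([-4.0], [-7.0], [-5.0], [-7.0])
print(start, improved)
```

In a real training loop the same expression is computed on tensors so that gradients flow into the policy log-probabilities only; the reference terms are constants.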
Advantages over RLHF/PPO:
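- Much simpler: only a supervised-style loss on preference pairs, essentially an extra SFT-like stage.
- More stable: no on-policy sampling and no critic, so fewer RL-specific failure modes (policy drift, exploiting an imperfect learned reward model).
- Less memory: 2 models (policy + ref) instead of 4.
- Short code: HF TRL DPOTrainer is ~5 lines of setup.
- Reported performance is comparable to or better than PPO on datasets such as Anthropic HH and UltraFeedback, though results depend on the setup.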
Limitations:
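- Not online: trains offline on a fixed preference dataset. RLHF with PPO can sample fresh responses from the current policy and score them during training.
- Sensitive to β: this hyperparameter needs tuning.
- Preference-data overfitting: small or biased datasets skew the policy.
- No explicit reward model: the reward exists only implicitly as β · log(π_θ/π_ref), so the policy learns from pairwise comparisons and there is no standalone scorer to reuse elsewhere.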
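One way to see why β matters is to look at how strongly a single pair pushes on the policy. For the loss -log σ(β · gap), where gap is the chosen log-ratio minus the rejected log-ratio, the gradient magnitude with respect to gap is β · σ(-β · gap). The sketch below just evaluates that expression; the helper name is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dpo_grad_scale(log_ratio_gap: float, beta: float) -> float:
    """Magnitude of d(loss)/d(gap) for loss = -log(sigmoid(beta * gap))."""
    return beta * sigmoid(-beta * log_ratio_gap)


# Compare how quickly the push on a pair fades as the gap grows.
for beta in (0.1, 0.5):
    scales = [dpo_grad_scale(gap, beta) for gap in (0.0, 5.0, 20.0)]
    print(beta, [round(s, 5) for s in scales])
```

With a larger β, pairs that are already well separated stop contributing sooner, which keeps the policy closer to π_ref; with a smaller β, the policy keeps being pushed further from the reference.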
Variants:
- IPO (Azar 2023): fixes DPO overfitting with an identity preference mapping.
- KTO (Ethayarajh 2024): needs only single labels (good/bad) instead of pairs, so data is easier to collect.
- ORPO (Hong 2024): merges SFT + preference into one loss, skipping the separate SFT step.
- SimPO (Meng 2024): drops the reference model, simpler still.
- GRPO (DeepSeek): not a DPO variant but an RL method (PPO-style, without a critic), used in R1.
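To make the differences concrete, here is a hedged sketch of the per-pair losses for two of the variants above, written from the formulas in the respective papers as I understand them. The function names and default hyperparameter values are illustrative assumptions, not library APIs.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared error pulling the log-ratio gap toward 1 / (2 * tau)."""
    gap = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (gap - 1.0 / (2.0 * tau)) ** 2


def simpo_pair_loss(policy_logp_w, policy_logp_l, len_w, len_l,
                    beta=2.0, gamma=1.0):
    """SimPO: length-normalized log-probs as rewards, target margin gamma,
    and no reference model."""
    margin = beta * (policy_logp_w / len_w - policy_logp_l / len_l) - gamma
    return -log_sigmoid(margin)


# IPO loss is zero once the gap reaches its target (here 1 / (2 * 0.1) = 5).
print(ipo_pair_loss(-5.0, -7.0, -10.0, -7.0, tau=0.1))
# SimPO only looks at the policy's own average per-token log-probabilities.
print(simpo_pair_loss(-10.0, -20.0, 10, 20, beta=2.0, gamma=0.0))
```

Unlike the DPO loss, the IPO objective is bounded from below at a finite gap, which is the mechanism meant to limit overfitting; SimPO saves the memory of a second model at the cost of losing the explicit anchor to π_ref.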
When to use:
- DPO: default for open-source alignment fine-tuning (Zephyr, Tulu, and part of the Llama 3 Instruct recipe).
- PPO: still used for online RLHF (OpenAI, Anthropic production pipelines); often considered more robust to noisy preference data.
- RLAIF / Constitutional AI with DPO: replace human preference labels with AI-generated ones (the AI-feedback approach associated with Claude).
Data format (HF TRL):
{
    "prompt": "Explain quantum entanglement",
    "chosen": "Quantum entanglement is...",   # better response
    "rejected": "Quantum physics stuff..."    # worse response
}
Code:
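from trl import DPOConfig, DPOTrainer

# sft_model, sft_model_frozen, pref_dataset and tokenizer are assumed to be loaded already.
trainer = DPOTrainer(
    model=sft_model,               # policy, initialized from the SFT checkpoint
    ref_model=sft_model_frozen,    # frozen reference; pass None to let TRL copy the policy
    args=DPOConfig(beta=0.1, learning_rate=5e-7, num_train_epochs=1),
    train_dataset=pref_dataset,    # rows with "prompt", "chosen", "rejected"
    tokenizer=tokenizer,           # newer TRL releases name this argument processing_class
)
trainer.train()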
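For the preference dataset passed to the trainer, records in the prompt/chosen/rejected format could be assembled from judge scores along these lines. This is a sketch under assumptions: the helper name, the (response, score) input shape, and the best-versus-worst pairing rule are all hypothetical choices, not part of TRL.

```python
def build_preference_record(prompt, scored_responses):
    """Turn judge-scored responses into one (chosen, rejected) record.

    scored_responses: list of (response_text, score) tuples, where a higher
    score means the human or AI judge preferred that response.
    Returns None when the scores carry no preference signal.
    """
    if len(scored_responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best[1] == worst[1]:
        return None  # all responses tied: skip this prompt
    return {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}


record = build_preference_record(
    "Explain quantum entanglement",
    [("Quantum physics stuff...", 2.0), ("Quantum entanglement is...", 9.0)],
)
print(record)
```

Pairing the best against the worst response maximizes the score gap per prompt; other schemes (for example, pairing every higher-scored response with every lower-scored one) yield more pairs from the same labels.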