How does DPO (Direct Preference Optimization) work? Why is it replacing RLHF?

Question

Luyện Phỏng Vấn IT · Accepted Answer

The classic RLHF pipeline is complex: SFT → train a reward model → fine-tune the policy with PPO using the reward model plus a KL penalty. It has many moving parts, is hard to train stably, and needs 4 models in memory at once (policy + reference + reward + critic).

DPO (Rafailov et al., 2023) is "RLHF without RL". The insight: the optimal policy has a closed form, so it can be learned directly from preference data without a separate reward model or PPO.

Math intuition:
- Start from the Bradley-Terry preference model: P(yw > yl | x) = σ(r(x,yw) - r(x,yl)).
- Under RLHF with a KL constraint, the optimal policy is: π(y|x) ∝ πref(y|x) · exp(r(x,y)/β).
- Rearrange: r(x,y) = β · log(π(y|x) / πref(y|x)), up to a term β · log Z(x) that depends only on the prompt x.
- Substitute into Bradley-Terry. The prompt-only term cancels in the difference r(x,yw) - r(x,yl), leaving a loss that involves only the policy and the reference, with no separately trained reward model:
  L_DPO = -E[ log σ( β · log(π(yw|x) / πref(yw|x)) - β · log(π(yl|x) / πref(yl|x)) ) ]
- yw: chosen response, yl: rejected response, πref: SFT model (frozen), β: temperature (typically 0.1-0.5).

Workflow:
1. Start with an SFT model (this is πref).
2. Collect preference data: for each prompt, a (chosen, rejected) pair labeled by a human or an AI judge.
3. Train with the DPO loss on the pairs. The new policy raises the likelihood of chosen responses and lowers that of rejected ones, while β implicitly regularizes it toward πref (the KL constraint).
4. No reward model, no PPO, no rollouts.

Advantages over RLHF/PPO:
- Much simpler: it is just a supervised loss, like an extra SFT stage.
- More stable: there are no RL training dynamics to manage (policy drift, reward hacking against a learned reward model).
- Less memory: 2 models (policy + reference) instead of 4.
- Short code: HF TRL's DPOTrainer takes about 5 lines of setup.
- Reported performance is comparable to or better than PPO on Anthropic HH and UltraFeedback benchmarks.

Limitations:
- Not online: training is offline on a fixed preference dataset, whereas RLHF-PPO can generate and score responses at training time.
- Sensitive to β: tuning this hyperparameter matters.
- Overfits the preference data: if the dataset is small or biased, the policy is skewed accordingly.
- Learns no explicit reward concept: it only learns pairwise patterns.

Variants:
- IPO (Azar et al., 2023): addresses DPO overfitting with an identity preference mapping.
- KTO (Ethayarajh et al., 2024): needs only a single label (good/bad) per response instead of a pair, so data is easier to collect.
- ORPO (Hong et al., 2024): merges SFT and preference optimization into one loss, dropping the separate SFT step.
- SimPO (Meng et al., 2024): drops the reference model for further simplification.
- GRPO (DeepSeek): RL-based rather than a DPO variant, used in R1.

When to use which:
- DPO: the default for alignment fine-tuning of open-source models (Zephyr, Tulu, and Llama 3 Instruct use DPO in part).
- PPO: still used for online RLHF (OpenAI and Anthropic production pipelines); more robust to noisy preference data.
- RLAIF / Constitutional AI with DPO: replaces human preferences with AI preferences (the Claude approach).

Data format (HF TRL): each training example carries a prompt, a chosen response, and a rejected response.
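To make the loss concrete, here is a minimal sketch in plain Python of one preference record and the per-pair DPO loss derived above. It is an illustration under stated assumptions, not TRL's implementation: the `prompt`/`chosen`/`rejected` keys follow the pair layout described in the workflow, the record contents are made up, and `dpo_loss` and `log_sigmoid` are hypothetical helper names. The log-probabilities passed in stand for summed token log-probs of each full response under the policy and the frozen reference.

```python
import math

# Hypothetical preference record: one prompt with a chosen and a rejected response.
example = {
    "prompt": "Explain KL divergence in one sentence.",
    "chosen": "KL divergence measures how much one probability distribution "
              "differs from a reference distribution.",
    "rejected": "It is a kind of distance.",
}


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Implicit rewards (divided by beta): log(pi / pi_ref) for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2) regardless of the actual log-probabilities.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability toward the chosen response, the loss drops.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In real training the four log-probabilities would come from forward passes of the policy and the reference over the tokenized prompt plus response, and the loss would be averaged over a batch. Only the policy receives gradients; the reference stays frozen.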