What is speculative decoding? Why does it cut inference latency by 2-3x?

Question

Luyện Phỏng Vấn IT · Accepted Answer

The bottleneck in LLM decoding is memory bandwidth: every generated token requires reading all of the model weights from VRAM, so GPU compute sits mostly idle during decode. Speculative decoding (Leviathan 2022, Chen 2023) exploits that idle compute: a small draft model quickly proposes a few tokens, and the large target model verifies them in parallel. Draft tokens that pass verification are accepted; at the first one that fails, the remaining draft tokens are discarded and the target model's output is used instead.

Mechanism:
1. The draft model (small, fast) generates K candidate tokens: t1, t2, ..., tK.
2. The target model (large, accurate) runs one batched forward pass over all K+1 positions in parallel, which yields its probability distribution at each position.
3. Verify sequentially from left to right using rejection sampling: for each draft token tᵢ, accept it if p_target(tᵢ) ≥ p_draft(tᵢ); otherwise accept it with probability p_target(tᵢ) / p_draft(tᵢ).
4. If tⱼ is rejected, sample a replacement token at position j from an adjusted distribution and discard every draft token after j. If all K draft tokens are accepted, the extra (K+1)-th position supplies one more token from the target model.

Important: the output distribution is identical to that of the target model running alone (mathematically equivalent). There is no quality trade-off.

Why it is faster:
- The target model runs 1 forward pass for K+1 tokens instead of K+1 passes. Because decode is bandwidth-bound and compute is underused, a forward pass over K+1 positions takes almost the same time as a pass over 1 token.
- The draft model is much smaller (e.g. 1B vs 70B), so generating K tokens is very cheap.
- A high draft acceptance rate means more tokens accepted per round, and therefore a larger speedup.

Real-world speedup: 2-3x for LLaMA 70B with a 7B draft model; possibly 4-5x when the continuation is easy to predict (code, boilerplate).

Variants:
1. Vanilla speculative decoding: the draft model is trained separately, or an existing smaller model from the same family is used.
2. Medusa: instead of a separate draft model, extra decoding heads are attached to the target model; the heads predict the next N tokens in parallel. No extra model is needed.
3. Lookahead decoding: uses an n-gram cache built from earlier output to guess tokens, then verifies them.
4. Self-speculative: uses layer skipping or early exit in the target model itself as the draft.
5. EAGLE / EAGLE-2: trains a lightweight draft network on the target model's features; higher accuracy than Medusa.
6. Prompt Lookup Decoding: for tasks with long inputs (code, document Q&A), guesses tokens by matching against the prompt, which gives a high speedup on code and summarization.

Challenges:
- Draft quality matters: a poor draft means a low acceptance rate and no speedup (it can even be slower). Baseline: a draft model from the same family at about 1/10 the size.
- KV cache overhead: KV caches must be kept for both the draft and the target model.
- Batching is harder: scheduling multiple requests under speculative decoding is more complex.
- Not always worth it for small target models (< 7B): the draft overhead is not recovered.

Support: vLLM (speculative_config), TensorRT-LLM, SGLang, llama.cpp (--model-draft).

When to use: latency-sensitive workloads (chat, coding assistants) with a large target model (≥ 13B). Not worth it for throughput-oriented batch APIs: large batches already keep the GPU busy, so there is little idle compute left for verification.
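
The accept/reject rule in steps 3 and 4 of the mechanism can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of any library listed above: the names `sample`, `residual`, and `verify` are hypothetical, distributions are plain lists over a toy vocabulary, and the adjusted distribution is assumed to be the normalized positive part of p_target - p_draft, as in the standard speculative sampling formulation.

```python
import random


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def residual(p_target, p_draft):
    """Adjusted distribution used after a rejection: normalize max(0, p_target - p_draft)."""
    diff = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(diff)
    # total == 0 only when both distributions are identical, in which case a
    # rejection cannot occur; fall back to p_target so the function stays defined.
    return [d / total for d in diff] if total > 0 else list(p_target)


def verify(draft_tokens, p_draft, p_target, rng):
    """Return the tokens emitted by one speculative round (hypothetical helper).

    draft_tokens: K token ids proposed by the draft model.
    p_draft:      K distributions the draft tokens were sampled from.
    p_target:     K+1 distributions from the single target forward pass.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept outright when p_target >= p_draft, else with probability p_target / p_draft.
        accept_prob = min(1.0, p_target[i][tok] / p_draft[i][tok])
        if rng.random() < accept_prob:
            out.append(tok)
        else:
            # First rejection: emit a corrected token and drop the remaining drafts.
            out.append(sample(residual(p_target[i], p_draft[i]), rng))
            return out
    # All K drafts accepted: the extra position yields one more target token.
    out.append(sample(p_target[len(draft_tokens)], rng))
    return out
```

Every round emits at least one token, so decoding always makes progress even when the first draft token is rejected. When the draft and target distributions are identical, every draft token is accepted and a round over K draft tokens returns K+1 tokens.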
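
The link between acceptance rate and speedup in "Why it is faster" can be made concrete with a short calculation. This is a simplified model, not a measurement: it assumes each draft token is accepted independently with the same probability α (a symbol introduced here for illustration), and it counts the one extra token that every round emits, whether corrected or bonus.

```latex
\mathbb{E}[\text{tokens per round}]
  = \sum_{i=0}^{K} \alpha^{i}
  = \frac{1 - \alpha^{K+1}}{1 - \alpha}
% Illustrative values (assumed, not measured): \alpha = 0.8,\ K = 4
% \Rightarrow \frac{1 - 0.8^{5}}{1 - 0.8} = \frac{0.67232}{0.2} \approx 3.36
```

Under these assumptions one target forward pass yields about 3.36 tokens instead of 1. The wall-clock speedup is lower than that, because the time spent running the draft model has to be subtracted.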