The bottleneck in LLM decoding is memory bandwidth: at small batch sizes, every decode step has to read all model weights (plus the KV cache) from VRAM to produce a single token. GPU compute sits mostly idle during decode.
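As a rough back-of-envelope check of the bandwidth argument, the sketch below estimates a lower bound on per-token decode latency as weight bytes divided by memory bandwidth. The model size, precision, and bandwidth figure are illustrative assumptions, not measurements, and the estimate ignores KV-cache reads and multi-GPU communication.

```python
def min_decode_latency_ms(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per decoded token at batch size 1.

    Every weight has to be streamed from VRAM once per token, so a step
    cannot finish faster than (weight bytes) / (memory bandwidth).
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# Assumed example: 70B parameters in 16-bit precision, 2 TB/s aggregate bandwidth.
latency_ms = min_decode_latency_ms(70e9, 2, 2e12)
print(f"{latency_ms:.0f} ms/token, about {1e3 / latency_ms:.1f} tokens/s at best")
```

Under these assumed numbers the floor is about 70 ms per token no matter how fast the arithmetic units are, which is the slack speculative decoding tries to use.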
The idea behind speculative decoding (Leviathan 2022, Chen 2023): a small draft model quickly proposes a few tokens, then the large target model verifies them all in parallel. Draft tokens the target agrees with are accepted; at the first rejection, the remaining draft tokens are discarded and the target supplies the token at that position.
Mechanism:
1. Draft model (small, fast) generates K candidate tokens: t1, t2, ..., tK.
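2. Target model (big, accurate) runs 1 forward pass over all K+1 positions in parallel (batched) → one next-token distribution per position: K to check the draft tokens, plus one for the position after tK.
3. Verify left-to-right: for each draft token tᵢ, apply rejection sampling. Accept if p_target(tᵢ) ≥ p_draft(tᵢ); otherwise accept with probability p_target(tᵢ)/p_draft(tᵢ).
4. If tⱼ is rejected → sample a replacement token at position j from the adjusted distribution norm(max(0, p_target − p_draft)), then discard tⱼ₊₁..tK and their KV-cache entries. If all K tokens are accepted → sample one extra token from the target's distribution at position K+1.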
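The verification steps above can be sketched on toy distributions as follows. This is a minimal illustration, assuming the draft and target distributions are already available as plain lists of probabilities; `sample` and `verify_draft` are hypothetical helper names, not part of any library. When a rejection happens, the residual max(0, p − q) is used as sampling weights (renormalisation is implicit), and when every draft token passes, one extra token is drawn from the target's last distribution.

```python
import random


def sample(dist, rng):
    """Draw an index using the entries of dist as (unnormalised) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """One verification round of speculative sampling.

    draft_tokens: the K tokens proposed by the draft model.
    q_dists[i]:   draft distribution the i-th token was sampled from.
    p_dists[i]:   target distribution at the same position; p_dists has
                  K + 1 entries, the last one is for the position after
                  the final draft token.
    Returns the tokens to append to the sequence (between 1 and K + 1).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and drop
        # every draft token after this position.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(residual, rng))
        return accepted
    # All K drafts accepted: the same target pass also yields one more token.
    accepted.append(sample(p_dists[len(draft_tokens)], rng))
    return accepted


rng = random.Random(0)
q = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]                 # made-up draft dists
p = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]  # made-up target dists
draft = [sample(qi, rng) for qi in q]
out = verify_draft(draft, q, p, rng)
print("draft:", draft, "emitted:", out)
```

Every round emits at least one token, so even a fully rejected draft makes the same progress as one ordinary decode step of the target.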
Crucially, the output distribution is identical to sampling from the target model alone (mathematically equivalent, up to floating-point noise). No quality trade-off.
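The equivalence can be checked exactly for a single position. The sketch below computes the probability that each token is emitted, either as an accepted draft or as a resample from the residual max(0, p − q), and compares it with the target distribution. The two distributions are made-up numbers, and `output_distribution` is a hypothetical helper written for this check.

```python
def output_distribution(p, q):
    """Exact distribution of the token emitted at one position.

    A token x is emitted either because the draft proposed x and it was
    accepted, or because some proposal was rejected and x was drawn from
    the renormalised residual max(0, p - q).
    """
    # P(draft proposes x and x is accepted) = q(x) * min(1, p(x)/q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q, so nothing is ever rejected
        return accept
    return [a + reject_prob * r / total for a, r in zip(accept, residual)]


p = [0.5, 0.4, 0.1]   # target distribution (made-up numbers)
q = [0.6, 0.3, 0.1]   # draft distribution (made-up numbers)
result = output_distribution(p, q)
assert all(abs(r - pi) < 1e-12 for r, pi in zip(result, p))
```

The check passes for any pair of distributions because the total rejection probability equals the mass of the residual, so the two terms add up to min(p, q) + max(0, p − q) = p.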
Why it's faster:
- The target runs 1 forward pass over K+1 positions instead of up to K+1 sequential passes. Because compute is spare during decode (bandwidth-bound), the batched forward takes nearly the same time as a single-token step, as long as K is small.
- The draft model is much smaller (e.g. 1B vs 70B) → generating K tokens is cheap.
- High draft-acceptance rate → many tokens accepted per round → speedup.
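To make the trade-off in the list above concrete, here is a small estimate based on the analysis in Leviathan et al. It assumes each draft token is accepted independently with probability `alpha`, and that one target pass over K+1 positions costs the same as a single-token step. `cost_ratio` (draft step time divided by target step time) and the sample values are illustrative assumptions.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha; the round always emits at least one token.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One round costs k draft steps plus one target pass, measured in
    units of a single target step.
    """
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1)


for alpha in (0.3, 0.6, 0.8):
    for cost_ratio in (0.05, 0.2):
        s = expected_speedup(alpha, k=4, cost_ratio=cost_ratio)
        print(f"alpha={alpha:.1f} cost_ratio={cost_ratio:.2f} speedup={s:.2f}x")
```

With a low acceptance rate and a relatively expensive draft, the estimate drops below 1x, meaning speculative decoding would be slower than decoding with the target alone.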
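Reported speedups: typically around 2–3x for a 70B-class target such as LLaMA-70B with a 7B draft; up to 4–5x when the output is highly predictable (code, boilerplate). Actual numbers depend on hardware, K, batch size, and acceptance rate.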
Variants:
1. Vanilla speculative decoding: a separately trained draft, or an off-the-shelf smaller model from the same family.
2. Medusa: instead of a separate draft model, attach extra decoding heads to the target; the heads predict the next N tokens in parallel. No extra model needed, but the heads have to be trained.
3. Lookahead decoding: no draft model; candidate n-grams are collected into a pool from parallel Jacobi-style decoding of the target's own output, then verified.
4. Self-speculative: use layer skipping or early exit of the target model itself as the draft.
5. EAGLE / EAGLE-2: train a lightweight draft network on the target's hidden features; reported acceptance rates are higher than Medusa's.
6. Prompt Lookup Decoding: for tasks where the output copies from a long input (code editing, doc Q&A, summarization), guess tokens by matching recent n-grams against the prompt → big wins on code/summarization.
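Of the variants above, Prompt Lookup Decoding is the simplest to sketch because it needs no model for drafting. The function below is an illustration of the idea, not the reference implementation: `prompt_lookup_draft` and its parameters are hypothetical names, and the proposed tokens would still have to be verified by the target model like any other draft.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by copying from an earlier occurrence.

    tokens: the full token-id sequence so far (prompt + generated text).
    Finds the most recent earlier occurrence of the last `ngram_size`
    tokens and returns up to `num_draft` tokens that followed it.
    Returns [] when there is no match, in which case the caller falls
    back to an ordinary decode step.
    """
    if len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan right to left over earlier start positions, skipping the suffix itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            end = start + ngram_size
            return tokens[end:end + num_draft]
    return []


# Toy token ids: the tail [2, 3, 4] also appears near the start of the sequence.
history = [1, 2, 3, 4, 5, 6, 7, 2, 3, 4]
print(prompt_lookup_draft(history, ngram_size=3, num_draft=3))
```

Drafting this way costs almost nothing, so even a modest acceptance rate pays off; when the output rarely repeats the input, the lookup simply finds no match and nothing is lost.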
Challenges:
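- Draft quality matters: bad draft → low acceptance rate → no speedup (it can even be slower than plain decoding). A reasonable starting point: a draft from the same family (same tokenizer), roughly 1/10 the target's size.
- KV cache overhead: both draft and target keep their own KV cache, and entries for rejected tokens have to be rolled back.
- Harder batching: requests in one batch accept different numbers of tokens per round, which complicates scheduling; larger batches also leave less spare compute for verification.
- Not always worth it for small target models (< 7B): draft overhead can exceed the gains.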
Support: vLLM (speculative_config), TensorRT-LLM, SGLang, llama.cpp (--model-draft).
When to use: latency-sensitive workloads (chat, coding assistants) with large targets (≥ 13B). Not worth it for throughput-oriented batch APIs: large batches already keep the GPU compute-bound, so there is little idle compute left for verification.