LLM inference has a quirk: each request produces a variable number of output tokens → different completion times. Batching strategy determines throughput.
Static batching (naïve):
- Group N requests, run in parallel, wait for ALL to finish → return.
- Problem: requests that finish early wait for the longest one → GPU idles.
- Example: a batch of 4 requests with output lengths [50, 100, 200, 500] tokens. Request 1 finishes at step 50, but its slot is held for 450 more steps → that slot sits idle for 90% of the batch's runtime.
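The arithmetic in the example can be checked with a short sketch. The helper name `static_batch_idle` is made up for illustration, and the model is simplified: one decode step per output token, prefill cost ignored.

```python
def static_batch_idle(output_lengths):
    """Idle fraction per request, and overall wasted slot-steps, under static batching."""
    total_steps = max(output_lengths)  # the batch runs until the longest request ends
    per_request_idle = [(total_steps - n) / total_steps for n in output_lengths]
    total_slots = total_steps * len(output_lengths)  # slot-steps the batch occupies
    wasted = (total_slots - sum(output_lengths)) / total_slots
    return per_request_idle, wasted


per_request_idle, wasted = static_batch_idle([50, 100, 200, 500])
print(per_request_idle)  # [0.9, 0.8, 0.6, 0.0]
print(wasted)            # 0.575
```

Under these assumptions, more than half of the slot-steps in this batch produce no tokens.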
Continuous batching (iteration-level scheduling) — key insight: schedule at the iteration/token level, not per request.
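- Each step, the scheduler checks: any request done → evict it from the batch and free its KV cache memory; any request waiting in the queue → admit it into the freed slot.
- The GPU keeps running a (near-)full batch whenever requests are queued and memory allows, instead of carrying idle slots.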
- First introduced in Orca (OSDI 2022), now default in vLLM, TGI, TensorRT-LLM, SGLang, DeepSpeed-FastGen.
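A toy simulation makes the difference concrete. This is a minimal sketch, not how any of the engines above implement scheduling: it assumes one token per request per step, ignores prefill cost and memory limits, and the function names are hypothetical.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Total decode steps when each batch waits for its longest request."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_steps(lengths, batch_size):
    """Total decode steps when finished requests are replaced every iteration."""
    queue = deque(lengths)
    running = []  # remaining output tokens for each request in the batch
    steps = 0
    while queue or running:
        # Admit queued requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One iteration: every running request emits one token.
        running = [r - 1 for r in running]
        steps += 1
        # Evict requests that have finished.
        running = [r for r in running if r > 0]
    return steps


lengths = [50, 100, 200, 500, 50, 100, 200, 500]
print(static_steps(lengths, batch_size=4))      # 1000
print(continuous_steps(lengths, batch_size=4))  # 700
```

With the same eight requests and four slots, iteration-level scheduling finishes in 700 steps instead of 1000 in this simplified model.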
Impact:
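- Throughput up roughly 10–20x over naive static batching in published benchmarks (the vLLM blog reports up to 24x over Hugging Face Transformers; the vLLM paper itself reports 2–4x over FasterTransformer and Orca); 5–15x in practice depending on workload.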
- p50 latency similar to static, p99 much better.
- GPU utilization 80–95% instead of 20–40%.
Continuous batching challenges:
1. Memory management: each request's KV cache has a different size and grows over time (every generated token adds one KV entry). Statically reserving room for the maximum sequence length wastes a lot of VRAM.
Solution: PagedAttention (vLLM) — inspired by OS virtual memory:
- Split KV cache into fixed blocks (e.g. 16 tokens/block).
- Each request has a block table mapping logical → physical blocks.
- Allocate blocks on demand as a request grows → waste is limited to the last, partially filled block.
- Easy block sharing (prefix caching): two requests sharing a prefix → share physical blocks.
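The block-table idea can be sketched in a few lines. This is an illustrative toy, not vLLM's implementation: the class names `BlockAllocator` and `Sequence` are invented here, only full blocks are shared, and sharing is tracked with simple reference counts.

```python
BLOCK_SIZE = 16  # tokens per block, as in the example above


class BlockAllocator:
    """Pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One request; block_table[i] is the physical block of logical block i."""

    def __init__(self, allocator, shared_prefix_blocks=()):
        self.allocator = allocator
        self.block_table = [allocator.share(b) for b in shared_prefix_blocks]
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens == len(self.block_table) * BLOCK_SIZE:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


alloc = BlockAllocator(num_blocks=8)
a = Sequence(alloc)
for _ in range(40):
    a.append_token()  # 40 tokens occupy 3 blocks
b = Sequence(alloc, shared_prefix_blocks=a.block_table[:2])  # reuse 32 prefix tokens
b.append_token()      # first new token of b goes into a fresh block
print(len(alloc.free))  # 4
```

Two sequences holding 73 tokens in total use only four physical blocks here, because the two prefix blocks are stored once.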
Alternative: RadixAttention (SGLang) — stores KV cache as a radix tree for efficient prefix sharing.
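To illustrate the prefix lookup that a tree-structured cache enables, here is a minimal sketch using a plain trie keyed by token id. It is a simplification and not SGLang's data structure: a real radix tree compresses runs of tokens into single edges and also handles eviction, and the `PrefixCache` name is hypothetical.

```python
class PrefixCache:
    """Plain trie over token ids; each node stands for one cached token."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])            # e.g. a shared system prompt
print(cache.match_len([1, 2, 3, 9]))  # 3 -> only the last token needs prefill
```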
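2. Prefill vs decode contention: prefill (processing the input context) is compute-bound, while decode (generating one token at a time) is memory-bandwidth-bound. Mixing them in the same batch can cause latency jitter, because a long prefill stalls the decode steps sharing that iteration.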
Solution: Chunked prefill (vLLM) — split prefill into small chunks interleaved with decode steps → smoother latency.
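One way to picture chunked prefill is as a per-step token budget. The sketch below is an assumption-laden illustration, not vLLM's scheduler: `plan_step` is a hypothetical function, decode requests are served first (one token each), and whatever budget remains is filled with prefill chunks in FIFO order.

```python
def plan_step(decoding, prefilling, token_budget):
    """Plan one iteration under a token budget.

    decoding:   request ids in the decode phase (one token each per step)
    prefilling: dict of request id -> remaining prompt tokens, in FIFO order
    Returns a dict of request id -> number of tokens to process this step.
    """
    plan = {}
    budget = token_budget
    for rid in decoding:
        if budget == 0:
            break
        plan[rid] = 1
        budget -= 1
    for rid, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)  # take only what still fits in this step
        plan[rid] = chunk
        budget -= chunk
    return plan


print(plan_step(["a", "b"], {"c": 1000}, token_budget=512))
# {'a': 1, 'b': 1, 'c': 510}
```

The 1000-token prompt of request `c` is spread over two steps (510 tokens, then 490), and requests `a` and `b` keep emitting a token in each of them.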
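3. Scheduling policy: when several requests are queued, which one is admitted into the batch next? FIFO? Priority? Shortest-job-first (which needs an output-length estimate, since the true length is unknown in advance)?
4. Out of memory: a batch that grows too large → OOM. The scheduler needs to predict VRAM usage and preempt requests when necessary (e.g. swap their KV cache out to CPU memory).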
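Items 3 and 4 interact: the policy decides the order, and predicted KV cache usage decides how many requests fit. The sketch below is one possible formulation, not any engine's actual logic. The `Request` fields, the `admit` function, and the conservative "reserve prompt plus estimated output" rule are all assumptions. The model dimensions passed to `kv_bytes_per_token` (32 layers, 8 KV heads, head dimension 128, fp16) are assumed values for a Llama-3.1-8B-style model.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float
    priority: int           # lower value = more urgent (assumed convention)
    prompt_tokens: int
    est_output_tokens: int  # an estimate; the real output length is unknown


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


ORDER = {
    "fifo": lambda r: r.arrival,
    "priority": lambda r: (r.priority, r.arrival),
    "sjf": lambda r: (r.est_output_tokens, r.arrival),
}


def admit(queue, policy, free_slots, kv_budget_bytes, bytes_per_token):
    """Pick requests in policy order while slots and KV memory remain."""
    admitted = []
    remaining = kv_budget_bytes
    for req in sorted(queue, key=ORDER[policy]):
        if len(admitted) == free_slots:
            break
        need = (req.prompt_tokens + req.est_output_tokens) * bytes_per_token
        if need > remaining:
            continue  # does not fit now; try the next request in order
        admitted.append(req.rid)
        remaining -= need
    return admitted


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB of KV cache per token

queue = [
    Request("a", arrival=0.0, priority=2, prompt_tokens=1000, est_output_tokens=500),
    Request("b", arrival=1.0, priority=0, prompt_tokens=200, est_output_tokens=200),
    Request("c", arrival=2.0, priority=1, prompt_tokens=100, est_output_tokens=50),
]
budget = 1600 * per_token  # illustrative: room for 1600 cached tokens
print(admit(queue, "fifo", 2, budget, per_token))  # ['a']
print(admit(queue, "sjf", 2, budget, per_token))   # ['c', 'b']
```

In this example FIFO admits only the large request `a`, because nothing else fits afterwards, while shortest-job-first fills both free slots. Skipping requests that do not fit (rather than blocking on them) is a design choice that trades strict ordering for occupancy.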
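Practical vLLM config (for reference):
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,    # fraction of GPU memory vLLM may use (weights + KV cache)
    max_num_batched_tokens=8192,   # max tokens processed per scheduler iteration
    max_num_seqs=256,              # max concurrent requests
    enable_prefix_caching=True,    # reuse KV blocks across requests with a shared prefix
    enable_chunked_prefill=True,   # split long prefills into chunks mixed with decode
)
```
Metrics to monitor: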
- Throughput (system output tokens/s).
- TTFT (time to first token, critical for UX).
- ITL (inter-token latency, critical for streaming UX).
- GPU utilization % (should exceed 80% under load).
- Batch occupancy — average requests in a batch.
- Queue depth — high pending → scale out.
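The latency metrics above can be derived from per-token timestamps. The helpers below are a minimal sketch with hypothetical names. They assume timestamps in seconds are recorded when each output token is emitted. In production, an engine's built-in metrics endpoint would normally be the source instead.

```python
def ttft(request_start, token_times):
    """Time to first token: delay between request arrival and first output token."""
    return token_times[0] - request_start


def mean_itl(token_times):
    """Mean inter-token latency: average gap between consecutive output tokens."""
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def output_throughput(num_output_tokens, window_seconds):
    """System-wide output tokens per second over a measurement window."""
    return num_output_tokens / window_seconds


token_times = [10.5, 10.75, 11.0, 11.25]
print(ttft(10.0, token_times))        # 0.5
print(mean_itl(token_times))          # 0.25
print(output_throughput(12000, 60.0)) # 200.0
```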
When NOT to use continuous batching: online use cases with strict latency requirements (trading, realtime voice) that cannot tolerate jitter → prefer serving a single request at a time with tensor parallelism instead of batching. At very low load (< 10 req/s), static batching is also adequate.