How do you choose a GPU for LLM inference? Memory, bandwidth, compute.

Question

Luyện Phỏng Vấn IT · Accepted Answer

Picking the wrong GPU can raise serving cost by 2-5x. Three factors to weigh:

1. VRAM (GPU memory): the first limiting factor. It must hold:
- Model weights: params × bytes_per_param. LLaMA-70B in FP16 = 140GB; in INT4 = 35GB.
- KV cache: 2 × n_layers × n_kv_heads × head_dim × seq_len × batch × bytes. For LLaMA-70B (80 layers, 8 KV heads, head_dim 128) with an FP16 cache at 8K context: ~2.7GB at batch 1; ~86GB at batch 32. An 8-bit KV cache halves these figures.
- Activation buffers, CUDA graphs, and framework overhead (~10-20%).
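The VRAM budget above can be sketched as a small calculator. This is a rough estimate, not a measurement: the LLaMA-70B shape (80 layers, 8 KV heads, head_dim 128) and the 15% overhead default are assumptions chosen for illustration, and the function names are hypothetical.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float = 2) -> float:
    """KV cache in GB. The leading 2 counts one K and one V tensor per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem) / 1e9


def total_vram_gb(weights: float, kv: float, overhead: float = 0.15) -> float:
    """Weights plus KV cache, inflated by an assumed overhead fraction."""
    return (weights + kv) * (1 + overhead)


# Assumed LLaMA-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
w_fp16 = weights_gb(70e9, 2)                  # FP16 weights
w_int4 = weights_gb(70e9, 0.5)                # INT4 weights
kv_b1 = kv_cache_gb(80, 8, 128, 8192, 1)      # 8K context, batch 1, FP16 cache
kv_b32 = kv_cache_gb(80, 8, 128, 8192, 32)    # 8K context, batch 32, FP16 cache

print(f"weights FP16={w_fp16:.0f}GB INT4={w_int4:.0f}GB")
print(f"KV cache batch1={kv_b1:.2f}GB batch32={kv_b32:.1f}GB")
print(f"INT4 + batch-1 KV with overhead: {total_vram_gb(w_int4, kv_b1):.1f}GB")
```

Under these assumptions the KV cache comes to about 2.7GB per 8K-token sequence in FP16. Passing bytes_per_elem=1 models an 8-bit cache and halves that.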

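2. Memory bandwidth: this, not FLOPs, determines decode speed, because generating each token requires reading all the weights from VRAM.
- Approximate formula: tokens/s ≤ bandwidth / model_size.
- LLaMA-70B FP16 (140GB) at H100 bandwidth (3.35 TB/s) → ~24 tok/s theoretical max per stream; in practice ~15-20 tok/s. (The weights exceed a single 80GB card, so the number illustrates the bandwidth bound rather than a deployable single-GPU setup.)
- At RTX 4090 bandwidth (1 TB/s) → ~7 tok/s. For decode, this ~3.4x bandwidth gap matters more than the gap in FLOPs.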
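The bandwidth bound is a one-line division, so it is easy to try other GPUs or quantization levels. A minimal sketch, with a hypothetical function name; it ignores KV cache reads and kernel overhead, so real throughput will be lower.

```python
def max_decode_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed: every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


h100_fp16 = max_decode_tok_per_s(3350, 140)     # H100 SXM bandwidth, 70B FP16
rtx4090_fp16 = max_decode_tok_per_s(1000, 140)  # RTX 4090 bandwidth, 70B FP16
h100_int4 = max_decode_tok_per_s(3350, 35)      # same H100, 70B INT4 weights

print(f"H100 FP16: {h100_fp16:.1f} tok/s")
print(f"RTX 4090 FP16: {rtx4090_fp16:.1f} tok/s")
print(f"H100 INT4: {h100_int4:.1f} tok/s")
```

Shrinking the weights by 4x raises the bound by 4x, which is why quantization helps decode speed and not only capacity.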

3. Compute (FLOPs + Tensor Cores): matters for prefill (long input context) and for training. Token-by-token decode is bandwidth-bound, not FLOPs-bound.
- H100 adds FP8 Tensor Cores (FP4 arrives with Blackwell) → prefill 2-4x faster than on A100.
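To see why prefill is compute-bound while decode is bandwidth-bound, compare the time spent on arithmetic with the time spent reading the weights. This sketch assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, and ignores attention FLOPs and KV cache traffic. The function names are hypothetical.

```python
def step_times_s(n_params: float, n_tokens: int, bytes_per_param: float,
                 peak_tflops: float, bandwidth_tb_s: float) -> tuple[float, float]:
    """Return (compute_time, weight_read_time) in seconds for one forward pass."""
    flops = 2 * n_params * n_tokens                  # assumed ~2 FLOPs/param/token
    compute_s = flops / (peak_tflops * 1e12)
    memory_s = n_params * bytes_per_param / (bandwidth_tb_s * 1e12)
    return compute_s, memory_s


def bound(compute_s: float, memory_s: float) -> str:
    """Name the limiting resource: whichever takes longer."""
    return "compute" if compute_s > memory_s else "bandwidth"


# 70B FP16 with H100 SXM peak numbers: 989 TFLOPS, 3.35 TB/s.
prefill = step_times_s(70e9, 8192, 2, 989, 3.35)   # 8K-token prompt in one pass
decode = step_times_s(70e9, 1, 2, 989, 3.35)       # one new token

print("prefill:", bound(*prefill), prefill)   # roughly 1.16 s compute vs 0.04 s memory
print("decode: ", bound(*decode), decode)     # roughly 0.14 ms compute vs 42 ms memory
```

Both passes read the same weights once, but prefill does 8192 tokens of arithmetic per read while decode does one.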

GPU options 2025:

GPU	VRAM	Bandwidth	FP16 TFLOPS (dense)	Price/hr (USD)
H100 SXM	80GB	3.35 TB/s	989	$2-5
H100 PCIe	80GB	2 TB/s	756	$1.5-3
H200	141GB	4.8 TB/s	989	$3-6
B200 (Blackwell)	192GB	8 TB/s	2250	$5-10
A100 80GB	80GB	2 TB/s	312	$1-2.5
L40S	48GB	864 GB/s	362	$0.8-1.5
RTX 4090	24GB	1 TB/s	165	$0.3-0.6 (consumer cloud)
RTX 3090	24GB	936 GB/s	142	$0.2-0.5
Apple M4 Max / M3 Ultra	unified	400-800 GB/s	-	-

Decision framework:

Model < 7B, budget-constrained: RTX 4090/3090, L40S. Serves 1-4 concurrent users.
7B-13B production: A100 40/80GB or L40S 48GB. Enough to serve 10-50 users.
34B-70B production: 1-2x H100/A100 80GB (70B in FP16 needs model parallelism across 2 GPUs).
70B quantized to INT4: 1x A100 80GB or 2x RTX 4090 (48GB total).
100B+ (Llama 3.1 405B): 8x H100 node with tensor parallelism; 405B needs FP8 weights to fit in the node's 640GB.
Fine-tuning / training: H100/H200 SXM with NVLink (cross-GPU bandwidth).
Edge / on-device: Apple Silicon (M4 Max, M3 Ultra), Qualcomm NPU, Jetson.
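The capacity side of these rules can be checked with a rough GPU-count estimate. This is a sketch only: the 10% overhead default is an assumption at the low end of the 10-20% range, KV cache is left at zero unless supplied, and min_gpus is a hypothetical helper. In practice the tensor-parallel degree is usually rounded up to a count that divides the attention heads (2, 4, 8).

```python
import math


def min_gpus(n_params: float, bytes_per_param: float, vram_gb: float,
             kv_cache_gb: float = 0.0, overhead: float = 0.10) -> int:
    """Smallest GPU count whose combined VRAM holds weights + KV cache + overhead."""
    need_gb = (n_params * bytes_per_param / 1e9 + kv_cache_gb) * (1 + overhead)
    return math.ceil(need_gb / vram_gb)


cases = {
    "70B FP16 on 80GB":  min_gpus(70e9, 2, 80),
    "70B INT4 on 80GB":  min_gpus(70e9, 0.5, 80),
    "70B INT4 on 24GB":  min_gpus(70e9, 0.5, 24),
    "405B FP8 on 80GB":  min_gpus(405e9, 1, 80),
    "405B FP16 on 80GB": min_gpus(405e9, 2, 80),
}
for name, count in cases.items():
    print(f"{name}: {count} GPU(s)")
```

The 405B FP16 case needs more than the 8 GPUs of a single node, which is why FP8 weights are the usual choice there.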

Getting the most out of the GPU:
- Continuous batching (vLLM, TGI, SGLang): requests are batched dynamically → up to 10-20x the throughput of static batching.
- Quantization (AWQ INT4, GPTQ) → fits a larger model in VRAM and reduces bandwidth pressure.
- Speculative decoding → lower token-generation latency.
- Tensor / pipeline parallelism when the model does not fit on a single GPU.
- Prefix caching: share the KV cache across requests with the same prefix.
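Batching pays off because one pass over the weights can serve every sequence in the batch. The sketch below extends the bandwidth bound to a batch: each decode step reads the weights once plus each sequence's KV cache, and emits one token per sequence. It is an idealized model that ignores compute limits and scheduling, not a description of how any particular framework works. The 0.67GB per-sequence KV figure assumes a 70B model with 8 KV heads at 2K context in FP16.

```python
def batched_decode_tok_per_s(bandwidth_gb_s: float, weights_gb: float,
                             kv_gb_per_seq: float, batch: int) -> float:
    """Aggregate tokens/s bound when one weight read serves the whole batch."""
    bytes_per_step_gb = weights_gb + batch * kv_gb_per_seq
    steps_per_s = bandwidth_gb_s / bytes_per_step_gb
    return batch * steps_per_s


# Assumed setup: 70B INT4 weights (35GB), 2000 GB/s, ~0.67GB KV cache per sequence.
single = batched_decode_tok_per_s(2000, 35, 0.67, batch=1)
batched = batched_decode_tok_per_s(2000, 35, 0.67, batch=32)

print(f"batch 1:  {single:.0f} tok/s")
print(f"batch 32: {batched:.0f} tok/s ({batched / single:.1f}x)")
```

The gain flattens once KV cache reads dominate the weights, so longer contexts benefit less from large batches. Continuous batching helps by keeping the batch full as requests finish at different times.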

Serving frameworks: vLLM (the default for SOTA throughput), SGLang (fast structured output), TensorRT-LLM (the most NVIDIA-optimized), TGI (HuggingFace), llama.cpp (CPU/Mac/edge), Ollama (local development).