LLM serving has specialized metrics beyond those of traditional web APIs. Without observability, you cannot tell where the bottleneck is.
Latency metrics (for streaming UX):
- TTFT (Time To First Token) — time from request receipt to the first token returned. The most important UX metric: the app feels responsive to users when TTFT < 1s. Affected by: prompt length (prefill time), queue depth, cold start.
- ITL / TPOT (Inter-Token Latency / Time Per Output Token) — time between tokens after TTFT. Drives perceived speed when streaming. Affected by: memory bandwidth, batch contention.
- Total latency — end-to-end time to complete a request, roughly TTFT + (output_tokens × ITL).
- p50, p95, p99 — percentiles, not just the mean. Tail latency matters (1% of users having a bad experience = bad reviews).
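The latency definitions above can be sketched from per-token timestamps. This is a minimal illustration, not a reference implementation: `RequestTiming` and the helper names are hypothetical, and the percentile uses the simple nearest-rank method. Note that a request with n output tokens has n - 1 inter-token gaps, so the formula above slightly overestimates for short outputs.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) recorded for one streamed request (hypothetical type)."""
    received_at: float          # when the server accepted the request
    token_times: list[float]    # arrival time of each output token, in order


def ttft(t: RequestTiming) -> float:
    """Time To First Token: request receipt to first output token."""
    return t.token_times[0] - t.received_at


def mean_itl(t: RequestTiming) -> float:
    """Mean Inter-Token Latency over the gaps that follow the first token."""
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def total_latency(t: RequestTiming) -> float:
    """End-to-end latency: request receipt to last output token."""
    return t.token_times[-1] - t.received_at


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `ttft(...)` for every request in a time window and passing the list to `percentile(..., 95)` gives the p95 TTFT for that window.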
Throughput metrics:
- Requests/second (RPS) — completed requests per second.
- Output tokens/second — total system generation rate. Main metric for capacity planning.
- Input tokens/second — prefill capacity.
- Concurrent requests — in-flight count.
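As a rough sketch, the throughput metrics above can be derived from per-request records over a fixed time window. The `RequestRecord` type and function names are assumptions made for illustration; a real serving stack would usually export these as counters instead of recomputing them from logs.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One request's lifetime and token counts (hypothetical type)."""
    started_at: float    # seconds
    finished_at: float   # seconds
    input_tokens: int
    output_tokens: int


def throughput(records: list[RequestRecord], window_start: float, window_end: float) -> dict:
    """RPS and token rates for requests that finished inside [window_start, window_end)."""
    done = [r for r in records if window_start <= r.finished_at < window_end]
    seconds = window_end - window_start
    return {
        "rps": len(done) / seconds,
        "output_tokens_per_s": sum(r.output_tokens for r in done) / seconds,
        "input_tokens_per_s": sum(r.input_tokens for r in done) / seconds,
    }


def in_flight(records: list[RequestRecord], at: float) -> int:
    """Number of requests that had started but not yet finished at time `at`."""
    return sum(1 for r in records if r.started_at <= at < r.finished_at)
```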
GPU metrics:
- GPU utilization % — should exceed 80% under load. Low → idle capacity, or non-GPU bottleneck (CPU, network).
- GPU memory utilization — target 85–95% (vLLM gpu_memory_utilization). Low → wasted VRAM; too high → OOM risk.
- SM (Streaming Multiprocessor) utilization — more granular than GPU util. Provided by NVIDIA DCGM.
- Memory bandwidth utilization — decode-phase bottleneck. Ideally > 70%.
- Temperature, power draw — hardware health.
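One way to turn the GPU targets above into a quick health check is sketched below. The thresholds are the ones quoted in this list and are meant for a GPU that is actually under load; `GpuSample` and `diagnose` are hypothetical names, and the utilization values are assumed to arrive as fractions between 0 and 1 from whatever collector you use.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading per GPU, as fractions in [0, 1] (hypothetical type)."""
    gpu_util: float      # overall GPU utilization
    mem_util: float      # fraction of VRAM in use
    mem_bw_util: float   # memory bandwidth utilization


def diagnose(s: GpuSample) -> list[str]:
    """Compare a sample taken under load against the targets from the list above."""
    findings = []
    if s.gpu_util < 0.80:
        findings.append("GPU util below 80%: spare capacity or a non-GPU bottleneck (CPU, network)")
    if s.mem_util < 0.85:
        findings.append("memory util below 85%: VRAM is being wasted")
    elif s.mem_util > 0.95:
        findings.append("memory util above 95%: OOM risk")
    if s.mem_bw_util < 0.70:
        findings.append("memory bandwidth util below 70%: decode is not using the available bandwidth")
    return findings
```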
Batch metrics:
- Batch occupancy / size — average requests per batch; small batches → poor throughput.
- Prefill vs decode ratio — scheduler balance.
- Preemption rate — if the scheduler has to preempt requests, there is not enough memory and you need to scale.
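The batch metrics above reduce to simple ratios once the scheduler's counters are available. The sketch below assumes you can sample the batch size at each scheduler step and count prefill tokens, decode tokens, and preemptions per window; all function names are illustrative.

```python
def batch_occupancy(batch_sizes: list[int], max_batch_size: int) -> float:
    """Mean fraction of batch slots filled across the sampled scheduler steps."""
    if not batch_sizes:
        return 0.0
    return sum(batch_sizes) / (len(batch_sizes) * max_batch_size)


def prefill_share(prefill_tokens: int, decode_tokens: int) -> float:
    """Fraction of processed tokens that were prefill (prompt) tokens."""
    total = prefill_tokens + decode_tokens
    return prefill_tokens / total if total else 0.0


def preemption_rate(preemptions: int, scheduled_requests: int) -> float:
    """Preemptions per scheduled request in the same window."""
    return preemptions / scheduled_requests if scheduled_requests else 0.0
```

A low `batch_occupancy` next to a low GPU utilization points at small batches as the cause, while a rising `preemption_rate` is the signal to add memory or replicas.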
Request-level metrics:
- Queue depth — pending requests. High → scale out or shed load.
- Queue wait time — how long before entering a batch.
- Prefill vs decode time breakdown.
- Retry / error rate — OOM, timeouts, invalid model output.
Application metrics:
- Cost/request, cost/user, cost/feature — business level.
- Cache hit rate (prompt cache, semantic cache).
- Token usage per feature → optimize expensive prompts.
- Guardrail trigger rate.
- Quality metrics (if continuous eval) — faithfulness, CSAT.
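Cost per request and per feature follow directly from logged token usage. The sketch below uses made-up prices (the two `PRICE_PER_M_*` constants are placeholders, not real rates) and a hypothetical `(feature, input_tokens, output_tokens)` record shape.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; substitute your own rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 3.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD from its token counts."""
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def cost_by_feature(usage: list[tuple[str, int, int]]) -> dict[str, float]:
    """Sum request costs per feature from (feature, input_tokens, output_tokens) records."""
    totals: dict[str, float] = defaultdict(float)
    for feature, input_tokens, output_tokens in usage:
        totals[feature] += request_cost(input_tokens, output_tokens)
    return dict(totals)
```

Sorting the result of `cost_by_feature` by value shows which feature's prompts are worth optimizing first.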
Tooling stack:
1. GPU/infra layer:
- NVIDIA DCGM Exporter + Prometheus + Grafana — GPU metrics.
- nvidia-smi dmon — realtime CLI.
- Kubernetes HPA scaling on GPU util.
2. Serving layer (vLLM, TGI, Triton):
- Expose Prometheus metrics endpoints (vLLM /metrics).
- Metrics: vllm:num_requests_running, vllm:time_to_first_token_seconds, vllm:gpu_cache_usage_perc, vllm:request_success_total.
3. Application layer (LLM tracing):
- LangSmith, Langfuse, Arize Phoenix, Helicone — trace the full request lifecycle, log prompt/response/tokens/cost.
- OpenLLMetry — OpenTelemetry extension for LLMs.
- Datadog LLM Observability, New Relic AI Monitoring.
4. Alerting:
- p95 TTFT > threshold.
- Queue depth > N.
- Error rate > X%.
- Hourly cost > budget.
- GPU OOM events.
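To make the serving and alerting layers concrete, here is a minimal sketch that parses Prometheus text-format output and checks two of the alert conditions above. Several things are assumptions: `SAMPLE_METRICS` is invented sample data, not real vLLM output; `vllm:num_requests_waiting` is assumed to be the queue-depth gauge; the parser handles only simple `name{labels} value` lines; and the p95 is reported as the upper bound of the histogram bucket that contains it, for a single label set. In production these rules would normally live in Prometheus alerting rules rather than in application code.

```python
import re

# Invented sample in Prometheus text format, for illustration only.
SAMPLE_METRICS = """
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 8
vllm:num_requests_waiting{model_name="demo"} 25
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="demo"} 60
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="demo"} 90
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="demo"} 99
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="demo"} 100
vllm:time_to_first_token_seconds_count{model_name="demo"} 100
"""

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse simple samples into (name, labels, value); comments and blanks are skipped."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


def quantile_upper_bound(samples, metric: str, q: float):
    """Upper bound of the first cumulative bucket that covers quantile q, or None."""
    buckets = sorted(
        (float(labels["le"]), value)
        for name, labels, value in samples
        if name == metric + "_bucket"
    )
    if not buckets:
        return None
    total = buckets[-1][1]
    for le, cumulative in buckets:
        if cumulative >= q * total:
            return le
    return float("inf")


def evaluate_alerts(samples, ttft_p95_limit_s: float, max_queue_depth: int) -> list[str]:
    """Return a message for each alert condition that is currently violated."""
    gauges = {name: value for name, _, value in samples}
    alerts = []
    p95 = quantile_upper_bound(samples, "vllm:time_to_first_token_seconds", 0.95)
    if p95 is not None and p95 > ttft_p95_limit_s:
        alerts.append(f"TTFT p95 <= {p95}s exceeds limit {ttft_p95_limit_s}s")
    waiting = gauges.get("vllm:num_requests_waiting", 0)
    if waiting > max_queue_depth:
        alerts.append(f"queue depth {waiting:.0f} exceeds {max_queue_depth}")
    return alerts
```

Calling `evaluate_alerts(parse_metrics(SAMPLE_METRICS), ttft_p95_limit_s=1.0, max_queue_depth=20)` on the sample above flags both the TTFT and the queue-depth condition.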
Dashboards to build:
- Overview: RPS, latency p50/p95/p99, error rate, hourly cost.
- Per-model/endpoint breakdown.
- GPU fleet status.
- Token usage trends.
- Top-expensive queries (outlier detection).
Common pitfalls:
- Logging only total latency, without TTFT → can't tell whether prefill or decode is slow.
- Monitoring only GPU util, without batch occupancy → can't tell whether underutilization comes from small batches or a light workload.
- Not logging per-request token usage → no accurate costing.
- No correlated request IDs across app → LLM gateway → serving → painful debugging.
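For the last pitfall, one possible approach is to carry a request ID in a context variable, stamp it on every log record, and forward it to the next hop. This is a sketch using only the standard library; the `X-Request-ID` header name is a common convention assumed here, and `start_request` / `outgoing_headers` are hypothetical helpers you would call from your own middleware.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means no request.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def start_request(incoming_headers: dict) -> str:
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    rid = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid


def outgoing_headers() -> dict:
    """Headers to add when calling the next hop (LLM gateway, serving layer)."""
    return {"X-Request-ID": request_id_var.get()}
```

With the filter attached to a handler and `%(request_id)s` in the log format, the same ID appears in app, gateway, and serving logs, provided each hop reuses the incoming header instead of generating its own.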