Self-hosted LLM vs API-based inference: when to use each, what it costs, and what the challenges are?

Question

Luyện Phỏng Vấn IT · Accepted Answer

This is a strategic decision with a large impact on the roadmap, the team, and unit economics.

When you SHOULD use an API (OpenAI, Anthropic, Google, Bedrock, Vertex):

1. Low or uneven volume: < 10M tokens/day. The API is cheaper and needs no GPU capacity planning.
2. You need a frontier model: GPT-4, Claude 3.5 Sonnet, o1, Gemini Ultra have no open-weight equivalent.
3. Fast go-to-market: you have no ML ops team.
4. No critical data sensitivity: a zero data retention (ZDR) agreement is enough to satisfy compliance.
5. Specialized multi-modal: Sora, Veo, and cutting-edge Gemini vision are only available through an API.

When you SHOULD self-host:

1. Very high volume: > 100M-1B tokens/day. Break-even against the API depends on the model; savings usually start at 100M+.
2. Data sovereignty: regulated industries (healthcare under HIPAA, finance, government) require that data never leaves your environment.
3. Ultra-low latency: real-time use cases that need inference co-located with the application (trading, robotics).
4. Custom model: heavy fine-tuning, with weights that cannot be shared with a third party.
5. Predictable cost: spend is known up front instead of varying per token.
6. Air-gapped: on-prem, no internet access.
7. Provider independence: you want to avoid lock-in.

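Break-even analysis (rough numbers):
- API: GPT-4o at $2.5/$10 per 1M input/output tokens (2024 list price; prices change with newer models such as GPT-5). At 100M tokens/day with a 50/50 mix, that is 50M × $2.5/M + 50M × $10/M = $625/day, or about $19k/month. A bill near $500k/month corresponds to roughly 2.7B tokens/day at these prices.
- Self-hosted Llama 3.3 70B (or Llama 4 as the upgrade path) on 2x H100 80GB: GPU rental ~$4/hr × 2 × 720h = $5,760/month per instance, serving ~50 RPS (this varies heavily with prompt and output length). A fleet of 10-20 instances costs ~$60-120k/month, plus team and overhead.
- Roughly, self-hosting a 70B model breaks even somewhere around 50M-200M tokens/day, provided the fleet is sized to the load: at the prices above, the ~$19k/month API bill for 100M tokens/day covers about three such instances before team cost. Below break-even the API is cheaper; above it, self-hosting wins if it is operated well.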
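
The arithmetic behind these estimates can be reproduced with a small calculator. This is a minimal sketch, not a pricing tool: the function names are made up for illustration, the prices are the GPT-4o list prices and GPU rental rate quoted above, and team cost is left as a separate `fixed_overhead_usd` input.

```python
def api_monthly_cost(tokens_per_day, input_share, price_in_per_m, price_out_per_m, days=30):
    """Monthly API bill in USD. Prices are USD per 1M tokens."""
    input_m = tokens_per_day * input_share / 1e6          # millions of input tokens per day
    output_m = tokens_per_day * (1 - input_share) / 1e6   # millions of output tokens per day
    return (input_m * price_in_per_m + output_m * price_out_per_m) * days


def self_host_monthly_cost(instances, gpus_per_instance, gpu_hourly_usd,
                           fixed_overhead_usd=0.0, hours=720):
    """Monthly GPU rental in USD plus any fixed overhead (team, tooling)."""
    return instances * gpus_per_instance * gpu_hourly_usd * hours + fixed_overhead_usd


def break_even_tokens_per_day(self_host_monthly_usd, input_share,
                              price_in_per_m, price_out_per_m, days=30):
    """Daily token volume at which the API bill equals the self-host bill."""
    blended_per_m = input_share * price_in_per_m + (1 - input_share) * price_out_per_m
    return self_host_monthly_usd / (blended_per_m * days) * 1e6


api_bill = api_monthly_cost(100e6, 0.5, 2.5, 10.0)        # 100M tokens/day, 50/50 mix
one_instance = self_host_monthly_cost(1, 2, 4.0)          # 2x H100 at $4/hr
fleet_of_ten = self_host_monthly_cost(10, 2, 4.0)
print(f"API: ${api_bill:,.0f}/month")
print(f"One instance: ${one_instance:,.0f}/month")
print(f"Break-even for 10 instances: "
      f"{break_even_tokens_per_day(fleet_of_ten, 0.5, 2.5, 10.0) / 1e6:,.0f}M tokens/day")
```

The break-even volume moves linearly with fleet size, so the number of instances you actually need (measured, not assumed) dominates the result.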

Typical self-hosting stack (2025):

Model: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek-V3, Mixtral 8x22B. All are high-quality open-weight models.

Serving engine:
- vLLM: the default choice, a good balance.
- TensorRT-LLM: fastest on NVIDIA hardware, but the build is complex.
- SGLang: fast structured output.
- Text Generation Inference (TGI): from HuggingFace.

Orchestration:
- Kubernetes with a GPU node pool.
- KServe, Ray Serve, Seldon, BentoML: model serving frameworks.
- Autoscaling on queue depth or GPU utilization.
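
Scaling on queue depth can be sketched as a proportional rule. The snippet below is an illustration only: it follows the same idea as the Kubernetes HPA formula (desired = ceil(current × metric / target)), and the function name, the target of 10 queued requests per replica, and the replica bounds are assumptions rather than recommended values.

```python
import math


def desired_replicas(total_queue_depth, target_queue_per_replica,
                     min_replicas=1, max_replicas=20):
    """Replica count that brings the average queue per replica down to the target.

    Equivalent to the HPA rule ceil(current * (avg_queue / target)), because
    current * avg_queue is just the total queue depth.
    """
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(total_queue_depth / target_queue_per_replica)
    # Clamp so an idle system keeps a warm replica and a spike cannot
    # request more GPUs than the node pool can provide.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(120, 10))    # 12
print(desired_replicas(0, 10))      # 1
print(desired_replicas(1000, 10))   # 20
```

GPU nodes take minutes to provision and load weights, so in practice a rule like this is usually paired with a cooldown to avoid flapping.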

Gateway:
- LiteLLM: a unified API compatible with the OpenAI format; routes between providers and self-hosted models.
- Portkey, Kong AI Gateway: commercial AI gateways.
- Responsibilities: rate limiting, retries, fallback, cost tracking.
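
The retry-and-fallback behavior a gateway provides can be illustrated without any particular product. This is a hand-rolled sketch, not LiteLLM's or Portkey's API: `call_with_fallback`, `AllProvidersFailed`, and the two stub backends are hypothetical names, and real backends would be HTTP clients.

```python
import time
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its retries."""


def call_with_fallback(prompt: str,
                       providers: Sequence[Tuple[str, Callable[[str], str]]],
                       retries_per_provider: int = 2,
                       backoff_seconds: float = 0.0) -> Tuple[str, str]:
    """Try providers in order; retry each with exponential backoff before moving on."""
    errors = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real gateway would only retry transient errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                time.sleep(backoff_seconds * (2 ** attempt))
    raise AllProvidersFailed("; ".join(errors))


# Stub backends standing in for a self-hosted endpoint and a hosted API.
def flaky_self_host(prompt: str) -> str:
    raise TimeoutError("GPU node unavailable")


def stub_api(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = call_with_fallback(
    "hi", [("self-host", flaky_self_host), ("api", stub_api)]
)
print(used, answer)  # api echo: hi
```

Returning the provider name alongside the answer is what makes per-provider cost tracking possible downstream.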

Compute sourcing:
- Cloud managed (AWS Bedrock hosting open models, GCP Vertex, Azure ML): convenient but more expensive than raw GPU.
- Raw GPU cloud (CoreWeave, Lambda Labs, RunPod, Together AI, Paperspace): cheap and flexible.
- Hyperscaler raw GPU (AWS P5/P4, GCP A3/G2, Azure): usually bought under an enterprise agreement.
- On-premise: large capex, only rational at large scale.

Operational challenges of self-hosting:

1. Team: you need ML ops engineers, at a cost of 1-3 FTE.
2. Capacity planning: hard because the workload is uneven; over-provisioning wastes money, under-provisioning causes latency and errors.
3. Hardware reliability: GPU failures, driver issues, CUDA version conflicts.
4. Model updates: self-hosting means you test and deploy new models yourself (Llama 3.3 → Llama 4).
5. Observability: you have to build it yourself (API providers include it).
6. Multi-region and DR: replication, failover.
7. Security: GPU driver CVEs, weight file integrity.
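
For the weight file integrity point, one common approach is to compare a SHA-256 digest against a value recorded when the weights were first vetted. The sketch below assumes that workflow; the function names are illustrative, and where the expected digest is stored (a signed manifest, a config repo) is left open.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so multi-GB weight shards never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_sha256):
    """True only if the file on disk matches the recorded digest (case-insensitive hex)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A serving pod could call `verify_weights` on each shard before loading it and refuse to start on a mismatch.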

Hybrid strategy (the common case in practice):
- A router sends the 80% of simple queries to the self-hosted model and the 20% of hard queries that need a frontier model to the API.
- API as primary, with self-hosting as the fallback during outages.
- Fine-tuned self-hosted model for the core flow, API for edge cases.
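
The router in the first pattern can start as a plain heuristic. The following is a deliberately simple sketch: the length threshold and keyword list are invented placeholders, and a production router would more likely use a trained classifier or confidence signals from the small model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical thresholds; tune them against your own traffic.
    max_simple_prompt_chars: int = 2000
    hard_keywords: tuple = ("prove", "multi-step", "legal", "architecture review")


def choose_backend(prompt, policy=None, self_host_healthy=True):
    """Return "self-host" for queries that look simple, "api" otherwise."""
    policy = policy or RoutingPolicy()
    if not self_host_healthy:
        return "api"  # fail over to the hosted API when the GPU fleet is down
    text = prompt.lower()
    looks_hard = (
        len(prompt) > policy.max_simple_prompt_chars
        or any(keyword in text for keyword in policy.hard_keywords)
    )
    return "api" if looks_hard else "self-host"


print(choose_backend("Summarize this support ticket"))              # self-host
print(choose_backend("Prove that the scheduler is deadlock-free"))  # api
```

Logging each routing decision next to the eventual answer quality is what lets you check whether the real split is anywhere near 80/20.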

Evaluation tool: compute Total Cost of Ownership (TCO) over 2-3 years, not just variable cost. Include team, infrastructure, observability, incidents, and the model upgrade cycle.
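
A TCO estimate along these lines is just a sum over the listed cost categories. The sketch below uses an assumed function name and made-up inputs: only the $60k/month infrastructure figure and the 2 FTE headcount fall within the ranges mentioned earlier, and every other number is a placeholder to replace with your own.

```python
def self_host_tco(years, monthly_infra_usd, fte_count, fte_annual_usd,
                  annual_observability_usd=0, annual_incident_usd=0,
                  model_upgrades_per_year=0, cost_per_upgrade_usd=0):
    """Total cost of ownership in USD over the given number of years."""
    per_year = (
        monthly_infra_usd * 12                            # GPU rental and related infra
        + fte_count * fte_annual_usd                      # ML ops team
        + annual_observability_usd                        # metrics, tracing, eval tooling
        + annual_incident_usd                             # on-call and outage cost
        + model_upgrades_per_year * cost_per_upgrade_usd  # test and deploy new models
    )
    return years * per_year


three_year_tco = self_host_tco(
    years=3,
    monthly_infra_usd=60_000,
    fte_count=2,
    fte_annual_usd=200_000,          # placeholder fully loaded cost per engineer
    annual_observability_usd=30_000,
    annual_incident_usd=20_000,
    model_upgrades_per_year=2,
    cost_per_upgrade_usd=25_000,
)
print(f"${three_year_tco:,}")  # $3,660,000
```

In this example the fixed costs add about 70% on top of the GPU bill, which is why comparing per-token prices alone understates the cost of self-hosting.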