Quyết định chiến lược có tác động lớn đến roadmap, team, và unit economics.
Khi nào NÊN dùng API (OpenAI, Anthropic, Google, Bedrock, Vertex):
1. Volume thấp / không đều — < 10M token/day. API rẻ hơn và không cần GPU capacity planning.
2. Need frontier model — GPT-4, Claude 3.5 Opus, o1, Gemini Ultra — không có open weights equivalent.
3. Nhanh go-to-market — không có team ML ops.
4. Không có data sensitivity critical — có ZDR agreement thỏa mãn compliance.
5. Multi-modal đặc biệt — Sora, Veo, Gemini vision cutting-edge chỉ có API.
Khi nào NÊN self-host:
1. Volume rất cao — > 100M-1B token/day. Break-even vs API tùy model; thường 100M+ bắt đầu tiết kiệm.
2. Data sovereignty — regulated industry (healthcare HIPAA, finance, government) yêu cầu data không ra ngoài.
3. Ultra-low latency — real-time use case cần co-locate với app (trading, robotics).
4. Custom model — fine-tune nặng, weights không share được.
5. Predictable cost — biết trước spend thay vì per-token.
6. Air-gapped — on-prem, không internet.
7. Provider independence — không muốn lock-in.
Break-even analysis (số thô):
- API: GPT-4o $2.5/$10 per 1M token input/output (as of 2024, giá thay đổi với GPT-5). Ở 100M token/day mix 50/50 → ~$500k/tháng.
- Self-host Llama 3.3 70B (hoặc Llama 4 — upgrade path) trên 2x H100 80GB: GPU rental ~$4/hr × 2 × 720h = $5,760/tháng per instance. Serve ~50 RPS. Cần 10-20 instance → ~$60-120k/tháng + team + overhead.
- Đại khái self-host break-even quanh 50M-200M token/day cho model 70B. Dưới mức đó API rẻ hơn; trên mức đó self-host thắng nếu vận hành tốt.
Stack self-host typical (2025):
Model: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek-V3, Mixtral 8x22B — open weights chất lượng cao.
Serving engine:
- vLLM — default, balance tốt.
- TensorRT-LLM — fastest on NVIDIA nhưng build phức tạp.
- SGLang — structured output fast.
- Text Generation Inference (TGI) — HuggingFace.
Orchestration:
- Kubernetes với GPU node pool.
- KServe, Ray Serve, Seldon, BentoML — model serving framework.
- Autoscaling theo queue depth / GPU util.
Gateway:
- LiteLLM — unified API compatible với OpenAI format; route giữa providers và self-host.
- Portkey, Kong AI Gateway — commercial AI gateway.
- Handle: rate limit, retry, fallback, cost tracking.
Compute sourcing:
- Cloud managed: AWS Bedrock (host open model), GCP Vertex, Azure ML — tiện nhưng đắt hơn raw GPU.
- Raw GPU cloud: CoreWeave, Lambda Labs, RunPod, Together AI, Paperspace — cheap, flexible.
- Hyperscaler raw GPU: AWS (P5, P4), GCP (A3, G2), Azure — enterprise agreement.
- On-premise — capex lớn, chỉ rational ở quy mô lớn.
Thách thức vận hành self-host:
1. Team — cần ML ops engineer, cost 1-3 FTE.
2. Capacity planning — khó vì workload không đều; over-provision lãng phí, under → latency/error.
3. Hardware reliability — GPU fail, driver issues, CUDA version conflict.
4. Model updates — self-host nghĩa là tự test/deploy model mới (Llama 3.3 → Llama 4).
5. Observability — phải tự build (API providers có sẵn).
6. Multi-region, DR — replication, failover.
7. Security — GPU driver CVE, weight file integrity.
Hybrid strategy (thực tế phổ biến):
- Router chọn self-host (cho 80% query đơn giản) hoặc API (cho 20% query khó cần frontier model).
- API làm primary + self-host làm fallback cho outage.
- Fine-tuned self-host cho core flow + API cho edge case.
Công cụ đánh giá: tính Total Cost of Ownership (TCO) 2-3 năm, không chỉ variable cost. Include: team, infra, observability, incident, model upgrade cycle.
A strategic decision with big impact on roadmap, team, and unit economics.
Use API (OpenAI, Anthropic, Google, Bedrock, Vertex) when:
1. Low / uneven volume — < 10M tokens/day. APIs are cheaper and no GPU capacity planning.
2. Need frontier models — GPT-4, Claude 3.5 Opus, o1, Gemini Ultra — no open-weights equivalent.
3. Fast time to market — no ML ops team.
4. Data sensitivity isn't critical — ZDR agreement is compliance-sufficient.
5. Special multi-modal — Sora, Veo, cutting-edge Gemini vision only on API.
Self-host when:
1. Very high volume — > 100M–1B tokens/day. Break-even vs API depends on model; typically 100M+ starts saving.
2. Data sovereignty — regulated industries (healthcare HIPAA, finance, government) require data to stay put.
3. Ultra-low latency — real-time use cases co-located with the app (trading, robotics).
4. Custom models — heavy fine-tuning, non-shareable weights.
5. Predictable cost — fixed spend vs per-token.
6. Air-gapped — on-prem, no internet.
7. Provider independence — avoid lock-in.
Break-even analysis (rough):
- API: GPT-4o $2.5/$10 per 1M input/output tokens (as of 2024; pricing will change with GPT-5). At 100M tokens/day with a 50/50 mix → ~$500k/month.
- Self-host Llama 3.3 70B (or Llama 4 as upgrade path) on 2x H100 80GB: GPU rental ~$4/hr × 2 × 720h = $5,760/month per instance. Serves ~50 RPS. Need 10–20 instances → ~$60–120k/month + team + overhead.
- Rough self-host break-even is around 50M–200M tokens/day for 70B models. Below that APIs win; above that self-host wins with competent ops.
Typical self-host stack (2025):
Models: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek-V3, Mixtral 8x22B — high-quality open weights.
Serving engine:
- vLLM — default, best balance.
- TensorRT-LLM — fastest on NVIDIA but complex to build.
- SGLang — fast structured output.
- Text Generation Inference (TGI) — HuggingFace.
Orchestration:
- Kubernetes with GPU node pools.
- KServe, Ray Serve, Seldon, BentoML — model serving frameworks.
- Autoscale on queue depth / GPU utilization.
Gateway:
- LiteLLM — unified OpenAI-compatible API; routes between providers and self-host.
- Portkey, Kong AI Gateway — commercial AI gateways.
- Handles: rate limits, retries, fallbacks, cost tracking.
Compute sourcing:
- Managed cloud: AWS Bedrock (hosts open models), GCP Vertex, Azure ML — convenient but pricier than raw GPUs.
- Raw GPU cloud: CoreWeave, Lambda Labs, RunPod, Together AI, Paperspace — cheap and flexible.
- Hyperscaler raw GPU: AWS (P5, P4), GCP (A3, G2), Azure — enterprise agreements.
- On-prem — heavy capex, only rational at large scale.
Self-host operational challenges:
1. Team — needs ML ops engineering, 1–3 FTE.
2. Capacity planning — hard due to uneven workloads; over-provision wastes, under → latency/errors.
3. Hardware reliability — GPU failures, driver issues, CUDA version conflicts.
4. Model updates — self-hosting means testing/deploying new models yourself (Llama 3.3 → Llama 4).
5. Observability — you build it (providers have it ready).
6. Multi-region, DR — replication, failover.
7. Security — GPU driver CVEs, weight file integrity.
Hybrid strategy (common in practice):
- Router picks self-host (for 80% easy queries) or API (for 20% hard queries needing frontier models).
- API as primary + self-host as fallback during outages.
- Fine-tuned self-host for core flows + API for edge cases.
Evaluation tool: compute 2–3-year Total Cost of Ownership (TCO), not just variable cost. Include: team, infra, observability, incidents, model upgrade cycles.