An AI gateway (a.k.a. LLM gateway / LLM proxy) is middleware between apps and LLM providers — like a traditional API gateway but LLM-aware.
Why you need one:
1. Multi-provider — some endpoints use OpenAI, others Anthropic, others self-hosted. Don't make each app integrate separately.
2. Cost & usage tracking — attribute tokens/cost per team/feature/user; budget alerts.
3. Rate limits & quotas — per team/user; one rogue team doesn't burn the org's budget.
4. Fallback & failover — provider down → switch to backup.
5. Caching — semantic + prompt cache, reduce cost.
6. Guardrails — centralized input/output filtering, PII redaction.
7. Observability — unified trace, log, metrics.
8. Security — API keys live in the gateway, apps never hold them.
9. Compliance — audit logs, data residency enforcement.
10. A/B testing — traffic routing across model/prompt versions.
Architecture:
┌─────────────┐     ┌──────────────────────────────┐     ┌─────────────┐
│  App A/B/C  │────▶│          AI Gateway          │────▶│   OpenAI    │
└─────────────┘     │                              │────▶│  Anthropic  │
                    │  ┌──────────────────────┐    │────▶│   Google    │
                    │  │ Auth & Tenant        │    │────▶│  Self-host  │
                    │  ├──────────────────────┤    │     │   (vLLM)    │
                    │  │ Rate limit / quota   │    │     └─────────────┘
                    │  ├──────────────────────┤    │
                    │  │ Input guardrail      │    │
                    │  ├──────────────────────┤    │
                    │  │ Router / Fallback    │    │
                    │  ├──────────────────────┤    │
                    │  │ Cache (semantic)     │    │
                    │  ├──────────────────────┤    │
                    │  │ Retry / Circuit brk  │    │
                    │  ├──────────────────────┤    │
                    │  │ Output guardrail     │    │
                    │  ├──────────────────────┤    │
                    │  │ Observability        │    │
                    │  └──────────────────────┘    │
                    └──────────────────────────────┘
Component details:
1. OpenAI-compatible API — expose /v1/chat/completions, /v1/embeddings → apps use the OpenAI SDK unchanged, just swap base_url.
2. Auth & tenant — API keys per team/app; check scopes (team X only allowed on model Y).
3. Rate limits & quotas — Redis-based token bucket; daily/monthly caps per team; alert at X% usage.
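A minimal sketch of the token bucket named in item 3, kept in process memory so it runs on its own. A real gateway would hold this state in Redis so every replica shares it; the class name, the parameters, and the injectable `now` clock are assumptions made for illustration.

```python
import time
from typing import Optional


class TokenBucket:
    """In-memory token bucket; one instance per team or user."""

    def __init__(self, capacity: float, refill_per_s: float, now: Optional[float] = None):
        self.capacity = capacity          # maximum burst size
        self.refill_per_s = refill_per_s  # sustained request rate
        self.tokens = capacity            # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def try_consume(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then take `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would answer with HTTP 429
```

Passing a larger `cost` for expensive models is one way to make the same bucket approximate a token or spend quota rather than a plain request count.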
4. Router:
- Static routing: model=smart → Claude 3.5 Sonnet.
- Cost routing: cheapest model meeting quality target.
- Cascade routing: try small → fallback larger on low confidence.
- Semantic routing: classify query → specialized model.
- Capability routing: tool use → OpenAI/Anthropic; plain text → cheap local.
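The static and capability strategies above can be sketched as a single function. This is an illustration under assumptions: the alias table, the model identifier strings, and the `ChatRequest` shape are hypothetical, and cost, cascade, and semantic routing are left out because they need quality scores or a classifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical alias table; the identifiers are placeholders.
STATIC_ALIASES = {
    "smart": "anthropic/claude-3-5-sonnet",
    "cheap": "local/small-model",
}


@dataclass
class ChatRequest:
    model: str
    messages: list
    tools: Optional[list] = None  # OpenAI-style tool definitions, if any


def route(request: ChatRequest) -> str:
    """Return the upstream model identifier for a request."""
    # Capability routing: requests that need tool use go to a hosted model.
    if request.tools:
        return STATIC_ALIASES["smart"]
    # Static routing: resolve a logical alias to a concrete model.
    if request.model in STATIC_ALIASES:
        return STATIC_ALIASES[request.model]
    # Anything else is passed through unchanged.
    return request.model
```

Keeping aliases such as `smart` in gateway config means the concrete model can be swapped without touching any app.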
5. Fallback chain — primary Anthropic → fallback OpenAI → fallback Gemini. Triggers: timeout, rate limit, 5xx.
6. Cache:
- Exact-hash cache for deterministic requests (temperature=0).
- Semantic cache for Q&A (embedding similarity).
- TTL per use case.
- Cache trust level (don't cache personalized or user-specific).
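A rough sketch of the exact-hash cache with a TTL and a simple trust check. The in-memory `TTLCache` stands in for Redis, and treating the OpenAI-style `user` field as the marker for personalized requests is an assumption; a real gateway would likely use richer signals.

```python
import hashlib
import json
import time
from typing import Optional


def cache_key(payload: dict) -> Optional[str]:
    """Return a stable key for a request, or None if it must not be cached."""
    # Only cache requests pinned to temperature=0. A missing temperature
    # means the provider default applies, so it is treated as non-cacheable.
    if payload.get("temperature") != 0:
        return None
    # Trust check: skip requests tagged with an end-user identifier.
    if payload.get("user"):
        return None
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis: key -> (expiry, value)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (now + self.ttl, value)
```

A semantic cache would replace the hash lookup with a nearest-neighbor search over embeddings plus a similarity threshold, which is not shown here.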
7. Centralized guardrails:
- Input: prompt injection detection, PII redaction, topic filter, toxicity.
- Output: PII leak scan, content moderation.
- Pluggable → Llama Guard, Presidio, Azure Content Safety.
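To make the input-side PII redaction concrete, here is a deliberately simple regex sketch. It is a stand-in, not how Presidio or Llama Guard work: the two patterns and the placeholder format are assumptions, and real detectors cover many more entity types and locales.

```python
import re

# Simple illustrative patterns; they will miss many real-world formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


def redact_messages(messages: list) -> list:
    """Return a copy of chat messages with string content redacted."""
    return [
        {**m, "content": redact(m["content"])} if isinstance(m.get("content"), str) else m
        for m in messages
    ]
```

The same `redact` function could be applied to responses before logging, which also addresses the raw-logging concern.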
8. Observability:
- Trace ID end-to-end.
- Prometheus metrics: RPS, latency, error rate, cost/min per provider/team.
- Log to S3/BigQuery for analysis.
- Integrate Langfuse/LangSmith for LLM-specific tracing.
9. Policy engine — per-team rules: "team HR may not call Anthropic with PII data", "team Finance must use Azure OpenAI (EU data residency)".
10. Retries, circuit breaker — per-provider health tracking; circuit trips after N consecutive failures.
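Items 5 and 10 fit together: the fallback chain walks providers in order, and a per-provider circuit breaker decides whether a provider is tried at all. The sketch below is one possible shape, assuming a hypothetical `ProviderError` that stands for timeout, rate limit, or 5xx, and provider calls passed in as plain functions.

```python
import time
from typing import Optional


class ProviderError(Exception):
    """Hypothetical retryable failure: timeout, rate limit, or 5xx."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: Optional[float] = None) -> bool:
        """True if closed, or if the cooldown has passed (half-open probe)."""
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: Optional[float] = None) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # Trip (or re-trip after a failed probe) and restart the cooldown.
            self.opened_at = time.monotonic() if now is None else now


def call_with_fallback(chain, request):
    """chain: ordered list of (name, call, breaker) tuples."""
    last_error = None
    for name, call, breaker in chain:
        if not breaker.allow():
            continue  # circuit open: skip this provider
        try:
            result = call(request)
        except ProviderError as exc:
            breaker.record_failure()
            last_error = exc
            continue  # try the next provider in the chain
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or have open circuits") from last_error
```

Returning the provider name alongside the result lets the observability layer record which provider actually served the request.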
Deployment options:
Self-build:
- Node.js/Python service + Redis + Postgres (usage tracking).
- Deploy on Kubernetes / Cloud Run / Lambda.
- MVP timeline: a few sprints.
Open source:
- LiteLLM (Python): gateway + SDK, supports 100+ providers, built-in fallback, caching, logging. Widely adopted.
- Portkey (open core): more features, dashboard.
Commercial:
- OpenRouter: hosted gateway with a unified API (a hosted service rather than open-source software).
- Kong AI Gateway: enterprise, on Kong.
- Cloudflare AI Gateway: edge-based, generous free tier.
- Vercel AI Gateway: deep integration with the Vercel stack (Next.js, AI SDK).
- Portkey, Helicone, TrueFoundry, LangSmith Proxy: managed.
Best practices:
1. Versioned API at the gateway — upstream breaking changes don't impact apps.
2. Multi-region — keep the gateway close to apps, reduce latency.
3. Graceful degradation — when every provider is down, return cached/default responses instead of failing.
4. Kill switches per model/feature — fast rollback.
5. Usage dashboard for cost awareness across the org.
6. Don't be a bottleneck: keep gateway-added latency under 50 ms; if the pipeline gets heavier than that, scale out or skip the heavier stages on the streaming path.
Anti-patterns:
- Gateway breaks streaming → kills chat UX.
- All apps share one API key → can't attribute cost.
- No caching → money burned on repeated queries.
- Logging raw prompts/responses with PII → compliance breach.
- No circuit breaker → one provider outage takes the whole system down.
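The shared-key anti-pattern is easier to see next to a sketch of what per-key attribution looks like. Everything here is hypothetical: the key names, the team mapping, and the prices are placeholders, not real pricing data, and a real gateway would persist the ledger (for example in Postgres) rather than in memory.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens as (input, output).
PRICES_PER_MTOK = {"model-a": (3.0, 15.0), "model-b": (0.5, 1.5)}

# Hypothetical mapping from gateway-issued keys to teams.
KEY_TO_TEAM = {"gw-key-hr": "hr", "gw-key-fin": "finance"}


class UsageLedger:
    """Accumulates cost per team from token counts reported by providers."""

    def __init__(self):
        self.cost_by_team = defaultdict(float)

    def record(self, api_key: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        team = KEY_TO_TEAM[api_key]  # one key per team makes this lookup possible
        price_in, price_out = PRICES_PER_MTOK[model]
        cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000
        self.cost_by_team[team] += cost
        return cost
```

With a single shared key, the `KEY_TO_TEAM` lookup collapses to one entry and every request lands in the same bucket, which is exactly why cost cannot be attributed.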