This is a classic system design question for AI engineering interviews. The main components:
1. Frontend / Channel
- Web chat widget, mobile SDK, messaging (Zalo/Messenger/WhatsApp/Telegram), email, voice.
- Session management: a session_id tied to a user_id; state persisted in Redis.
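A minimal sketch of the session scheme above. The in-memory dict stands in for Redis (in production you would swap it for a `redis.Redis` client); the key format, TTL, and field names are illustrative assumptions:

```python
import json
import time

class SessionStore:
    """Keys look like Redis keys: session:{session_id} -> JSON state with a TTL."""

    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self._data = {}  # stand-in for Redis; each value is (expiry_ts, json_blob)

    def save(self, session_id, user_id, state):
        key = f"session:{session_id}"
        payload = json.dumps({"user_id": user_id, **state})
        self._data[key] = (time.time() + self.ttl, payload)

    def load(self, session_id):
        key = f"session:{session_id}"
        entry = self._data.get(key)
        if entry is None or entry[0] < time.time():
            return None  # expired or unknown -> start a fresh session
        return json.loads(entry[1])
```

With Redis itself, the TTL would be handled by `SETEX` rather than checked on read.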
2. API Gateway / Orchestrator
- Auth (JWT, API key), rate limiting (per user), PII redaction (Presidio) before forwarding.
- Request routing: classify intent (FAQ vs technical vs billing vs escalation) to pick the right pipeline.
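The intent-routing step above can be sketched as follows. A real system would use a small LLM classifier; the keyword cues, intent labels, and pipeline names here are placeholder assumptions for illustration:

```python
# Keyword cues are a stand-in for a small LLM/classifier call.
INTENTS = {
    "faq": ["how do i", "what is", "where can"],
    "billing": ["invoice", "charge", "refund", "payment"],
    "escalation": ["speak to a human", "complaint"],
}

def classify_intent(message: str) -> str:
    text = message.lower()
    for intent, cues in INTENTS.items():
        if any(cue in text for cue in cues):
            return intent
    return "technical"  # default bucket when nothing matches

def route(message: str) -> str:
    # Each intent maps to a downstream pipeline; names are illustrative.
    pipelines = {
        "faq": "rag_pipeline",
        "billing": "billing_pipeline",
        "escalation": "human_handoff",
        "technical": "agent_pipeline",
    }
    return pipelines[classify_intent(message)]
```

For example, `route("How do I reset my password?")` lands in the cheap RAG pipeline, while a billing complaint goes to the billing flow.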
3. Core LLM pipeline
- System prompt loaded from a versioned store (PromptLayer): tone, scope, safety rules, company info.
- Conversation memory: Redis stores the last N turns; when the history grows too long, a small LLM summarizes the older turns.
- User profile memory: Postgres/vector DB stores preferences, historical tickets, subscription plan.
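The "last N turns plus summarization" memory policy can be sketched like this. The `summarize` callable stands in for the small-LLM summarization call; the default lambda is only a placeholder:

```python
def compact_history(turns, max_turns=6, summarize=None):
    """Keep the last max_turns verbatim; fold older turns into a summary.

    `summarize` stands in for a call to a small, cheap LLM.
    Returns (summary_or_None, recent_turns).
    """
    if len(turns) <= max_turns:
        return None, turns
    overflow, recent = turns[:-max_turns], turns[-max_turns:]
    stub = lambda ts: f"[summary of {len(ts)} earlier turns]"
    summary = (summarize or stub)(overflow)
    return summary, recent
```

The summary is then prepended to the prompt in place of the dropped turns, keeping context length bounded.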
4. RAG layer (knowledge)
- Knowledge base: help articles, product docs, past resolved tickets, policies.
- Pipeline: embed user query → hybrid search (Qdrant dense + BM25) → cross-encoder rerank → top 3-5 chunks.
- Metadata filters: by product, language, user plan (only surface features the user actually has).
- Source citations are mandatory in the response.
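The hybrid-search step above needs a way to merge the dense and BM25 result lists; reciprocal-rank fusion (RRF) is one common choice, though the outline does not prescribe it. A sketch, with stub search/rerank callables standing in for Qdrant, BM25, and the cross-encoder:

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Reciprocal-rank fusion of two ranked doc-id lists (dense + BM25)."""
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query, dense_search, sparse_search, rerank, top_k=4):
    """Hybrid retrieval: fuse both rankings, then let a cross-encoder reorder."""
    fused = rrf_fuse(dense_search(query), sparse_search(query))
    return rerank(query, fused)[:top_k]
```

Documents that appear near the top of both rankings win; a doc found by only one retriever can still surface, which is the point of hybrid search.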
5. Tool / Agent layer (khi cần action)
- Tools: lookup_order, check_shipment, create_ticket, refund_request, escalate_to_human.
- Principle of least privilege: read tools need no confirmation; write tools (refund > $X, cancel subscription) require a confirmation step.
- Agent loop with max_iterations=10 and a 30s timeout.
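The bounded agent loop can be sketched as below. The `("tool", name, args)` / `("final", answer)` action shape and the fallback messages are illustrative assumptions; `llm_step` stands in for the model call:

```python
import time

def agent_loop(llm_step, execute_tool, max_iterations=10, timeout_s=30):
    """Bounded tool-use loop.

    `llm_step(observations)` returns either ("tool", name, args)
    or ("final", answer). Both bounds trigger a human handoff.
    """
    deadline = time.monotonic() + timeout_s
    observations = []
    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            return "This is taking too long; escalating to a human agent."
        action = llm_step(observations)
        if action[0] == "final":
            return action[1]
        _, name, args = action
        observations.append((name, execute_tool(name, args)))
    return "Step limit reached; escalating to a human agent."
```

Hard caps on iterations and wall-clock time keep a confused agent from looping forever or burning budget.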
6. Safety / Guardrails
- Input: PII redaction, prompt-injection detection, toxicity filter, scope filter.
- Output: PII-leak scan, faithfulness check against the retrieved context, content moderation, competitor-mention filter.
- Escalation triggers: low confidence, user asks for a human, sensitive issues (complaints, legal) → hand off to a human agent.
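The escalation triggers combine into a single gate. The confidence threshold, topic labels, and phrase matching below are illustrative assumptions; a production check would use the classifier outputs from the guardrail layer:

```python
SENSITIVE_TOPICS = {"legal", "complaint"}  # illustrative label set

def should_escalate(confidence: float, user_message: str, topic: str) -> bool:
    """Hand off to a human when any one trigger fires."""
    text = user_message.lower()
    asked_for_human = "human" in text or "agent" in text
    return confidence < 0.6 or asked_for_human or topic in SENSITIVE_TOPICS
```

Keeping the triggers in one predicate makes the handoff policy easy to audit and A/B test.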
7. Model strategy
- Router: a small classifier (Haiku/4o-mini) categorizes the request and routes it:
- Simple FAQ → Haiku/4o-mini + RAG (cheap, fast).
- Multi-step tasks → Sonnet/4o + tool use.
- Edge cases / escalations → Opus/o1.
- Prompt caching for the system prompt + tool schemas (cuts cost by 70-90%).
- Fallback provider: OpenAI ↔ Anthropic, failover on outage.
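The tiered routing plus cross-provider failover can be sketched together. The tier names and model identifiers are shorthand from the outline, not exact API model strings; `call_model` and `healthy` stand in for the provider SDK and a health check:

```python
ROUTES = {  # tier -> (primary, cross-provider fallback)
    "faq": ("claude-haiku", "gpt-4o-mini"),
    "multi_step": ("claude-sonnet", "gpt-4o"),
    "edge_case": ("claude-opus", "o1"),
}

def call_with_failover(tier, call_model, healthy):
    """Route to the tier's primary model; fail over on provider outage."""
    primary, fallback = ROUTES[tier]
    if healthy(primary):
        return call_model(primary)
    return call_model(fallback)
```

Because each tier's fallback lives on a different provider, a single-provider outage degrades quality at worst, not availability.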
8. Observability
- Trace all steps (agent, RAG, tool) with Langfuse/LangSmith.
- Metrics: containment rate (% resolved without a human), CSAT, average turns, latency p50/p95/p99, cost per conversation, top RAG-miss queries, guardrail trigger rate.
- A/B test infrastructure for prompt/model/RAG changes.
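A few of the metrics above reduce to simple aggregations over conversation logs. A sketch, where the log record fields (`escalated`, `turns`, `cost_usd`) are assumed names:

```python
def support_metrics(conversations):
    """Aggregate per-conversation log records into dashboard metrics.

    Each record is assumed to carry: escalated (bool), turns (int),
    cost_usd (float).
    """
    n = len(conversations)
    contained = sum(1 for c in conversations if not c["escalated"])
    return {
        "containment_rate": contained / n,
        "avg_turns": sum(c["turns"] for c in conversations) / n,
        "cost_per_conversation": sum(c["cost_usd"] for c in conversations) / n,
    }
```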
9. Evaluation loop
- Golden dataset: 200-500 curated Q&A pairs plus edge cases.
- Automated eval via LLM-as-judge plus weekly human spot checks.
- Regression suite runs on every prompt/model deploy.
- Flag low-quality conversations for human review → mine them for more examples.
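The per-deploy regression suite can be reduced to a gate over the golden set. The `judge` callable stands in for the LLM-as-judge call (here it returns a 0-1 score); the threshold value is an illustrative assumption:

```python
def regression_gate(golden, answer_fn, judge, threshold=0.9):
    """Run the golden set through the pipeline and gate the deploy.

    golden: list of (question, expected) pairs.
    judge(question, answer, expected) -> score in [0, 1],
    standing in for an LLM-as-judge call.
    Returns (passes_gate, pass_rate).
    """
    scores = [judge(q, answer_fn(q), expected) for q, expected in golden]
    pass_rate = sum(scores) / len(scores)
    return pass_rate >= threshold, pass_rate
```

Wiring this into CI means a prompt or model change that regresses the golden set never ships.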
10. Infrastructure
- Stateless API behind a load balancer (horizontal scaling).
- Async/queue for long-running agent tasks (BullMQ, SQS).
- Multi-region deployment + CDN for latency.
- Secrets via Vault/Secrets Manager.
Example capacity sizing: 10K users/day × 5 turns/conversation × 4K tokens/turn = ~200M tokens/day → roughly $500-2000/day with GPT-4o-mini; cache and routing optimizations can bring that down to 30-40% of baseline.
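The sizing arithmetic above, as a formula. The per-million-token price and the cache/routing discount factor are inputs you would plug in from current provider pricing, not fixed facts:

```python
def sizing(users_per_day, turns_per_conversation, tokens_per_turn,
           usd_per_mtok, optimization_factor=1.0):
    """Back-of-envelope daily token volume and cost.

    usd_per_mtok: blended $ per 1M tokens (take from current pricing).
    optimization_factor: multiplier after caching/routing (e.g. 0.35
    for the outline's "down to 30-40%" claim).
    """
    tokens = users_per_day * turns_per_conversation * tokens_per_turn
    cost = tokens / 1e6 * usd_per_mtok * optimization_factor
    return tokens, cost
```

With the outline's numbers, `sizing(10_000, 5, 4_000, price)` gives the 200M tokens/day figure; the dollar result depends entirely on the price you assume.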
Trade-offs worth discussing: latency vs quality (streaming responses improve perceived latency), cost vs accuracy (model routing), determinism vs creativity (temperature per persona), self-host vs API (volume threshold around 100M tokens/day).