Implement semantic caching for LLM queries (cache responses for similar queries).

Question

Luyện Phỏng Vấn IT · Accepted Answer

Traditional caches (Redis, Memcached) match on the exact key: a user who types "capital of VN?" and then "What is the capital of Vietnam?" gets a cache miss even though both questions mean the same thing. A semantic cache embeds the query, searches the cache for a similar stored query, and returns the cached response if the similarity is high enough. Benefit: a 20-40% hit rate for customer support, FAQ, and Q&A bots, with a corresponding reduction in cost and latency.

A simple implementation can be built in Python with Qdrant as the vector store: embed the incoming query, search for the nearest cached query, return its response if the similarity score clears a threshold, and otherwise call the LLM and store the new query-response pair.

Production upgrades:

1. TTL (time-to-live): store a timestamp and expire entries after N hours or days. This matters for time-sensitive queries (prices, exchange rates, weather).
2. Context-aware keys: only cache queries that carry no user-specific context.
   - "What is my account balance?" must NOT be cached (per-user).
   - "What is the transfer fee?" is safe to cache (general).
   - Classify each query before storing it.
3. Threshold tuning:
   - 0.95+: conservative, fewer hits but accurate.
   - 0.85-0.92: aggressive, more hits but with a risk of false positives.
   - Measure: take 100 queries and manually check whether each cache hit is actually correct.
4. Cache invalidation: when the source data is updated, invalidate the affected entries. Tag each entry with the source version.
5. Multi-tenancy: use one namespace per tenant, so that one tenant never sees another tenant's cache.
6. Partial cache: cache individual components (RAG retrieval results, embeddings) separately, not just the full response.
7. Monitoring:
   - Hit rate, overall and per endpoint.
   - Cost saved ($/hour).
   - False positive rate (wrong responses caused by a mistaken cache match).
   - Cache size and eviction rate.

Existing tools:

- GPTCache (Python library): a ready-made semantic cache layer that supports multiple vector backends and LLM providers. It can be used as a drop-in replacement for the OpenAI client.
- LangChain: has an LLM cache layer with a semantic option (for example `RedisSemanticCache`).
- Redis + RediSearch: built-in vector search.
- Portkey, Helicone: commercial gateways with semantic caching.

Limitations and caveats:

- Multi-turn chat is hard to cache: the context depends on the conversation history, so a semantic match on the last message alone is unreliable. Usually only single-turn queries are cached.
- Personalized responses cannot be cached.
- Creative tasks (writing an email, a story): caching gives every user the same output, which is bad UX.
- Cached data stored in a database must be encrypted or PII-scrubbed first (compliance).
- A threshold that is too low causes false positives; one that is too high gives a low hit rate.

Rule: enable semantic caching only after baseline cost observability is in place; measure the real hit rate, then tune the threshold. Do not apply it blindly.
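To make the lookup flow, TTL, per-tenant namespace, and similarity threshold concrete, here is a minimal sketch using only the Python standard library. It is not the Qdrant-based implementation mentioned above: an in-memory list stands in for the vector store, and `toy_embed` (character-trigram counts) stands in for a real embedding model, so it only matches near-identical wording, not true paraphrases. The names `SemanticCache`, `Entry`, `toy_embed`, and `cosine` are hypothetical, chosen for illustration.

```python
import math
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): character-trigram counts.
    A real system would call an embedding model here."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(max(len(s) - 2, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    vector: Counter
    response: str
    stored_at: float  # timestamp used for TTL expiry


class SemanticCache:
    def __init__(
        self,
        embed: Callable[[str], Counter] = toy_embed,
        threshold: float = 0.9,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.embed = embed
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # One list of entries per tenant, so tenants never share hits.
        self._entries: Dict[str, List[Entry]] = {}

    def get(self, tenant: str, query: str) -> Optional[str]:
        """Return a cached response if a similar, unexpired query exists."""
        now = self.clock()
        entries = self._entries.get(tenant, [])
        # Drop expired entries in place before searching.
        entries[:] = [e for e in entries if now - e.stored_at < self.ttl_seconds]
        vec = self.embed(query)
        best = max(entries, key=lambda e: cosine(vec, e.vector), default=None)
        if best is not None and cosine(vec, best.vector) >= self.threshold:
            return best.response
        return None

    def put(self, tenant: str, query: str, response: str) -> None:
        """Store a query-response pair under the tenant's namespace."""
        entry = Entry(self.embed(query), response, self.clock())
        self._entries.setdefault(tenant, []).append(entry)


# Usage: check the cache first, call the LLM only on a miss.
cache = SemanticCache(threshold=0.9, ttl_seconds=3600)
cache.put("acme", "What is the transfer fee?", "The fee is 1 USD.")
hit = cache.get("acme", "what is the transfer fee")
miss = cache.get("acme", "How do I reset my password?")
```

The linear scan in `get` is acceptable only for a small cache; a vector store such as Qdrant or Redis with vector search would replace it with an approximate nearest-neighbor query. Queries classified as user-specific should simply never reach `put`.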