Chia theo đòn bẩy, giảm nhiều nhất trước:
1. Prompt Caching (giảm 50-90% prefix cost) — cache system prompt, few-shot, tool schema, RAG context tái dùng. Prefix đặt ở đầu, phần thay đổi ở cuối. Anthropic cache 5min-1h; OpenAI automatic cho ≥1024 token.
2. Model Routing (giảm 60-90%) — model rẻ (Haiku, 4o-mini) cho task đơn giản, model mạnh cho task khó. Pattern: classifier route, hoặc cascade (thử model rẻ trước, escalate khi confidence thấp).
3. Output length control — limit max_tokens, prompt yêu cầu "concise". Output token đắt 4-5x input (GPT-4o: $2.5/$10 in/out — as of 2024, giá thay đổi với GPT-5/Claude 4).
4. Batching + semantic caching — batch offline workload (~50% rẻ hơn qua OpenAI/Anthropic batch API); cache response cho query ngữ nghĩa tương đồng (GPTCache, Redis+vector, hit rate 20-40% với FAQ).
Observability trước khi optimize: LangSmith/Langfuse/Helicone đo $/request, cache hit rate, top-expensive endpoint. 80% chi phí thường đến từ 20% endpoint. Luôn đo eval suite sau thay đổi — rẻ mà kém không phải tiết kiệm.
By lever, biggest wins first:
1. Prompt Caching (50–90% prefix savings) — cache system prompt, few-shot examples, tool schema, reused RAG context. Put prefix at the start, variable parts at the end. Anthropic explicit cache 5min–1h; OpenAI automatic for ≥1024 tokens.
2. Model Routing (60–90% savings) — cheap model (Haiku, 4o-mini) for easy tasks, strong model for hard ones. Patterns: classifier route, or cascade (try cheap first, escalate on low confidence).
3. Output length control — cap max_tokens, ask for "concise". Output tokens cost 4–5x input (GPT-4o: $2.5/$10 in/out — as of 2024; prices change with GPT-5/Claude 4).
4. Batching + semantic caching — batch offline workloads (~50% cheaper via OpenAI/Anthropic batch APIs); cache responses for semantically similar queries (GPTCache, Redis+vector, 20–40% hit rate on FAQ).
Observability first: LangSmith/Langfuse/Helicone to track $/request, cache hit rate, top-expensive endpoints. 80% of cost usually comes from 20% of endpoints. Always measure eval suite after changes — cheaper but worse isn't savings.