Debug RAG on one principle: split the pipeline into stages and test each stage independently. Don't tune several things at once, or you won't know which change made the difference.
Debug model:
        User Query
             │
             ▼
┌─────────────────────────┐
│ A. Query Understanding  │──── test: does rewriting work?
└─────────────────────────┘
             │
             ▼
┌─────────────────────────┐
│ B. Retrieval            │──── test: is golden doc in top-K?
└─────────────────────────┘
             │
             ▼
┌─────────────────────────┐
│ C. Reranking            │──── test: is top-3 relevance right?
└─────────────────────────┘
             │
             ▼
┌─────────────────────────┐
│ D. Context Assembly     │──── test: is context sufficient?
└─────────────────────────┘
             │
             ▼
┌─────────────────────────┐
│ E. Generation           │──── test: is answer faithful?
└─────────────────────────┘
             │
             ▼
          Answer
Systematic debug process:
Step 1: Reproduce & classify failures
Collect 20–50 failure cases from production logs or user reports. Classify:
- Retrieval miss: the ground truth doc is not in the retrieved top-K.
- Retrieval rank bad: the relevant doc is present but ranked low.
- Unfaithful answer: the context is correct but the model hallucinates.
- Correct refusal: the context lacks the info and the model says "I don't know" (not a failure).
- Wrong refusal: the info is in the context but the model does not use it.
- Format wrong: the answer is correct but poorly formatted.
- Prompt injection succeeded: user input made the model ignore the system prompt.
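Most of these categories can be assigned mechanically once each case is logged. The sketch below is one possible triage function, not a fixed method: the `Case` fields and the `classify` helper are hypothetical names, and it assumes that "ranked low" means the ground truth doc was retrieved but did not reach the final context. Format problems and prompt injection need a human or LLM judge, so they are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    """One logged failure case (hypothetical schema for illustration)."""
    gold_doc_id: Optional[str]   # None: the knowledge base has no doc with the answer
    retrieved_ids: List[str]     # top-K doc ids from retrieval, best first
    context_ids: List[str]       # doc ids that reached the final context
    refused: bool                # model answered "I don't know"
    answer_correct: bool         # response matches the expected response


def classify(case: Case) -> str:
    """Map a logged case to one of the failure categories listed above."""
    if case.gold_doc_id is None:
        # Nothing to retrieve: refusing is the right behavior,
        # answering anyway means the model made something up.
        return "correct refusal" if case.refused else "unfaithful answer"
    if case.gold_doc_id not in case.retrieved_ids:
        return "retrieval miss"
    if case.gold_doc_id not in case.context_ids:
        # Assumption: retrieved but cut before the context means it ranked too low.
        return "retrieval rank bad"
    if case.refused:
        return "wrong refusal"
    if not case.answer_correct:
        return "unfaithful answer"
    return "ok"


example = Case(gold_doc_id="doc-7", retrieved_ids=["doc-2", "doc-9"],
               context_ids=["doc-2"], refused=False, answer_correct=False)
print(classify(example))  # retrieval miss
```

Counting the labels over the 20–50 collected cases shows which layer to look at first.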
Step 2: Isolate the failing layer
For each case, log:
- Original query.
- Rewritten query (if any).
- Retrieved docs (top-K with scores).
- Reranked docs (if any).
- Final context.
- Ground truth doc (from the human-annotated expected answer).
- Model response.
- Expected response.
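A minimal sketch of such a log record, written as one JSON line per case so it can be appended to a file and replayed later. The `TraceRecord` class, its field names, and `to_jsonl` are assumptions made for illustration; a tracing tool would supply its own schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass
class TraceRecord:
    """Everything needed to replay one case stage by stage (hypothetical schema)."""
    original_query: str
    retrieved: List[Tuple[str, float]]            # (doc_id, score), best first
    final_context: str
    ground_truth_doc_id: str
    model_response: str
    expected_response: str
    rewritten_query: Optional[str] = None         # only if query rewriting is enabled
    reranked: Optional[List[Tuple[str, float]]] = None  # only if a reranker is used


def to_jsonl(record: TraceRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = TraceRecord(
    original_query="What is the refund window?",
    retrieved=[("doc-7", 0.82), ("doc-2", 0.64)],
    final_context="(text of doc-7)",
    ground_truth_doc_id="doc-7",
    model_response="I don't know.",
    expected_response="30 days.",
)
print(to_jsonl(record))
```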
Compute metrics:
- Retrieval Recall@K: % of cases with the ground truth doc in the top-K.
- Retrieval Precision@K: % of retrieved chunks that are relevant.
- Faithfulness: % of responses supported by the context.
- Answer correctness: agreement with the expected response.
If a case has retrieval recall = 0 (the ground truth doc is absent from the top-K), the problem is retrieval, not generation, and no prompt or model change will fix it.
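The two retrieval metrics reduce to a few lines of counting. The sketch below assumes one ground truth doc per case and uses hypothetical helper names (`recall_at_k`, `precision_at_k`, `retrieval_failures`); the doc ids are made-up placeholders.

```python
from typing import List, Set, Tuple


def recall_at_k(cases: List[Tuple[str, List[str]]], k: int) -> float:
    """Fraction of cases whose ground truth doc appears in the top-k retrieved ids."""
    hits = sum(1 for gold, retrieved in cases if gold in retrieved[:k])
    return hits / len(cases)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant, for a single query."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def retrieval_failures(cases: List[Tuple[str, List[str]]], k: int) -> List[int]:
    """Indices of cases where the ground truth doc is absent from the top-k."""
    return [i for i, (gold, retrieved) in enumerate(cases) if gold not in retrieved[:k]]


# Each case: (ground truth doc id, retrieved doc ids in rank order).
cases = [
    ("d1", ["d1", "d2", "d3"]),
    ("d4", ["d2", "d3", "d5"]),   # ground truth never retrieved
    ("d6", ["d7", "d6", "d8"]),   # ground truth retrieved at rank 2
]
print(recall_at_k(cases, k=1), recall_at_k(cases, k=3))
print(retrieval_failures(cases, k=3))  # [1]
```

Cases returned by `retrieval_failures` go to the retrieval checklist; only the remaining cases are worth inspecting for generation problems.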
Step 3: Fix each layer
Retrieval miss (ground truth not in top-K):
□ Is the chunk size appropriate? (try 256, 512, 1024)
□ Enough overlap? (10–20%)
□ Are chunks cut through the middle of important sections?
□ Embedding model matched to the domain? (try BGE vs OpenAI)
□ Query distribution very different from the docs?
  → query transformation (HyDE, decomposition)
□ Need hybrid search (BM25)? Especially for acronyms and rare terms.
□ Metadata filters accidentally excluding the right docs?
□ Top-K too small? Try 20–50 instead of 5.
Retrieval rank bad (present but low-ranked):
□ Add a reranker (Cohere, BGE): usually the biggest win.
□ Is the score threshold appropriate?
□ Chunks too small → the relevant content is fragmented across chunks and each fragment ranks low
  → parent-child chunking
□ Hybrid fusion weight (BM25 vs dense): tune alpha.
Unfaithful answer (context right, model wrong):
□ Weak prompt instruction?
  "Answer ONLY from context. If missing, say 'I don't know'"
□ Context too long → "lost in the middle"
  → trim to the top 3–5 chunks only
  → rank chunks by relevance, then place the most relevant at the start and end of the context
□ Citations required?
  "Cite [doc X] for each claim"
□ Temperature too high?
  → set 0 for factual tasks
□ Model too weak?
  → try Claude 3.5 Sonnet or GPT-4o
□ Conflicting instructions in the context?
  → does the user input contain an injection?
Wrong refusal (info present but ignored):
□ Too much noise in context → model confused
□ Instructions too conservative → relax them
□ Add few-shot examples of "info present → answer" cases
□ Heavy refusal bias in the system prompt?
Step 4: Validate fixes
- Run on the full eval set (not just the cases you fixed) to check for regressions.
- Run on a golden dataset of 100–500 cases.
- Compare before/after: Recall@K, Faithfulness, Answer Correctness.
- A/B test on shadow production traffic.
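The before/after comparison is easy to automate so that a regression in one metric is not hidden by a gain in another. This is a sketch under assumptions: `compare_runs` and the `tolerance` parameter are hypothetical, and the metric values are illustrative placeholders, not measured results.

```python
from typing import Dict, Tuple


def compare_runs(before: Dict[str, float], after: Dict[str, float],
                 tolerance: float = 0.01) -> Dict[str, Tuple[float, str]]:
    """Label each metric as improved, regressed, or unchanged.

    Changes smaller than `tolerance` in either direction count as noise.
    """
    report = {}
    for name, old in before.items():
        delta = after[name] - old
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[name] = (round(delta, 4), status)
    return report


# Placeholder numbers for illustration only.
before = {"recall@5": 0.62, "faithfulness": 0.81, "answer_correctness": 0.70}
after = {"recall@5": 0.78, "faithfulness": 0.77, "answer_correctness": 0.705}

for metric, (delta, status) in compare_runs(before, after).items():
    print(f"{metric:20s} {delta:+.3f}  {status}")
```

In this made-up run the retrieval fix raises recall but lowers faithfulness, which is the kind of trade-off that only shows up when all metrics are compared on the same eval set.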
Step 5: Root cause analysis
Don't just patch; ask why:
- Which query patterns always fail? (add them to the eval set, add specialized handling)
- Which doc types are hard to index? (special parser)
- Does the KB need expansion?
- User expectation mismatch → UX education.
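One way to make the first two questions concrete is to tag each eval case (by query pattern or doc type) and compute the failure rate per tag. The function name `failure_rate_by_tag` and the tags are hypothetical; the tagging itself is assumed to be done by hand or by a separate classifier.

```python
from collections import defaultdict
from typing import List, Tuple


def failure_rate_by_tag(cases: List[Tuple[List[str], bool]]) -> List[Tuple[str, float, int]]:
    """Return (tag, failure rate, case count) sorted by failure rate, worst first.

    Each case is (tags, failed), where `failed` is True if the case failed.
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for tags, failed in cases:
        for tag in tags:
            totals[tag] += 1
            if failed:
                fails[tag] += 1
    rows = [(tag, fails[tag] / totals[tag], totals[tag]) for tag in totals]
    # Worst failure rate first; ties broken by larger sample, then tag name.
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


# Made-up tagged cases for illustration.
tagged_cases = [
    (["acronym"], True),
    (["acronym", "table"], True),
    (["table"], False),
    (["howto"], False),
]
for tag, rate, count in failure_rate_by_tag(tagged_cases):
    print(f"{tag:10s} {rate:.0%} of {count} cases failed")
```

Tags with a high failure rate and a reasonable case count are candidates for new eval cases or specialized handling; tags with only one or two cases need more data before drawing conclusions.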
Key observability tools:
- LangSmith, Langfuse, Phoenix: full-pipeline tracing.
- Ragas: automated metrics.
- Ragas, TruLens: online evaluation.
- Custom dashboard: retrieval recall and faithfulness trends over time.
Debug anti-patterns:
- Change many things at once → can't attribute fixes.
- Look at a single case → fix for that one, regress others.
- No eval set → can't measure improvement.
- Swap models without re-tuning chunking/prompt.
- Dismiss user bug reports → miss systemic issues.
Experience tip: as a rule of thumb, about 70% of RAG issues turn out to be chunking and retrieval, 20% prompt/context assembly, and 10% the model. Don't start by swapping the model.