What is the process for debugging a RAG system that is performing poorly (wrong answers, retrieval misses)?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Debugging RAG follows one principle: split the pipeline into stages and test each stage independently. Do not tune several things at once, because you will not be able to tell which change made the difference.

A systematic debugging process:

Step 1: Reproduce and classify failures

Collect 20-50 failure cases from production logs or user reports. Classify each one:
- Retrieval miss: the ground-truth doc is not in the top-K retrieved.
- Bad retrieval rank: the relevant doc is retrieved but ranked low.
- Unfaithful answer: the context is correct but the model hallucinates.
- Correct refusal: the context does not contain the information and the model says "I don't know" (not a failure).
- Wrong refusal: the information is in the context but the model does not use it.
- Wrong format: the answer is correct but poorly formatted.
- Prompt injection succeeded: user input made the model ignore the system prompt.

Step 2: Isolate the failing layer

For each case, log:
- Original query.
- Rewritten query (if any).
- Retrieved docs (top-K with scores).
- Reranked docs (if any).
- Final context.
- Ground-truth doc (from the human-annotated expected result).
- Model response.
- Expected response.

Compute metrics:
- Retrieval Recall@K: % of cases with the ground truth in the top-K.
- Retrieval Precision@K: % of retrieved chunks that are relevant.
- Faithfulness: % of responses supported by the context.
- Answer correctness: comparison against the expected response.

A case with retrieval recall = 0 (the ground truth never appears) → the problem is retrieval, not generation.

Step 3: Fix each layer

Address each failure class at the layer where it occurs:
- Retrieval miss: the ground truth is not in the top-K.
- Bad retrieval rank: the doc is retrieved but ranked low.
- Unfaithful answer: the context is correct but the model gets it wrong.
- Wrong refusal: the information is present but the model does not use it.

Step 4: Validate the fix

- Run on the full eval set (not only the cases you fixed) to check for regressions.
- Run on a golden dataset of 100-500 cases.
- Compare before/after: Recall@K, Faithfulness, Answer Correctness.
- A/B test on shadow production traffic.

Step 5: Root cause analysis

Do not just patch; ask why:
- Which query patterns always fail? (Add them to the eval set, add specialized handling.)
- Which doc types are hard to index? (Use a special parser.)
- Does the knowledge base need to be expanded?
- User expectation mismatch → UX education.

Important observability tools:
- LangSmith, Langfuse, Phoenix: trace the full pipeline.
- RAGAS: automated metrics.
- TruLens (and RAGAS): online evaluation.
- Custom dashboard: retrieval recall and faithfulness trends over time.

Debugging anti-patterns:
- Changing several things at once → you cannot tell which one was the fix.
- Looking at only one case → you fix that case and regress others.
- No eval set → you cannot measure improvement.
- Swapping the model without re-tuning chunking and prompts.
- Dismissing user bug reports → you miss systemic issues.

Experience tip: as a rule of thumb, roughly 70% of RAG issues are actually chunking and retrieval, 20% are prompt/context assembly, and 10% are the model. Do not jump to swapping the model first.
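The retrieval side of Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `FailureCase` fields, the function names, and the `good_rank=3` cutoff separating "ranked well" from "ranked low" are all assumptions made for the example. Recall@K is computed the way the answer defines it (a per-case hit that is then averaged over cases).

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class FailureCase:
    """One logged case. Field names are hypothetical, not tied to any tool."""
    query: str
    retrieved_ids: List[str]    # doc IDs in ranked order, best first
    ground_truth_ids: Set[str]  # human-annotated relevant doc IDs


def hit_at_k(case: FailureCase, k: int) -> float:
    """1.0 if any ground-truth doc appears in the top-k, else 0.0."""
    return 1.0 if any(d in case.ground_truth_ids for d in case.retrieved_ids[:k]) else 0.0


def precision_at_k(case: FailureCase, k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = case.retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(d in case.ground_truth_ids for d in top_k) / len(top_k)


def mean_recall_at_k(cases: List[FailureCase], k: int) -> float:
    """Share of cases whose ground truth shows up in the top-k."""
    return sum(hit_at_k(c, k) for c in cases) / len(cases) if cases else 0.0


def classify_retrieval(case: FailureCase, k: int, good_rank: int = 3) -> str:
    """Label the retrieval layer for one case.

    good_rank is an assumed cutoff: a relevant doc ranked below it
    counts as "retrieved but ranked low".
    """
    ranks = [
        rank
        for rank, doc_id in enumerate(case.retrieved_ids[:k], start=1)
        if doc_id in case.ground_truth_ids
    ]
    if not ranks:
        return "retrieval_miss"
    if min(ranks) > good_rank:
        return "retrieval_rank_bad"
    return "retrieval_ok"  # retrieval looks fine; inspect generation next


def summarize(cases: List[FailureCase], k: int) -> Counter:
    """Count how many cases fall into each retrieval label."""
    return Counter(classify_retrieval(c, k) for c in cases)
```

Cases labeled `retrieval_ok` are the ones worth passing on to faithfulness and answer-correctness checks; the other two labels point at chunking, embedding, or reranking before the prompt or the model is touched.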