How do you evaluate RAG end-to-end? How do the RAGAS metrics work?

Question

Luyện Phỏng Vấn IT · Accepted Answer

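RAG has two stages (retrieval and generation), so both need to be evaluated; measuring only the correctness of the final answer is not enough, because it cannot tell you which stage failed.

RAGAS (Retrieval Augmented Generation Assessment) provides a framework of metrics scored by an LLM-as-judge. Most of them are reference-free (no ground truth answer needed); Context Recall and Answer Correctness are the exceptions: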

A. Retrieval quality:

1. Context Precision: of the retrieved chunks, what percentage is actually relevant to the query? It measures noise in the context. How it is measured: for each chunk, ask the judge "is this chunk useful for answering the query?", then take the ratio of useful chunks to total retrieved chunks.

2. Context Recall: are the facts needed to answer present in the context? Requires a ground truth answer. How it is measured: decompose the ground truth into claims, check for each claim whether the context supports it, then recall = claims_supported / total_claims.

3. Context Relevance: whether the context as a whole is relevant to the query (an aggregate score).
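The two ratios above can be sketched in a few lines once the judge's verdicts are available. This is a minimal illustration, not the RAGAS implementation: the function names are hypothetical, and the boolean lists stand in for per-chunk and per-claim verdicts that an LLM judge would produce. As far as I know, the library's own context precision is rank-aware rather than a flat ratio, so `rank_aware_precision` is included as an illustrative variant of that idea.

```python
from typing import List


def context_precision(relevant_flags: List[bool]) -> float:
    """Flat ratio: useful chunks / total retrieved chunks."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


def rank_aware_precision(relevant_flags: List[bool]) -> float:
    """Average of precision@k over the ranks k that hold a useful chunk.

    Rewards retrievers that place useful chunks near the top.
    """
    hits = 0
    total = 0.0
    for k, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0


def context_recall(claim_supported: List[bool]) -> float:
    """claims_supported / total_claims over the ground truth claims."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judge verdicts for 4 retrieved chunks, in rank order.
chunk_verdicts = [True, False, True, False]
# Hypothetical verdicts for 3 claims extracted from the ground truth.
claim_verdicts = [True, True, False]

print(context_precision(chunk_verdicts))
print(rank_aware_precision(chunk_verdicts))
print(context_recall(claim_verdicts))
```

The flat ratio treats all positions equally, while the rank-aware variant drops when useful chunks are pushed down the list, which matters if the generator attends mostly to the first few chunks.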

B. Generation quality:

4. Faithfulness / Groundedness: is the response supported by the context? This is the hallucination measure. Decompose the response into claims, check for each claim whether the context supports it, then faithfulness = supported_claims / total_claims. The most important metric for RAG.

5. Answer Relevancy: does the response actually address the query? A response can be factually correct yet off-topic. How RAGAS measures it: from the response, an LLM generates N "reverse questions"; these are embedded and compared with the embedding of the original query, and relevancy = mean cosine similarity. The score is high when the response leads back to questions close to the original one.

6. Answer Correctness: requires ground truth; compares the response with the ground truth using semantic similarity plus factual overlap.
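The faithfulness ratio and the answer relevancy average described above reduce to simple arithmetic once the LLM and the embedding model have done their part. The sketch below assumes those outputs already exist: the boolean list stands in for per-claim judge verdicts, and the small vectors stand in for real embeddings. All names are hypothetical and none of this is the RAGAS API.

```python
import math
from typing import List, Sequence


def faithfulness(claim_supported: List[bool]) -> float:
    """supported_claims / total_claims over claims extracted from the response."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy(query_vec: Sequence[float],
                     reverse_question_vecs: List[Sequence[float]]) -> float:
    """Mean cosine similarity between the query and each reverse question."""
    if not reverse_question_vecs:
        return 0.0
    sims = [cosine(query_vec, v) for v in reverse_question_vecs]
    return sum(sims) / len(sims)


# Hypothetical verdicts: 3 of 4 response claims are supported by the context.
print(faithfulness([True, True, False, True]))

# Toy 2-d "embeddings": one reverse question matches the query, one is orthogonal.
query = [1.0, 0.0]
reverse_questions = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(query, reverse_questions))
```

In practice the expensive and error-prone part is the claim extraction and the per-claim verdicts, so it is worth logging those intermediate outputs rather than only the final score.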

C. Pipeline-level:

7. Noise Sensitivity: does the response change when noisy chunks are added to the context? This measures robustness.

8. Latency & Cost: the practical side of evaluation: time_to_first_token, total latency, $/query.
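A rough way to collect the latency and cost numbers above is to wrap the streaming call with a timer. The sketch below is an assumption-heavy illustration: `measure_stream`, `fake_stream`, and the `usd_per_token` price are hypothetical, and the cost estimate counts only output tokens, whereas a real bill would also include input tokens and judge calls.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(stream_fn: Callable[[], Iterable[str]],
                   usd_per_token: float) -> Dict[str, float]:
    """Time a token stream and estimate output cost for one query."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_fn():
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at is not None else float("nan")
    return {
        "time_to_first_token_s": ttft,
        "total_latency_s": end - start,
        "output_tokens": float(n_tokens),
        "cost_usd": n_tokens * usd_per_token,  # output tokens only (assumption)
    }


def fake_stream() -> Iterable[str]:
    """Stand-in for a streaming RAG response."""
    for token in ["RAG", " works", "."]:
        time.sleep(0.01)
        yield token


metrics = measure_stream(fake_stream, usd_per_token=0.001)
print(metrics)
```

Logging these values per query alongside the quality metrics makes it possible to see when a change improves faithfulness at the price of noticeably higher latency or cost.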

Standard eval workflow:
1. Build a golden dataset of 100-500 questions, each with (query, expected_context_docs, ground_truth_answer).
2. Run the RAG pipeline and log (query, retrieved, response).
3. Compute the RAGAS metrics with an LLM judge (GPT-4 or Claude).
4. Track each metric on a dashboard across versions of the prompt, chunking, and model.
5. Run it as a regression suite in CI/CD.
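The regression step at the end of the workflow can be as simple as comparing averaged metric scores against a stored baseline and failing the build on a drop. The sketch below assumes the per-question scores have already been computed; `mean_scores`, `regressions`, and the `max_drop` tolerance of 0.02 are hypothetical choices for illustration.

```python
from typing import Dict, List


def mean_scores(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over the per-question rows of the golden dataset."""
    if not rows:
        return {}
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}


def regressions(current: Dict[str, float],
                baseline: Dict[str, float],
                max_drop: float = 0.02) -> List[str]:
    """Names of metrics that fell more than max_drop below the baseline."""
    return sorted(
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base - max_drop
    )


# Hypothetical baseline from the last released version.
baseline = {"faithfulness": 0.90, "context_recall": 0.80}

# Hypothetical per-question scores from the candidate version.
rows = [
    {"faithfulness": 0.95, "context_recall": 0.70},
    {"faithfulness": 0.85, "context_recall": 0.80},
]

failed = regressions(mean_scores(rows), baseline)
print(failed)
# In CI, a non-empty list would fail the job, e.g. via `assert not failed`.
```

A tolerance is useful because LLM-judged scores fluctuate between runs; without it the suite tends to fail on noise rather than on real regressions.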

Tools:
- RAGAS (Python lib, de facto standard).
- DeepEval (metrics + pytest integration).
- TruLens (real-time monitoring + feedback).
- Arize Phoenix (eval + trace).
- LangSmith / Langfuse (trace + eval UI).

Caveat: LLM-as-judge is biased (position, self-preference, verbosity), so validate it with a periodic human spot-check of roughly 10% of the sample.
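One way to make the spot-check concrete is to draw a reproducible 10% sample and then measure how often the judge and the human reviewer agree beyond chance. The sketch below uses Cohen's kappa for that; the choice of kappa, the function names, and the example labels are assumptions for illustration, not something the tools above prescribe.

```python
import random
from typing import List, Sequence


def sample_for_review(ids: Sequence[str], fraction: float = 0.10,
                      seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of examples for human review."""
    rng = random.Random(seed)
    k = max(1, round(len(ids) * fraction))
    return rng.sample(list(ids), k)


def cohen_kappa(judge: List[bool], human: List[bool]) -> float:
    """Chance-corrected agreement between judge and human binary labels."""
    n = len(judge)
    if n == 0 or n != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical example ids and labels ("is this claim supported?").
ids = [f"q{i}" for i in range(50)]
print(len(sample_for_review(ids)))

judge_labels = [True, True, False, False]
human_labels = [True, True, False, True]
print(cohen_kappa(judge_labels, human_labels))
```

A low kappa on the sampled items is a signal to revise the judge prompt or switch judge models before trusting the dashboard numbers.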