RAG has two stages (retrieval + generation) → evaluate both; measuring final answer correctness alone is not enough.
RAGAS (Retrieval Augmented Generation Assessment) provides a framework of mostly reference-free metrics scored by an LLM-as-judge. Most of them need no ground truth answer; Context Recall and Answer Correctness below are the exceptions:
A. Retrieval quality:
1. Context Precision: what share of the retrieved chunks is actually relevant to the query? Measures noise in the context. How: for each chunk, ask the judge "is this chunk useful for answering the query?" → precision = useful chunks / total retrieved. (RAGAS's own implementation is rank-aware and gives more credit when relevant chunks are ranked first.)
2. Context Recall: are the facts needed to answer present in the context? Requires a ground truth answer. How: decompose the ground truth into claims → for each claim, check whether the context supports it → recall = claims_supported / total_claims.
3. Context Relevance: is the context as a whole relevant to the query (aggregate score)?
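The two ratio metrics above could be sketched as follows. This is a minimal illustration of the simple ratios described in these notes, not RAGAS's implementation: the `Judge` type, `context_precision`, and `context_recall` are hypothetical names, the judge is any callable standing in for an LLM call, and the claims are assumed to be already extracted from the ground truth.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (query_or_claim, text) -> verdict.
# In a real setup this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, str], bool]


def context_precision(query: str, chunks: Sequence[str], is_useful: Judge) -> float:
    """Unranked precision: useful chunks / total retrieved chunks."""
    if not chunks:
        return 0.0
    useful = sum(1 for chunk in chunks if is_useful(query, chunk))
    return useful / len(chunks)


def context_recall(claims: Sequence[str], chunks: Sequence[str], is_supported: Judge) -> float:
    """Ground-truth claims supported by the context / total claims."""
    if not claims:
        return 0.0
    context = "\n".join(chunks)
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


if __name__ == "__main__":
    # Toy judges (substring checks) used only to exercise the functions.
    chunks = ["Paris is the capital of France.", "Bananas are rich in potassium."]
    print(context_precision("capital of France?", chunks, lambda q, c: "France" in c))
    print(context_recall(["Paris", "Lyon"], chunks, lambda claim, ctx: claim in ctx))
```

Passing the judge in as a parameter keeps the arithmetic testable without any model access; swapping in a real LLM judge only changes the callable.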
B. Generation quality:
4. Faithfulness / Groundedness: is the response supported by the context (measures hallucination)? Decompose the response into claims → for each claim, check whether the context supports it → faithfulness = supported_claims / total_claims. The most important metric for RAG.
5. Answer Relevancy: does the response actually address the query (a response can be factually correct yet off-topic)? RAGAS method: from the response, an LLM generates N "reverse questions" → embed them → compare each with the embedding of the original query → relevancy = mean cosine similarity. The score is high when the response implies questions close to the original.
6. Answer Correctness: requires ground truth; compares the response with the ground truth using semantic similarity + factual overlap.
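As a rough sketch of the faithfulness ratio and the mean-cosine relevancy score described above: the function names are hypothetical, claim extraction, reverse-question generation, and embedding are assumed to happen elsewhere (typically via LLM and embedding-model calls), and only the final arithmetic is shown.

```python
import math
from typing import Callable, Sequence


def faithfulness(response_claims: Sequence[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Response claims supported by the context / total response claims."""
    if not response_claims:
        return 0.0
    supported = sum(1 for claim in response_claims if is_supported(claim, context))
    return supported / len(response_claims)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(query_vec: Sequence[float],
                     reverse_question_vecs: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between the query embedding and the
    embeddings of the N reverse questions generated from the response."""
    if not reverse_question_vecs:
        return 0.0
    total = sum(cosine(query_vec, vec) for vec in reverse_question_vecs)
    return total / len(reverse_question_vecs)
```

A response whose reverse questions all embed close to the original query scores near 1; reverse questions that drift to other topics pull the mean down.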
C. Pipeline-level:
7. Noise Sensitivity: does the response change when noisy chunks are added (robustness)?
8. Latency & Cost: the practical side of evaluation: time_to_first_token, total latency, $/query.
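One way to probe both pipeline-level items is sketched below. It follows the simplified idea in these notes (does the answer change once noise is injected?) rather than any library's exact metric definition. `generate` and `same_answer` are hypothetical callables standing in for the generation step and an answer-equivalence check (for example an LLM judge), and `timed` measures total latency only, not time to first token.

```python
import time
from typing import Any, Callable, Sequence, Tuple

# Hypothetical generator signature: (query, chunks) -> response text.
Generator = Callable[[str, Sequence[str]], str]


def noise_sensitivity(cases: Sequence[Tuple[str, Sequence[str]]],
                      noise_chunks: Sequence[str],
                      generate: Generator,
                      same_answer: Callable[[str, str], bool]) -> float:
    """Fraction of cases whose response changes once noise chunks are appended."""
    if not cases:
        return 0.0
    changed = 0
    for query, chunks in cases:
        clean = generate(query, list(chunks))
        noisy = generate(query, list(chunks) + list(noise_chunks))
        if not same_answer(clean, noisy):
            changed += 1
    return changed / len(cases)


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Lower is better here: 0.0 means no response changed under noise. Appending the noise at the end is one arbitrary choice; shuffling it into the context would also exercise position effects.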
Standard eval workflow:
1. Build a golden dataset of 100-500 examples, each a (query, expected_context_docs, ground_truth_answer) triple.
2. Run the RAG pipeline → log (query, retrieved, response).
3. Compute the RAGAS metrics with an LLM judge (GPT-4 or Claude).
4. Track each metric on a dashboard across prompt/chunking/model versions.
5. Add a regression suite to CI/CD.
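The regression step could be as small as a gate that compares the current run's aggregate scores with a stored baseline. The sketch below is an assumption about how such a gate might look: `find_regressions`, the metric names, and the 0.02 tolerance are illustrative, and every metric is assumed to be higher-is-better.

```python
from typing import Dict, Optional, Tuple


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that are missing from the current run or that dropped
    more than `tolerance` below the baseline (higher-is-better assumed)."""
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, base in baseline.items():
        score = current.get(name)
        if score is None or score < base - tolerance:
            regressions[name] = (base, score)
    return regressions


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    baseline = {"faithfulness": 0.90, "context_recall": 0.80}
    current = {"faithfulness": 0.85, "context_recall": 0.79}
    failed = find_regressions(baseline, current)
    for name, (base, score) in failed.items():
        print(f"REGRESSION {name}: baseline={base} current={score}")
    # In CI, a non-empty result would fail the job, e.g. raise SystemExit(1).
```

The tolerance exists because LLM-judged scores fluctuate between runs; a gate with zero slack tends to fail on noise rather than on real regressions.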
Tools:
- RAGAS (Python lib, de facto standard).
- DeepEval (metrics + pytest integration).
- TruLens (real-time monitoring + feedback).
- Arize Phoenix (eval + trace).
- LangSmith / Langfuse (trace + eval UI).
Caveat: LLM-as-judge has biases (position, self-preference, verbosity) → validate with periodic human spot-checks on a ~10% sample.
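A spot-check needs a sample and some way to compare judge verdicts with human ones. The sketch below draws a reproducible ~10% sample and computes raw agreement plus Cohen's kappa. Kappa is one common chance-corrected agreement statistic chosen here for illustration (the notes do not prescribe a specific one), and all function names are hypothetical.

```python
import random
from typing import Hashable, List, Sequence


def spot_check_sample(records: Sequence, fraction: float = 0.10, seed: int = 0) -> List:
    """Draw a reproducible random sample (at least one record) for human review."""
    if not records:
        return []
    k = min(len(records), max(1, round(len(records) * fraction)))
    return random.Random(seed).sample(list(records), k)


def agreement_rate(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Share of items where the LLM judge and the human gave the same label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_list, human_list = list(judge), list(human)
    labels = set(judge_list) | set(human_list)
    expected = sum((judge_list.count(label) / n) * (human_list.count(label) / n)
                   for label in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw agreement can look high when one label dominates, which is why a chance-corrected figure is worth tracking next to it.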