How do you detect and measure hallucination in LLM output?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Hallucination is when an LLM generates information that sounds plausible but is wrong or is not supported by the context or source. There are two main types: intrinsic (contradicts the provided context; the most serious kind for RAG) and extrinsic (asserts a fact that is not in the context and cannot be verified from it).

Detection methods:

1. Faithfulness / groundedness check (for RAG): LLM-as-judge. Give a judge model the (context, response) pair and ask: "Is each claim in the response supported by the context?" Decompose the response into claims and check each one. Tools: RAGAS faithfulness, TruLens groundedness, DeepEval. Metric: % of claims grounded.
2. Self-consistency / SelfCheckGPT: sample N responses to the same question (high temperature) and compare them for consistency. If the responses contradict each other, the likelihood of hallucination is high. No reference needed.
3. Reference comparison: if ground truth is available, use an automated metric (BERTScore, ROUGE) or an LLM judge to compare the response against it. Note that BERTScore and ROUGE measure similarity to the reference rather than factual correctness, so they are only a proxy. Metrics: answer correctness, factual recall.
4. Entailment-based (NLI model): a small NLI model (DeBERTa fine-tuned on MNLI) checks whether each sentence of the response is entailed by the context. Fast and cheap; no LLM needed.
5. Token-level probability: hallucination often comes with low token probability (the model is less confident). Monitor the mean log-prob of the response; a drop is a warning sign. Limitation: it requires access to token log-probabilities, which self-hosted models provide but only some hosted APIs expose.
6. Named entity verification: extract entities from the response (names, numbers, dates) and cross-check them against the context or a knowledge base. Many hallucinations involve specific figures.
7. Retrieval-augmented verification (FActScore, SAFE): decompose the response into atomic claims → for each claim, retrieve evidence from the web or a knowledge base → have a judge model verify it. Precise but expensive.

Mitigation (not just detection):

- Tighter prompt: "answer only from the context; if the information is not there, say 'I don't know'".
- Better RAG: hybrid search + rerank + high-quality context.
- Low temperature (0-0.3) for factual tasks.
- Citation-required: require every claim to carry a [source:id].
- Two passes: generate → self-critique → correct.
- Fine-tune on a dataset with good grounding.
- Stronger models (GPT-4, Claude 3.5) generally hallucinate less than small models.

Measuring the system in production: sample random responses daily, score faithfulness automatically, and track it on a weekly dashboard. Alert when the score drops by more than X%.
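The production loop described above (sample, score, alert on a drop) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names `sample_responses`, `faithfulness`, and `should_alert` are hypothetical, the per-claim verdicts are assumed to come from whatever judge you already run, and the threshold `max_drop` stands in for the X above.

```python
import random
from statistics import mean


def sample_responses(responses, k, seed=None):
    """Pick up to k logged responses uniformly at random for scoring."""
    rng = random.Random(seed)
    return rng.sample(responses, min(k, len(responses)))


def faithfulness(claim_verdicts):
    """Fraction of claims in one response that the judge marked as grounded."""
    if not claim_verdicts:
        # Assumption: a response that makes no claims counts as fully faithful.
        return 1.0
    return sum(claim_verdicts) / len(claim_verdicts)


def should_alert(baseline_scores, current_scores, max_drop):
    """True when mean faithfulness fell by more than max_drop vs. the baseline."""
    return mean(baseline_scores) - mean(current_scores) > max_drop


# Hypothetical judge output: one list of per-claim verdicts per sampled response.
last_week = [faithfulness(v) for v in ([True, True], [True, True, True, False])]
this_week = [faithfulness(v) for v in ([True, False], [True, False, False, False])]
alert = should_alert(last_week, this_week, max_drop=0.1)
```

Here the drop is expressed in absolute score points; a relative drop would work equally well as long as the dashboard and the alert use the same definition.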
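The self-consistency idea (method 2 above) can be illustrated with a toy score over N sampled answers. Note the simplification: this sketch uses token-level Jaccard overlap as the similarity measure, which is an assumption chosen to keep the example dependency-free. It is not how SelfCheckGPT scores consistency, and in practice you would swap in an NLI model or an LLM judge. The names `jaccard` and `consistency_score` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap between two answers (crude stand-in for semantic similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_score(samples):
    """Mean pairwise similarity across sampled answers; lower means less consistent."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


stable = consistency_score(["the capital is Paris"] * 3)
unstable = consistency_score(
    ["the capital is Paris", "the capital is Lyon", "the capital is Nice"]
)
```

A low score only flags the answer for closer inspection; a model can also be consistently wrong, so this signal works best alongside a groundedness check.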
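For the entity check (method 6 above), the numeric case is the easiest to sketch, since many hallucinations involve specific figures. The example below is a rough illustration under stated assumptions: it only handles numbers, it matches them as exact strings (so "1,000" and "1000" would not match), and `unsupported_numbers` is a hypothetical helper name. Names and dates would need a proper NER step.

```python
import re

# Digits with optional decimal or thousands separators, e.g. 15, 2.4, 1,000.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(response, context):
    """Return numbers that appear in the response but nowhere in the context."""
    known = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(response) if n not in known]


flagged = unsupported_numbers(
    "Revenue grew 15% to 2.4 million in 2023.",
    "In 2023 revenue reached 2.4 million.",
)
```

In this example `flagged` contains only the growth figure, because the context never mentions it.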