What are guardrails for LLMs? How do you implement input/output guardrails?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Guardrails are control layers that sit outside the LLM and make sure inputs and outputs meet safety, quality, and compliance criteria before anything reaches the user. Think of them as brakes: they don't make the model smarter, they only stop the bad cases.

Input guardrails (before calling the LLM):
- PII detection/redaction: Presidio, spaCy, or regex for national ID numbers, credit card numbers, and emails; redact before the text is sent to the LLM.
- Prompt injection detection: heuristics (spotting phrases like "ignore previous...") plus a classifier model (Rebuff, LLM Guard).
- Topic/scope filter: a classifier or embedding similarity to reject out-of-scope queries (for example, a finance bot being asked medical questions).
- Toxicity / profanity: Perspective API, Detoxify.
- Rate limit / quota per user and per IP.
- Length check: reject overly long prompts (budget-exhaustion attacks).

Output guardrails (after the LLM responds):
- Schema validation: Pydantic, Zod, JSON Schema; reject output that does not match the expected format, then retry.
- Content safety: Perspective, Azure Content Safety, Llama Guard (classifies harm categories: hate, self-harm, sexual, violence).
- PII leak check: scan the output for PII that must not be shown.
- Hallucination / faithfulness: for RAG, a second LLM checks the output against the retrieved context; reject or regenerate if faithfulness is low.
- Topic adherence: check that the output stays on task (classifier or judge LLM).
- Jailbreak check: verify the model did not reveal the system prompt or take actions outside its scope.
- Competitor mention filter: regex or classifier that blocks competitor mentions in the response.

Frameworks / tools:
- NeMo Guardrails (NVIDIA): a DSL (Colang) for defining dialogue flows and rails.
- Guardrails AI: a Python library with built-in validators (PII, toxicity, format) and the RAIL spec.
- LLM Guard: a collection of input/output scanners.
- Llama Guard (Meta): a model specialized in detecting harm categories.

Deployment patterns:
1. Fail fast: input guardrails reject early, which saves LLM cost.
2. Retry strategy: when an output guardrail fails, retry with a corrected prompt (at most 1-2 times) before returning an error.
3. Fallback response: keep a safe default answer ready ("I can't help with that").
4. Log & monitor: log every guardrail trigger so the rules can be tuned; alert when the block rate spikes (possible abuse).
5. Defense in depth: run several layers together; never rely on a single layer.
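To make the patterns above concrete, here is a minimal sketch of the flow: fail fast on input, validate output, retry, then fall back. It uses only the Python standard library. Every name in it (`check_input`, `check_output`, `guarded_call`, `MAX_PROMPT_CHARS`, the `{"answer": ...}` schema) is hypothetical and chosen for illustration. The regexes are deliberately crude stand-ins for tools like Presidio or LLM Guard, and `llm` is assumed to be any callable that maps a prompt string to a response string.

```python
import json
import re
from typing import Callable, Optional

MAX_PROMPT_CHARS = 4000               # assumed budget; tune per application
FALLBACK = "I can't help with that."  # safe default response

# Crude illustrative patterns; real systems would use a dedicated PII/injection tool.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
INJECTION_RE = re.compile(r"ignore (?:all |the )?previous", re.IGNORECASE)


def check_input(prompt: str) -> Optional[str]:
    """Input guardrail: return a redacted prompt, or None if it is rejected."""
    if len(prompt) > MAX_PROMPT_CHARS:   # length check
        return None
    if INJECTION_RE.search(prompt):      # heuristic injection check
        return None
    redacted = EMAIL_RE.sub("[EMAIL]", prompt)
    return CARD_RE.sub("[CARD]", redacted)


def check_output(raw: str) -> Optional[dict]:
    """Output guardrail: schema validation plus a PII leak scan."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return None
    if EMAIL_RE.search(data["answer"]) or CARD_RE.search(data["answer"]):
        return None
    return data


def guarded_call(prompt: str, llm: Callable[[str], str], max_retries: int = 2) -> str:
    """Fail fast on input, retry on bad output, fall back if all attempts fail."""
    safe_prompt = check_input(prompt)
    if safe_prompt is None:
        return FALLBACK                  # rejected before any LLM cost is incurred
    attempt_prompt = safe_prompt
    for _ in range(1 + max_retries):
        data = check_output(llm(attempt_prompt))
        if data is not None:
            return data["answer"]
        # Retry with a corrective instruction appended to the original prompt.
        attempt_prompt = safe_prompt + '\nReply only with JSON like {"answer": "..."}.'
    return FALLBACK
```

In a real deployment each rejection branch would also write a log entry (pattern 4), and the regex checks would be one layer among several rather than the only defense (pattern 5).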