Red teaming LLM là gì? Quy trình và testing framework?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Red teaming = chủ động tấn công LLM để phát hiện failure mode, vulnerability, harmful behavior trước khi ship production. Pattern mượn từ security. Mục tiêu: 1. Phát hiện harmful output — bias, toxic, illegal, misinfo. 2. Tìm jailbreak và prompt injection vulnerability. 3. Test edge case — ngôn ngữ hiếm, input dài, encoding lạ, ambiguity. 4. Kiểm tool/agent safety — agent có làm irreversible action sai không. 5. Probe PII leak / secret exposure. 6. Assess robustness dưới adversarial input. Categories của harm (theo NIST AI RMF, OWASP LLM Top 10): - CBRN (Chemical, Biological, Radiological, Nuclear) info. - Cybercrime / malware generation. - Self-harm, suicide instruction. - Hate speech, discrimination. - CSAM / sexual content. - Political misinformation, election interference. - Financial scam, phishing. - Privacy violation (PII extraction). - Copyright infringement. Quy trình red teaming: Giai đoạn 1: Threat modeling - Xác định attack surface: user types, tools, data sources, outputs. - Liệt kê scenarios cụ thể cần test (ví dụ: "user hỏi về self-harm", "indirect prompt injection qua email attachment"). - Prioritize theo risk × likelihood. Giai đoạn 2: Test case generation - Manual — expert red teamer viết prompt tấn công (hiệu quả cao, chi phí nhân sự). - Automated — tool sinh adversarial prompt: - PAIR, TAP — iterative attack model optimize prompt đến khi jailbreak. - GCG — gradient-based adversarial suffix. - Curator datasets: AdvBench, HarmBench, JailbreakBench, DoNotAnswer. - Crowd-sourced — bug bounty (OpenAI, Anthropic có program). Giai đoạn 3: Execution - Chạy test case qua model → capture response. - Scale: single prompt, multi-turn conversation, context với injection, tool use scenarios. - Multilingual: tấn công bằng ngôn ngữ ít được align. Giai đoạn 4: Scoring - Binary: pass/fail (refused harmful ≠ fail, complied = fail). - Judge: LLM-as-judge (Llama Guard, harmfulness classifier) hoặc human rate severity 1-5. - Attack Success Rate (ASR) — % test cases jailbreak được. Giai đoạn 5: Mitigation - Pattern failure → fix: - Thêm training data refuse pattern (RLHF red team data). - Cập nhật system prompt. - Thêm guardrail (Llama Guard input/output). - Rate limit, monitoring. - Re-test cycle. Giai đoạn 6: Continuous - Mỗi deploy prompt/model chạy lại red team suite. - Add new attack discovered từ production abuse reports. - Monthly/quarterly full red team exercise. Frameworks & tools: - Garak (NVIDIA) — scanner với hàng trăm probe (jailbreak, toxicity, leak). - PyRIT (Microsoft) — framework red teaming với attack planner. - AIF360 (IBM), LangKit — bias/fairness testing. - Prompt Fuzzer — dynamic generation. - HarmBench, JailbreakBench, AdvBench — academic benchmark. - Protect AI, Lakera Red — commercial red team service. Multi-modal red teaming: với VLM, test image-based injection (text ẩn trong ảnh hỏi model bỏ qua system prompt), typographic attack (ảnh chữ đánh lừa classification). Red teaming text-only miss những tấn công này. Org structure: team red team độc lập (không phải team build) — tránh conflict of interest. Report trực tiếp lên security/compliance, không qua product. Công ty lớn (OpenAI, Anthropic, Google DeepMind) có dedicated team 20-50 người + external partners.