Red teaming = chủ động tấn công LLM để phát hiện failure mode, vulnerability, harmful behavior trước khi ship production. Pattern mượn từ security.
Mục tiêu:
1. Phát hiện harmful output — bias, toxic, illegal, misinfo.
2. Tìm jailbreak và prompt injection vulnerability.
3. Test edge case — ngôn ngữ hiếm, input dài, encoding lạ, ambiguity.
4. Kiểm tool/agent safety — agent có làm irreversible action sai không.
5. Probe PII leak / secret exposure.
6. Assess robustness dưới adversarial input.
Categories của harm (theo NIST AI RMF, OWASP LLM Top 10):
- CBRN (Chemical, Biological, Radiological, Nuclear) info.
- Cybercrime / malware generation.
- Self-harm, suicide instruction.
- Hate speech, discrimination.
- CSAM / sexual content.
- Political misinformation, election interference.
- Financial scam, phishing.
- Privacy violation (PII extraction).
- Copyright infringement.
Quy trình red teaming:
Giai đoạn 1: Threat modeling
- Xác định attack surface: user types, tools, data sources, outputs.
- Liệt kê scenarios cụ thể cần test (ví dụ: "user hỏi về self-harm", "indirect prompt injection qua email attachment").
- Prioritize theo risk × likelihood.
Giai đoạn 2: Test case generation
- Manual — expert red teamer viết prompt tấn công (hiệu quả cao, chi phí nhân sự).
- Automated — tool sinh adversarial prompt:
- PAIR, TAP — iterative attack model optimize prompt đến khi jailbreak.
- GCG — gradient-based adversarial suffix.
- Curator datasets: AdvBench, HarmBench, JailbreakBench, DoNotAnswer.
- Crowd-sourced — bug bounty (OpenAI, Anthropic có program).
Giai đoạn 3: Execution
- Chạy test case qua model → capture response.
- Scale: single prompt, multi-turn conversation, context với injection, tool use scenarios.
- Multilingual: tấn công bằng ngôn ngữ ít được align.
Giai đoạn 4: Scoring
- Binary: pass/fail (refused harmful ≠ fail, complied = fail).
- Judge: LLM-as-judge (Llama Guard, harmfulness classifier) hoặc human rate severity 1-5.
- Attack Success Rate (ASR) — % test cases jailbreak được.
Giai đoạn 5: Mitigation
- Pattern failure → fix:
- Thêm training data refuse pattern (RLHF red team data).
- Cập nhật system prompt.
- Thêm guardrail (Llama Guard input/output).
- Rate limit, monitoring.
- Re-test cycle.
Giai đoạn 6: Continuous
- Mỗi deploy prompt/model chạy lại red team suite.
- Add new attack discovered từ production abuse reports.
- Monthly/quarterly full red team exercise.
Frameworks & tools:
- Garak (NVIDIA) — scanner với hàng trăm probe (jailbreak, toxicity, leak).
- PyRIT (Microsoft) — framework red teaming với attack planner.
- AIF360 (IBM), LangKit — bias/fairness testing.
- Prompt Fuzzer — dynamic generation.
- HarmBench, JailbreakBench, AdvBench — academic benchmark.
- Protect AI, Lakera Red — commercial red team service.
Multi-modal red teaming: với VLM, test image-based injection (text ẩn trong ảnh hỏi model bỏ qua system prompt), typographic attack (ảnh chữ đánh lừa classification). Red teaming text-only miss những tấn công này.
Org structure: team red team độc lập (không phải team build) — tránh conflict of interest. Report trực tiếp lên security/compliance, không qua product. Công ty lớn (OpenAI, Anthropic, Google DeepMind) có dedicated team 20-50 người + external partners.
Red teaming = proactively attacking an LLM to discover failure modes, vulnerabilities, and harmful behaviors before shipping. Borrowed from security practice.
Goals:
1. Surface harmful outputs — bias, toxicity, illegal content, misinformation.
2. Find jailbreak and prompt injection vulnerabilities.
3. Test edge cases — rare languages, long input, odd encodings, ambiguity.
4. Check tool/agent safety — does the agent perform irreversible wrong actions.
5. Probe PII leaks / secret exposure.
6. Assess robustness under adversarial input.
Harm categories (per NIST AI RMF, OWASP LLM Top 10):
- CBRN (Chemical, Biological, Radiological, Nuclear) info.
- Cybercrime / malware generation.
- Self-harm, suicide instructions.
- Hate speech, discrimination.
- CSAM / sexual content.
- Political misinformation, election interference.
- Financial scams, phishing.
- Privacy violation (PII extraction).
- Copyright infringement.
Red-teaming process:
Phase 1: Threat modeling
- Identify the attack surface: user types, tools, data sources, outputs.
- Enumerate concrete scenarios to test ("user asks about self-harm", "indirect prompt injection via email attachment").
- Prioritize by risk × likelihood.
Phase 2: Test case generation
- Manual — expert red teamers craft attack prompts (high effectiveness, labor cost).
- Automated — tools generate adversarial prompts:
- PAIR, TAP — iterative attacker models optimize prompts until jailbreak.
- GCG — gradient-based adversarial suffix.
- Curated datasets: AdvBench, HarmBench, JailbreakBench, DoNotAnswer.
- Crowd-sourced — bug bounties (OpenAI, Anthropic run programs).
Phase 3: Execution
- Run test cases through the model → capture responses.
- Scale: single prompt, multi-turn conversations, context with injection, tool-use scenarios.
- Multilingual: attacks in under-aligned languages.
Phase 4: Scoring
- Binary: pass/fail (refused harmful = pass, complied = fail).
- Judge: LLM-as-judge (Llama Guard, harmfulness classifier) or human severity 1–5.
- Attack Success Rate (ASR) — % test cases that jailbreak.
Phase 5: Mitigation
- Recurring failure patterns → fix:
- Add RLHF red-team training data to refuse them.
- Update the system prompt.
- Add guardrails (Llama Guard input/output).
- Rate limit, monitoring.
- Re-test cycle.
Phase 6: Continuous
- Every prompt/model deploy re-runs the red team suite.
- Add new attacks discovered via production abuse reports.
- Monthly/quarterly full red team exercises.
Frameworks & tools:
- Garak (NVIDIA) — scanner with hundreds of probes (jailbreak, toxicity, leak).
- PyRIT (Microsoft) — red teaming framework with an attack planner.
- AIF360 (IBM), LangKit — bias/fairness testing.
- Prompt Fuzzer — dynamic generation.
- HarmBench, JailbreakBench, AdvBench — academic benchmarks.
- Protect AI, Lakera Red — commercial red team services.
Multi-modal red teaming: with VLMs, test image-based injection (hidden text in images asking the model to ignore the system prompt), typographic attacks (text-in-image fooling classifiers). Text-only red teaming misses these.
Org structure: keep the red team separate from the build team — avoid conflict of interest. Report to security/compliance, not product. Large companies (OpenAI, Anthropic, Google DeepMind) have dedicated teams of 20–50 + external partners.