Well-known benchmark suites (MMLU, HumanEval, GSM8K): what do they evaluate, and what are their limitations?

Question

Luyện Phỏng Vấn IT · Accepted Answer

A benchmark is a standardized test with a public dataset and a fixed metric, used to compare models. Understanding benchmarks properly helps you choose a suitable model and avoid being misled by marketing.

The main benchmarks (2025):

A. General knowledge & reasoning:

MMLU (Massive Multitask Language Understanding), Hendrycks 2020
- 57 subjects (law, medicine, history, math, CS...).
- Format: multiple choice (A/B/C/D).
- 15,908 questions.
- Metric: accuracy. Random baseline 25%, human expert ~90%.
- Model scores (2025): Claude 3.5 Sonnet ~88%, GPT-4o ~88%, Llama 3.3 70B ~86%.
- Limits: saturated at the top; the multiple-choice format does not test real-world tasks.

MMLU-Pro: a harder version with 10 choices that requires more reasoning. GPT-4o ~72%, open models ~50-65%.

GPQA (Graduate-level Google-Proof Q&A), Rein 2023
- 448 PhD-level questions, written so that the answers cannot simply be looked up on Google.
- PhD-level experts ~65%, or ~74% after discounting clear mistakes they identified in retrospect; GPT-4 ~40%.
- GPQA Diamond subset: reasoning models (o1, o3) are starting to approach human level.

ARC-AGI (Chollet): tests abstract reasoning. As of 2024, o3 was the first model to come close to human level.

B. Math & Reasoning:

GSM8K, OpenAI 2021
- 8.5K grade-school math word problems.
- Tests step-by-step reasoning.
- GPT-4 ~92%, Claude 3.5 ~96%.
- Now close to saturated.

MATH, Hendrycks
- 12.5K competition-level problems (AMC, AIME).
- Much harder than GSM8K.
- GPT-4o ~76%, o1 ~94%.

AIME (American Invitational Mathematics Examination)
- 30 problems per year, at competition/olympiad level.
- o1 84%, o3 96.7%, roughly the level of human olympiad medalists.

C. Code:

HumanEval, OpenAI 2021
- 164 Python function-completion problems.
- Metric: pass@k, the probability that at least one of k sampled solutions passes the unit tests.
- GPT-4o pass@1 ~90%, Claude 3.5 ~92%.
- Limits: small standalone functions, so it does not test work in a real codebase; the problems have leaked into training data.

MBPP (Mostly Basic Python Problems), Google
- 974 simple Python problems.

SWE-bench, Princeton 2023
- 2,294 real GitHub issues, each paired with its fix, from open-source repos.
- Tests the task end to end: read the issue, navigate the codebase, generate a patch, pass the tests.
- SWE-bench Verified: a human-validated subset.
- Claude 3.7 Sonnet ~50%, GPT-4o ~30%, o3 ~70%+.
- Currently the most realistic benchmark for code agents.

LiveCodeBench, BigCodeBench: newer code benchmarks; LiveCodeBench in particular keeps collecting fresh problems over time to resist contamination.

D. Multi-task / Chatbot:

MT-Bench, UC Berkeley
- 80 multi-turn open-ended questions.
- Judged by GPT-4 (LLM-as-judge), scored 1-10.
- Claude 3.5 Sonnet 9.2, GPT-4o 9.1.

Chatbot Arena, LMSYS
- Humans cast blind pairwise votes between two models.
- Produces an Elo rating.
- Reflects real, crowdsourced human preference.
- Currently the most widely trusted benchmark. Current top: o1, Claude 3.5 Sonnet, GPT-4o.

E. Safety & alignment:

- TruthfulQA: tests whether the model avoids repeating common misconceptions instead of hallucinating.
- ToxiGen: toxicity generation.
- BBQ: bias.
- JailbreakBench, HarmBench: adversarial safety.

F. Long context:

Needle-in-a-Haystack: insert a fact into a long context, then ask the model to recall it.
- Tests whether retrieval holds up at different positions in the context.
- Gemini 1.5 Pro passes 99% at 1M tokens.

RULER: a harder version with multiple long-context tasks.

LongBench, InfiniteBench: diverse long-context evals.

G. Agent:

- AgentBench: multi-environment agent test.
- WebArena, VisualWebArena: browser agents.
- ToolBench: tool use.
- τ-bench: customer service agents.

General limitations of benchmarks:

1. Contamination: a model trained on internet data may have seen the test set, which inflates its score. This is hard to verify.
2. Saturation: on many benchmarks (MMLU, GSM8K, HumanEval) top models score above 90%, so the benchmark no longer discriminates between them.
3. Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Researchers optimize for the benchmark, which does not necessarily improve real-world performance.
4. Format bias: multiple choice differs from open-ended answering; code completion differs from working in a real codebase.
5. Narrow domain: MMLU does not test code; HumanEval does not test long reasoning.
6. Static: benchmarks are fixed, while real-world tasks are dynamic.
7. English-centric: most benchmarks are English only; performance in other languages is considerably worse.

How to use benchmarks properly:

1. Triangulate: do not rely on a single benchmark; look at 5-10 that are relevant to your task.
2. Test on your own task: build a golden dataset that follows your production distribution.
3. Check contamination: prefer newer benchmarks (GPQA 2023 > MMLU 2020).
4. Relative, not absolute: compare model A against model B rather than reading a score as an absolute value.
5. Human eval on a sample: automated benchmarks miss nuance.
6. Consult public leaderboards: Hugging Face Open LLM Leaderboard, Chatbot Arena, SWE-bench leaderboard.
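
The pass@k metric described under HumanEval is usually estimated by sampling n solutions per problem and counting how many of them, c, pass the unit tests. Below is a minimal sketch, assuming the unbiased estimator 1 - C(n-c, k)/C(n, k) from the original HumanEval paper. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions, not part of any official harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts: 3 problems, 10 samples each, with 0 / 3 / 10 passing.
results = [(10, 0), (10, 3), (10, 10)]
print(round(mean_pass_at_k(results, k=1), 3))  # 0.433
print(round(mean_pass_at_k(results, k=5), 3))  # 0.639
```

With k=1 the estimator reduces to c/n, the plain fraction of passing samples. Larger k rewards a model that solves a problem only occasionally, which is why pass@1 and pass@10 figures should not be compared directly.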