A benchmark is a standardized test with a public dataset and a fixed metric, used to compare models. Understanding benchmarks correctly helps you pick the right model for your needs and avoid being misled by marketing.
The main benchmarks (2025):
A. General knowledge & reasoning:
MMLU (Massive Multitask Language Understanding) — Hendrycks 2020
- 57 subjects (law, medicine, history, math, CS, ...).
- Format: multiple choice (A/B/C/D).
- 15,908 questions.
- Metric: accuracy. Random baseline is 25%; human experts score ~90% (scoring sketch below).
- 2025 model scores: Claude 3.5 Sonnet ~88%, GPT-4o ~88%, Llama 3.3 70B ~86%.
- Limitation: saturated at the top; the multiple-choice format does not test real-world tasks.
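A minimal sketch of how multiple-choice accuracy is typically scored; the question format and the `query_model` callable are assumptions for illustration, not the official MMLU harness.

```python
import re

def grade_multiple_choice(questions, query_model):
    """Score A/B/C/D questions by exact-match accuracy.

    Assumed format: each question is a dict with "prompt" (question text plus
    the four lettered options) and "answer" (one of "A"-"D"); `query_model`
    is any callable that returns the model's raw text reply.
    """
    correct = 0
    for q in questions:
        reply = query_model(q["prompt"] + "\nAnswer with a single letter (A-D).")
        match = re.search(r"\b([ABCD])\b", reply)      # take the first letter the model commits to
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)                    # accuracy; random baseline = 0.25
```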
MMLU-Pro: a harder version with 10 answer choices that requires more reasoning. GPT-4 ~72%, open models ~50-65%.
GPQA (Graduate-level Google-Proof Q&A) — Rein 2023
- 448 PhD-level questions, written so the answers cannot be found by Googling.
- Human PhD ~74%, GPT-4 ~40%.
- GPQA Diamond subset: reasoning models (o1, o3) are starting to approach human performance.
ARC-AGI (Chollet): tests abstract reasoning; as of late 2024, o3 was the first model to come close to human performance.
B. Math & Reasoning:
GSM8K — OpenAI 2021
- 8.5K grade-school math word problems.
- Tests step-by-step reasoning (final-answer grading sketched below).
- GPT-4 ~92%, Claude 3.5 ~96%.
- Now close to saturated.
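GSM8K reference solutions end with "#### <answer>", and grading is usually exact match on the final number the model produces. A minimal sketch, assuming the model's reply is free-form text:

```python
import re

def extract_final_number(text):
    """Return the last number in a free-text reply, stripping '$' and commas."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text.replace("$", ""))
    return numbers[-1].replace(",", "") if numbers else None

def grade_gsm8k(model_reply, reference_solution):
    # Reference solutions end with "#### <answer>"; compare against the
    # model's final number, ignoring formatting.
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    return extract_final_number(model_reply) == gold
```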
MATH — Hendrycks
- 12.5K competition-level problems (AMC, AIME).
- Much harder than GSM8K.
- GPT-4o ~76%, o1 ~94%.
AIME (American Invitational Mathematics Examination)
- 30 problems per year, olympiad-qualifier competition level.
- o1 84%, o3 96.7%, on par with human olympiad medalists.
C. Code:
HumanEval — OpenAI 2021
- 164 Python function-completion problems.
- Metric: pass@k, the probability that at least one of k sampled solutions passes the unit tests (estimator sketch below).
- GPT-4o pass@1 ~90%, Claude 3.5 ~92%.
- Limitation: small isolated functions, no real-codebase work; has leaked into training data.
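The unbiased pass@k estimator from the HumanEval paper: generate n samples per problem, count the c that pass the tests, and compute the probability that a random size-k subset contains at least one pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated for the problem, c: samples that passed the unit
    tests, k: evaluation budget (k <= n).
    """
    if n - c < k:            # fewer than k failures: any size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The benchmark score is the mean over problems, e.g. with n = 200 samples each:
# score = sum(pass_at_k(200, c_i, k=1) for c_i in passes_per_problem) / len(passes_per_problem)
```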
MBPP (Mostly Basic Python Problems) — Google
- 974 simple Python problems.
SWE-bench — Princeton 2023
- 2,294 real GitHub issues plus their fixes from open-source repos.
- Tests end-to-end: read the issue, navigate the codebase, generate a patch, pass the tests (harness sketch below).
- SWE-bench Verified — human-validated subset.
- Claude 3.7 Sonnet ~50%, GPT-4o ~30%, o3 ~70%+.
- The most realistic benchmark for coding agents.
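A rough sketch of the evaluation loop: apply the model-generated patch and run the repository's tests. The real SWE-bench harness pins a per-instance environment and checks specific FAIL_TO_PASS / PASS_TO_PASS tests, so treat this only as an outline of the idea.

```python
import subprocess

def evaluate_patch(repo_dir, patch_text, test_cmd=("pytest", "-q")):
    """Apply a model-generated patch, then run the repo's tests.

    Sketch only: assumes the repo is already checked out at the right commit
    with its dependencies installed.
    """
    applied = subprocess.run(["git", "apply", "-"], input=patch_text, text=True,
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False                          # the patch did not even apply cleanly
    tests = subprocess.run(list(test_cmd), cwd=repo_dir, capture_output=True)
    return tests.returncode == 0              # resolved iff the tests pass
```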
LiveCodeBench, BigCodeBench — contamination-resistant variants.
D. Multi-task / Chatbot:
MT-Bench — UC Berkeley
- 80 multi-turn, open-ended questions.
- Judged by GPT-4 (LLM-as-a-judge) on a 1-10 scale (judge-prompt sketch below).
- Claude 3.5 Sonnet 9.2, GPT-4o 9.1.
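A sketch of the LLM-as-a-judge pattern: wrap the question and answer in a grading prompt, send it to a strong judge model, and parse the 1-10 rating. The prompt wording and the `judge` callable are assumptions, not MT-Bench's exact template.

```python
import re

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale of 1 to 10 for helpfulness, accuracy, and depth.
End your reply with: Rating: [[score]]

[Question]
{question}

[Assistant's answer]
{answer}
"""

def judge_answer(question, answer, judge):
    """`judge` is any callable that sends a prompt to the judge model (e.g. GPT-4)
    and returns its text reply."""
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)   # parse "Rating: [[8]]"
    return float(match.group(1)) if match else None
```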
Chatbot Arena — LMSys
- Humans cast blind pairwise votes between two models.
- Produces an Elo rating (update sketch below).
- Reflects real, crowdsourced human preference.
- Currently the most widely trusted benchmark. Current top: o1, Claude 3.5 Sonnet, GPT-4o.
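A sketch of the classic sequential Elo update applied to one pairwise vote. Note that Chatbot Arena's published ratings are fit with a Bradley-Terry model over all votes rather than updated one vote at a time, so this is only illustrative.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update from a single A-vs-B vote; winner is "A", "B", or "tie"."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))   # predicted win probability of A
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    r_a += k * (score_a - expected_a)
    r_b += k * ((1 - score_a) - (1 - expected_a))
    return r_a, r_b

# Starting both models at 1000, a single win for A moves the ratings to (1016, 984):
# elo_update(1000, 1000, "A")
```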
E. Safety & alignment:
TruthfulQA: tests whether the model avoids hallucinating on common misconceptions.
ToxiGen — toxicity generation.
BBQ — bias.
JailbreakBench, HarmBench — adversarial safety.
F. Long context:
Needle-in-a-Haystack: insert a fact into a long context and ask the model to recall it (prompt-construction sketch below).
- Tests recall at different positions in the context window.
- Gemini 1.5 Pro passes ~99% at 1M tokens.
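A minimal sketch of building one needle-in-a-haystack test case; the filler text, the needle sentence, and the question are placeholders you would replace with your own.

```python
def build_needle_prompt(haystack_chunks, needle, depth):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the haystack."""
    pos = int(depth * len(haystack_chunks))
    chunks = haystack_chunks[:pos] + [needle] + haystack_chunks[pos:]
    question = "What is the magic number mentioned in the document? Answer with the number only."
    return "\n".join(chunks) + "\n\n" + question

# Placeholder example: a run passes if the model's reply contains "4217";
# sweep `depth` and the haystack length to produce the usual recall heat map.
# filler = ["Some unrelated filler paragraph."] * 10_000
# prompt = build_needle_prompt(filler, "The magic number is 4217.", depth=0.35)
```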
RULER: a harder variant with many long-context task types.
LongBench, InfiniteBench — diverse long-context eval.
G. Agent:
AgentBench — multi-environment agent test.
WebArena, VisualWebArena — browser agent.
ToolBench — tool use.
τ-bench — customer service agent.
Common limitations of benchmarks:
1. Contamination: models trained on internet data may have seen the test set, inflating scores. Hard to verify (an n-gram overlap check is sketched after this list).
2. Saturation: on many benchmarks (MMLU, GSM8K, HumanEval) top models score above 90%, so they no longer discriminate.
3. Goodhart's law: once a metric becomes a target, it stops being a good measure. Researchers optimize for the benchmark, which does not necessarily improve real-world performance.
4. Format bias: multiple choice differs from open-ended answering; code completion differs from real codebase work.
5. Narrow domains: MMLU does not test code, HumanEval does not test long-form reasoning.
6. Static: benchmarks are fixed while real-world tasks keep changing.
7. English-centric: most benchmarks are English-only; performance in other languages is noticeably worse.
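One crude way to screen for contamination (limitation 1) is n-gram overlap between benchmark items and the training corpus, roughly in the spirit of the 13-gram overlap checks reported for some LLMs. A minimal in-memory sketch:

```python
def ngrams(text, n=13):
    """Lower-cased word n-grams; 13 words is a commonly used window size."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_items(benchmark_items, corpus_docs, n=13):
    """Return benchmark items that share any n-gram with the training corpus.

    Sketch only: real decontamination pipelines hash n-grams and stream
    terabyte-scale corpora instead of keeping every n-gram in memory.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
```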
How to use benchmarks correctly:
1. Triangulate: don't rely on a single benchmark; look at 5-10 that are relevant to your task.
2. Test on your task: build your own golden dataset drawn from the production distribution (harness sketch after this list).
3. Check contamination: prefer newer benchmarks (GPQA 2023 over MMLU 2020).
4. Relative, not absolute: compare model A against model B rather than reading absolute scores.
5. Human eval on a sample: automated benchmarks miss nuance.
6. Public leaderboards: HuggingFace Open LLM Leaderboard, Chatbot Arena, SWE-bench leaderboard.
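A minimal sketch of point 2, testing on your own task: run each candidate model over a small golden dataset drawn from production and score them side by side. The JSONL format, the `models` dict, and the `grade` callable are assumptions for illustration.

```python
import json

def run_golden_eval(golden_path, models, grade):
    """Compare candidate models on your own golden dataset.

    golden_path: JSONL file with one {"input": ..., "expected": ...} per line
    models:      dict mapping model name -> callable(prompt) -> reply text
    grade:       callable(reply, expected) -> bool (exact match, rubric, LLM judge, ...)
    """
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]
    scores = {}
    for name, ask in models.items():
        passed = sum(grade(ask(case["input"]), case["expected"]) for case in cases)
        scores[name] = passed / len(cases)
    return scores   # e.g. {"candidate_a": 0.81, "candidate_b": 0.74}
```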