Benchmarks are standardized tests that pair a public dataset with a fixed metric so that models can be compared. Reading them correctly helps you pick the right model and avoid being misled by marketing.
Key benchmarks (2025):
A. General knowledge & reasoning:
MMLU (Massive Multitask Language Understanding) — Hendrycks 2020
- 57 subjects (law, medicine, history, math, CS...).
- Format: multiple choice (A/B/C/D).
- 15,908 questions.
- Metric: accuracy. Random baseline 25%, human expert ~90%.
- Reported scores (as of 2025): Claude 3.5 Sonnet ~88%, GPT-4o ~88%, Llama 3.3 70B ~86%.
- Limits: saturated at the top; the multiple-choice format doesn't test real-world tasks.
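The accuracy metric and the 25% random baseline above can be made concrete with a small sketch. The helper name `accuracy_with_ci` and the toy answer lists are illustrative assumptions, not part of any official MMLU tooling, and the interval uses a plain normal approximation.

```python
import math

def accuracy_with_ci(predictions, answers, z=1.96):
    """Return (accuracy, half-width of an approximate 95% interval)."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty lists of equal length")
    n = len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    acc = correct / n
    # Normal approximation to the binomial standard error.
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, half_width

# Hypothetical gold answers and model predictions for 8 four-option questions.
answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "C", "B", "A", "A", "B", "D", "D"]

acc, hw = accuracy_with_ci(predictions, answers)
random_baseline = 1 / 4  # four options, uniform guessing
print(f"accuracy={acc:.3f} +/- {hw:.3f}, random baseline={random_baseline:.2f}")
```

On a test set this small the interval is very wide, which is one reason a gap of a point or two between models on a small benchmark says little.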
MMLU-Pro — harder version, 10 choices, reasoning-heavy. GPT-4o ~72%, open models ~50–65%.
GPQA (Graduate-level Google-Proof Q&A) — Rein 2023
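- 448 PhD-level questions in biology, physics, and chemistry, written to be hard to answer even with web search.
- Domain experts ~65% (~74% after discounting clear mistakes), GPT-4 ~39%.
- GPQA Diamond subset — reasoning models (o1, o3) approach human expert accuracy.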
ARC-AGI — Chollet — tests abstract reasoning; in late 2024, o3 was the first model reported to approach human-level scores.
B. Math & Reasoning:
GSM8K — OpenAI 2021
- 8.5K elementary-school math word problems.
- Tests step-by-step reasoning.
- GPT-4 ~92%, Claude 3.5 Sonnet ~96%.
- Nearly saturated now.
MATH — Hendrycks 2021
- 12.5K competition-level problems (AMC, AIME).
- Much harder than GSM8K.
- GPT-4o ~76%, o1 ~94%.
AIME (American Invitational Mathematics Examination)
- 30 problems/year (two 15-question exams), olympiad-qualifier level.
- o1 84%, o3 96.7% — roughly the level of human olympiad medalists.
C. Code:
HumanEval — OpenAI 2021
- 164 Python function-completion problems.
- Metric: pass@k — the probability that at least one of k sampled solutions passes all unit tests.
- GPT-4o pass@1 ~90%, Claude 3.5 Sonnet pass@1 ~92%.
- Limits: small standalone functions, no real-codebase work; the problems have likely leaked into training data.
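The pass@k metric above is usually estimated from n samples per problem rather than exactly k. Here is a minimal sketch of the unbiased estimator described in the HumanEval paper, 1 - C(n-c, k) / C(n, k), where c is the number of samples that passed. The function name and the per-problem counts are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical results: (samples drawn, samples passing) for three problems.
results = [(10, 3), (10, 0), (10, 10)]

for k in (1, 5):
    # The benchmark score is the mean of the per-problem estimates.
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

Averaging per-problem estimates, rather than pooling all samples, keeps easy problems with many passing samples from dominating the score.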
MBPP (Mostly Basic Python Problems) — Google
- 974 simple Python problems.
SWE-bench — Princeton 2023
- 2,294 real GitHub issues and their fixes from 12 open-source Python repos.
- Tests end-to-end: read issue, navigate codebase, generate patch, pass tests.
- SWE-bench Verified — human-validated subset of 500 tasks.
- Reported scores on Verified (they depend heavily on the agent scaffold): Claude 3.7 Sonnet ~62%, GPT-4o ~30%, o3 ~70%+.
- Among the most realistic benchmarks for code agents.
LiveCodeBench, BigCodeBench — contamination-resistant variants.
D. Multi-task / Chatbot:
MT-Bench — UC Berkeley
- 80 multi-turn open-ended questions.
- Judged by GPT-4 (LLM-as-judge), 1–10 score.
- Claude 3.5 Sonnet 9.2, GPT-4o 9.1.
Chatbot Arena — LMSys
- Humans vote blind between the answers of two anonymous models.
- Produces Elo ratings.
- Reflects real, crowdsourced human preference.
- Widely regarded as one of the most trusted benchmarks. Top at the time of writing: o1, Claude 3.5 Sonnet, GPT-4o.
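To show how pairwise votes turn into a rating, here is a sketch of the classic Elo update. The model names, the starting rating of 1000, and the K-factor of 32 are assumptions for illustration. Ties are not handled, and the live leaderboard's actual computation may differ.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner after one vote."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    delta = k * surprise  # upsets move ratings more than expected wins
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical blind votes, each recorded as (winner, loser).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = {name: 1000.0 for name in ("model_a", "model_b", "model_c")}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order of votes, so with real data it is common to shuffle and average over several passes or to fit all votes jointly.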
E. Safety & alignment:
TruthfulQA — tests whether a model avoids repeating common misconceptions.
ToxiGen — toxicity, including implicit hate speech.
BBQ (Bias Benchmark for QA) — social bias in question answering.
JailbreakBench, HarmBench — adversarial safety.
F. Long context:
Needle-in-a-Haystack — insert a fact into a long context, then ask the model to recall it.
- Tests whether recall depends on where in the context the fact sits.
- Gemini 1.5 Pro is reported to pass ~99% at 1M tokens.
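The needle test above is simple to set up yourself. This sketch only builds the prompts and scores an answer; the filler text, the needle, and the substring-match scoring are assumptions, and the call to an actual model is left out.

```python
def build_needle_prompt(filler, needle, depth, question):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler))
    sentences = filler[:position] + [needle] + filler[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def recalled(model_output: str, expected: str) -> bool:
    """Crude scoring: does the expected string appear in the output?"""
    return expected.lower() in model_output.lower()

# Hypothetical haystack and needle.
filler = [f"Filler sentence number {i}." for i in range(8)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; sweeping depth reveals position-dependent recall.
prompts = {
    depth: build_needle_prompt(filler, needle, depth, question)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
print(prompts[0.5])
```

A full run repeats the sweep at several context lengths and reports recall per (length, depth) cell.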
RULER — harder version, many long-context tasks.
LongBench, InfiniteBench — diverse long-context eval.
G. Agent:
AgentBench — multi-environment agent testing.
WebArena, VisualWebArena — browser agents.
ToolBench — tool use.
τ-bench — customer service agents.
General limitations:
1. Contamination — models trained on web-scale data may have seen the test set, which inflates scores. This is hard to verify from the outside.
2. Saturation — on many benchmarks (MMLU, GSM8K, HumanEval) top models score around 90% or higher, so the benchmark no longer separates them.
3. Goodhart's law — once a metric becomes a target, it stops being a good measure. Researchers optimize for the benchmark, which does not necessarily improve real-world performance.
4. Format bias — multiple choice differs from open-ended; code completion differs from real codebase work.
5. Narrow domains — MMLU doesn't test code, HumanEval doesn't test long reasoning.
6. Static — benchmarks are fixed; real-world tasks are dynamic.
7. English-centric — most benchmarks are English only; performance in other languages is often noticeably weaker.
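For the contamination point, one common heuristic is to look for long word n-grams shared between a test item and training text. The sketch below assumes you have some corpus text to check against, which is rarely the case for closed models. The function names, the toy strings, and the small n are illustrative only.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)

# Hypothetical corpus snippet that happens to contain one test question.
corpus = "quiz dump: what is the capital of france answer paris"
leaked = "What is the capital of France"
clean = "Name the largest planet in the solar system"

# n=3 only because the toy strings are short; real checks use longer n-grams.
print(overlap_ratio(leaked, corpus, n=3))
print(overlap_ratio(clean, corpus, n=3))
```

A high ratio flags an item for manual review; a low ratio does not prove the item is clean, since paraphrased or translated copies will not match.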
How to use benchmarks correctly:
1. Triangulate — don't rely on one benchmark; check 5–10 that are relevant to your task.
2. Test on your task — build your own golden dataset that matches your production distribution.
3. Check contamination — prefer newer benchmarks (GPQA 2023 > MMLU 2020).
4. Relative, not absolute — compare model A against model B under the same setup rather than reading absolute scores.
5. Human eval on samples — automated benchmarks miss nuance.
6. Public leaderboards: HuggingFace Open LLM Leaderboard, Chatbot Arena, SWE-bench leaderboard.
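Points 2 and 4 combine naturally: score both models on the same golden set, then check whether the gap survives resampling. This is a sketch of a paired bootstrap; the per-item correctness lists are hypothetical, and the resample count and seed are arbitrary choices.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed accuracy gap A - B, share of resamples where A beats B)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    wins = 0
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each A/B pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] - correct_b[i] for i in idx) > 0:
            wins += 1
    return observed, wins / n_resamples

# Hypothetical per-item results on a 10-item golden set (1 = correct).
correct_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
correct_b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

gap, win_rate = paired_bootstrap(correct_a, correct_b)
print(f"gap={gap:+.2f}, A ahead in {win_rate:.0%} of resamples")
```

Pairing matters because both models see the same items, so item difficulty cancels out and only the disagreements drive the result.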