AI Benchmarks Explained

What each metric in the AI Value Index measures and why it matters. We track 28 benchmarks across 8 categories.

General Intelligence Benchmarks

7 metrics

Broad knowledge and reasoning benchmarks that test general intelligence across many domains.

Coding Benchmarks

6 metrics

Metrics that evaluate code generation, understanding, and real-world software engineering capabilities.

BigCodeBench

Comprehensive code generation across diverse programming tasks

Top model: GPT-4o (61.1%)
Measured in: Percentage (%). Higher is better.

Mathematics Benchmarks

3 metrics

Mathematical problem-solving benchmarks ranging from competition-level to graduate-level difficulty.

Reasoning Benchmarks

4 metrics

Benchmarks focused on logical reasoning, scientific understanding, and complex problem decomposition.

Speed & Latency Benchmarks

2 metrics

Performance metrics measuring how quickly a model responds and generates output tokens.
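The two numbers behind this category are time to first token (TTFT) and output tokens per second. Below is a minimal sketch of how they can be measured, assuming a hypothetical stream_tokens(prompt) generator that stands in for any provider's streaming API and yields at least one token:

    import time

    def measure_latency(stream_tokens, prompt):
        # stream_tokens is a placeholder for any streaming client; only the
        # timing logic is the point of this sketch.
        start = time.perf_counter()
        first = None
        count = 0
        for _ in stream_tokens(prompt):
            if first is None:
                first = time.perf_counter()   # first token arrived
            count += 1
        end = time.perf_counter()
        ttft = first - start                  # time to first token, seconds
        # output speed: tokens generated after the first one, per second
        speed = (count - 1) / (end - first) if count > 1 else 0.0
        return ttft, speed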

Cost & Pricing Benchmarks

2 metrics

Pricing metrics showing the cost per million tokens for input and output.
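A minimal sketch of how per-million-token pricing translates into the cost of a single request; the token counts and prices below are placeholders, not real quotes:

    def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
        # prices are expressed in dollars per million tokens
        return (input_tokens / 1_000_000) * input_price_per_m \
             + (output_tokens / 1_000_000) * output_price_per_m

    # e.g. a 2,000-token prompt and a 500-token reply at $3 / $15 per million tokens
    print(request_cost(2_000, 500, 3.00, 15.00))  # -> 0.0135 (about 1.35 cents)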

Context Window Benchmarks

2 metrics

Metrics related to the maximum amount of text a model can process in a single request.

Multimodal Benchmarks

2 metrics

Benchmarks evaluating visual understanding and image-text reasoning capabilities.

MMMU

Massive Multi-discipline Multimodal Understanding

Measured in: Percentage (%). Higher is better.
Source: Papers

Frequently Asked Questions

What benchmarks are used to evaluate AI models?

The AI Value Index tracks 28 benchmarks across 8 categories: General Intelligence, Coding, Mathematics, Reasoning, Speed & Latency, Cost & Pricing, Context Window, and Multimodal. These cover everything from general knowledge and coding ability to mathematical reasoning, speed, and cost.

What is SWE-bench and how does it measure coding ability?

SWE-bench Verified is a curated subset of 500 real-world GitHub issues drawn from 12 popular open-source Python repositories, where each problem has been manually validated by software engineers. It gives an AI agent the full repository codebase plus the original issue text, then requires the agent to locate the bug, edit the correct files, and produce a patch that passes all tests.
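A minimal sketch of what one SWE-bench Verified task looks like, assuming the Hugging Face dataset id princeton-nlp/SWE-bench_Verified and the field names shown; check the official dataset card before relying on either:

    from datasets import load_dataset

    # assumed dataset id and split; field names below are also assumptions
    tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    task = tasks[0]
    print(task["repo"])               # the source repository the agent works in
    print(task["problem_statement"])  # the original GitHub issue text
    print(task["FAIL_TO_PASS"])       # tests the agent's patch must make pass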

What is MMLU-Pro?

MMLU-Pro is a significantly harder evolution of the original MMLU benchmark featuring over 12,000 rigorously curated multiple-choice questions across 14 academic domains. Unlike the original MMLU's 4-option format, MMLU-Pro expands each question to 10 answer choices, reducing random guessing from 25% to 10%.
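A minimal sketch of scoring a batch of 10-choice questions against the chance baseline; the predictions and answers below are made up for illustration:

    def accuracy(predictions, answers):
        return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

    answers     = ["C", "J", "A", "F", "B"]
    predictions = ["C", "J", "D", "F", "B"]
    print(accuracy(predictions, answers))  # 0.8 on this made-up batch
    print(1 / 10)  # chance baseline with 10 options, vs. 1 / 4 = 0.25 on original MMLU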

How is Chatbot Arena ELO calculated?

Chatbot Arena is a crowdsourced platform where real users submit prompts and receive responses from two anonymous AI models side by side, then vote for the one they prefer. The platform uses the Bradley-Terry model to convert millions of pairwise votes into a ranked leaderboard, having collected over 6 million votes across 400+ models.
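A minimal sketch of how pairwise votes can be turned into Bradley-Terry strengths with the classic iterative update and then mapped onto an Elo-like scale; this illustrates the idea only and is not Chatbot Arena's exact pipeline:

    import math
    from collections import defaultdict

    def bradley_terry(votes, iters=200):
        # votes is a list of (winner, loser) pairs from head-to-head comparisons
        models = {m for pair in votes for m in pair}
        wins = defaultdict(int)    # total wins per model
        games = defaultdict(int)   # comparison counts per unordered pair
        for winner, loser in votes:
            wins[winner] += 1
            games[frozenset((winner, loser))] += 1

        strength = {m: 1.0 for m in models}
        for _ in range(iters):
            updated = {}
            for i in models:
                denom = sum(
                    games.get(frozenset((i, j)), 0) / (strength[i] + strength[j])
                    for j in models if j != i
                )
                updated[i] = wins[i] / denom if denom else strength[i]
            total = sum(updated.values())
            strength = {m: s * len(models) / total for m, s in updated.items()}

        # 400 * log10 gives the familiar Elo-style spread, anchored at 1000 here
        return {m: round(1000 + 400 * math.log10(s)) for m, s in strength.items()}

    votes = [("model-a", "model-b"), ("model-b", "model-c"),
             ("model-a", "model-c"), ("model-c", "model-a")]
    print(bradley_terry(votes))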

What metrics matter most when choosing an AI model?

It depends on your use case. For general tasks, Chatbot Arena ELO and MMLU-Pro are key indicators. For software development, prioritize SWE-bench and HumanEval scores. For cost-sensitive applications, compare input and output pricing. For real-time applications, look at output speed and time to first token (TTFT). Use our AI Value Index to weight metrics based on your priorities.
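A minimal sketch of priority-based weighting, assuming made-up scores already normalized to a 0-1 scale where higher is better (cost and latency inverted); this is an illustration, not the index's actual formula:

    # weights reflect a hypothetical coding-heavy use case
    weights = {"mmlu_pro": 0.2, "swe_bench": 0.5, "output_speed": 0.2, "price": 0.1}

    models = {
        "model-a": {"mmlu_pro": 0.78, "swe_bench": 0.49, "output_speed": 0.65, "price": 0.40},
        "model-b": {"mmlu_pro": 0.71, "swe_bench": 0.38, "output_speed": 0.90, "price": 0.85},
    }

    def weighted_score(scores, weights):
        return sum(scores[k] * w for k, w in weights.items()) / sum(weights.values())

    for name, scores in models.items():
        print(name, round(weighted_score(scores, weights), 3))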

About Our Benchmark Methodology

The AI Value Index tracks 28 benchmarks across 8 categories to give you a comprehensive view of AI model capabilities. Scores are sourced from official benchmark leaderboards, provider announcements, and independent evaluation platforms.

Use the AI Value Index to weight these benchmarks based on your priorities, or compare models side-by-side. Browse all model profiles or check pricing details.