AI Benchmarks Explained

What each metric in the AI Value Index measures and why it matters. We track 28 benchmarks across 8 categories.

General Intelligence Benchmarks

7 metrics

Broad knowledge and reasoning benchmarks that test general intelligence across many domains.

Coding Benchmarks

6 metrics

Metrics that evaluate code generation, understanding, and real-world software engineering capabilities.

BigCodeBench

Comprehensive code generation across diverse programming tasks

Top model: GPT-4o (61.1%)
Measured in: Percentage (%). Higher is better.

Mathematics Benchmarks

3 metrics

Mathematical problem-solving benchmarks ranging from competition-level to graduate-level difficulty.

Reasoning Benchmarks

4 metrics

Benchmarks focused on logical reasoning, scientific understanding, and complex problem decomposition.

Speed & Latency Benchmarks

2 metrics

Performance metrics measuring how quickly a model responds and generates output tokens.
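The two numbers behind this category are time to first token (TTFT) and output tokens per second. Below is a minimal sketch of how they can be measured, assuming a hypothetical stream_tokens(prompt) generator that stands in for any provider's streaming API and yields at least one token:

    import time

    def measure_latency(stream_tokens, prompt):
        # stream_tokens is a placeholder for any streaming client; only the
        # timing logic is the point of this sketch.
        start = time.perf_counter()
        first = None
        count = 0
        for _ in stream_tokens(prompt):
            if first is None:
                first = time.perf_counter()   # first token arrived
            count += 1
        end = time.perf_counter()
        ttft = first - start                  # time to first token, seconds
        # output speed: tokens generated after the first one, per second
        speed = (count - 1) / (end - first) if count > 1 else 0.0
        return ttft, speed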

Cost & Pricing Benchmarks

2 metrics

Pricing metrics showing the cost per million tokens for input and output.
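A minimal sketch of how per-million-token pricing translates into the cost of a single request; the token counts and prices below are placeholders, not real quotes:

    def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
        # prices are expressed in dollars per million tokens
        return (input_tokens / 1_000_000) * input_price_per_m \
             + (output_tokens / 1_000_000) * output_price_per_m

    # e.g. a 2,000-token prompt and a 500-token reply at $3 / $15 per million tokens
    print(request_cost(2_000, 500, 3.00, 15.00))  # -> 0.0135 (about 1.35 cents)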

Context Window Benchmarks

2 metrics

Metrics related to the maximum amount of text a model can process in a single request.

Multimodal Benchmarks

2 metrics

Benchmarks evaluating visual understanding and image-text reasoning capabilities.

MMMU

Massive Multi-discipline Multimodal Understanding

Measured in: Percentage (%). Higher is better.
Source: Papers

Frequently Asked Questions

What benchmarks are used to evaluate AI models?

The AI Value Index tracks 28 benchmarks across 8 categories: General Intelligence, Coding, Mathematics, Reasoning, Speed & Latency, Cost & Pricing, Context Window, and Multimodal. These cover everything from general knowledge and coding ability to mathematical reasoning, speed, and cost.

What is SWE-bench and how does it measure coding ability?

SWE-bench Verified is a curated subset of 500 real-world GitHub issues drawn from 12 popular open-source Python repositories, where each problem has been manually validated by software engineers. It gives an AI agent the full repository codebase plus the original issue text, then requires the agent to locate the bug, edit the correct files, and produce a patch that passes all tests.
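A minimal sketch of what one SWE-bench Verified task looks like, assuming the Hugging Face dataset id princeton-nlp/SWE-bench_Verified and the field names shown; check the official dataset card before relying on either:

    from datasets import load_dataset

    # assumed dataset id and split; field names below are also assumptions
    tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    task = tasks[0]
    print(task["repo"])               # the source repository the agent works in
    print(task["problem_statement"])  # the original GitHub issue text
    print(task["FAIL_TO_PASS"])       # tests the agent's patch must make pass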

What is MMLU-Pro?

MMLU-Pro is a significantly harder evolution of the original MMLU benchmark featuring over 12,000 rigorously curated multiple-choice questions across 14 academic domains. Unlike the original MMLU's 4-option format, MMLU-Pro expands each question to 10 answer choices, reducing random guessing from 25% to 10%.
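A minimal sketch of scoring a batch of 10-choice questions against the chance baseline; the predictions and answers below are made up for illustration:

    def accuracy(predictions, answers):
        return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

    answers     = ["C", "J", "A", "F", "B"]
    predictions = ["C", "J", "D", "F", "B"]
    print(accuracy(predictions, answers))  # 0.8 on this made-up batch
    print(1 / 10)  # chance baseline with 10 options, vs. 1 / 4 = 0.25 on original MMLU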

How is Chatbot Arena ELO calculated?

Chatbot Arena is a crowdsourced platform where real users submit prompts and receive responses from two anonymous AI models side by side, then vote for the one they prefer. The platform uses the Bradley-Terry model to convert millions of pairwise votes into a ranked leaderboard, having collected over 6 million votes across 400+ models.
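A minimal sketch of how pairwise votes can be turned into Bradley-Terry strengths with the classic iterative update and then mapped onto an Elo-like scale; this illustrates the idea only and is not Chatbot Arena's exact pipeline:

    import math
    from collections import defaultdict

    def bradley_terry(votes, iters=200):
        # votes is a list of (winner, loser) pairs from head-to-head comparisons
        models = {m for pair in votes for m in pair}
        wins = defaultdict(int)    # total wins per model
        games = defaultdict(int)   # comparison counts per unordered pair
        for winner, loser in votes:
            wins[winner] += 1
            games[frozenset((winner, loser))] += 1

        strength = {m: 1.0 for m in models}
        for _ in range(iters):
            updated = {}
            for i in models:
                denom = sum(
                    games.get(frozenset((i, j)), 0) / (strength[i] + strength[j])
                    for j in models if j != i
                )
                updated[i] = wins[i] / denom if denom else strength[i]
            total = sum(updated.values())
            strength = {m: s * len(models) / total for m, s in updated.items()}

        # 400 * log10 gives the familiar Elo-style spread, anchored at 1000 here
        return {m: round(1000 + 400 * math.log10(s)) for m, s in strength.items()}

    votes = [("model-a", "model-b"), ("model-b", "model-c"),
             ("model-a", "model-c"), ("model-c", "model-a")]
    print(bradley_terry(votes))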

What metrics matter most when choosing an AI model?

It depends on your use case. For general tasks, Chatbot Arena ELO and MMLU-Pro are key indicators. For software development, prioritize SWE-bench and HumanEval scores. For cost-sensitive applications, compare input and output pricing. For real-time applications, look at output speed and time to first token (TTFT). Use our AI Value Index to weight metrics based on your priorities.
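A minimal sketch of priority-based weighting, assuming made-up scores already normalized to a 0-1 scale where higher is better (cost and latency inverted); this is an illustration, not the index's actual formula:

    # weights reflect a hypothetical coding-heavy use case
    weights = {"mmlu_pro": 0.2, "swe_bench": 0.5, "output_speed": 0.2, "price": 0.1}

    models = {
        "model-a": {"mmlu_pro": 0.78, "swe_bench": 0.49, "output_speed": 0.65, "price": 0.40},
        "model-b": {"mmlu_pro": 0.71, "swe_bench": 0.38, "output_speed": 0.90, "price": 0.85},
    }

    def weighted_score(scores, weights):
        return sum(scores[k] * w for k, w in weights.items()) / sum(weights.values())

    for name, scores in models.items():
        print(name, round(weighted_score(scores, weights), 3))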

About Our Benchmark Methodology

The AI Value Index tracks 28 benchmarks across 8 categories to give you a comprehensive view of AI model capabilities. Scores are sourced from official benchmark leaderboards, provider announcements, and independent evaluation platforms.

Use the AI Value Index to weight these benchmarks based on your priorities, or compare models side-by-side. Browse all model profiles or check pricing details.