What each metric in the AI Value Index measures and why it matters. We track 28 benchmarks across 8 categories.
Broad knowledge and reasoning benchmarks that test general intelligence across many domains.
Human preference ELO from blind head-to-head votes
Professional-difficulty extension of Massive Multitask Language Understanding
Instruction following quality scored by GPT-4 as judge
Multi-turn conversation quality on 80 curated dialogues
Strict instruction following accuracy on verifiable constraints
Factual accuracy on short, verifiable questions
Resistance to generating false but plausible answers
Metrics that evaluate code generation, understanding, and real-world software engineering capabilities.
Code generation correctness with extended tests
Real-world software engineering task resolution
Live competitive programming benchmark
Multi-language code editing accuracy with real git repos
Comprehensive code generation across diverse programming tasks
Berkeley Function Calling Leaderboard — tool use accuracy
Mathematical problem-solving benchmarks ranging from competition-level to graduate-level difficulty.
Competition mathematics problem solving
Grade school math word problems
American Invitational Mathematics Examination — competition math
Benchmarks focused on logical reasoning, scientific understanding, and complex problem decomposition.
Graduate-level science Q&A by domain experts
Abstraction and Reasoning Corpus for general intelligence
Big-Bench Hard — 23 challenging multi-step reasoning tasks
Commonsense reasoning via pronoun resolution
Performance metrics measuring how quickly a model responds and generates output tokens.
Tokens generated per second
Latency before first token arrives
Pricing metrics showing the cost per million tokens for input and output.
Cost per 1M input tokens
Cost per 1M output tokens
Metrics related to the maximum amount of text a model can process in a single request.
Maximum context window size
Long-context understanding and retrieval accuracy at depth
Benchmarks evaluating visual understanding and image-text reasoning capabilities.
Massive Multi-discipline Multimodal Understanding
Visual mathematical reasoning across diagrams and charts
The AI Value Index tracks 28 benchmarks across 8 categories: General Intelligence, Coding, Mathematics, Reasoning, Speed & Latency, Cost & Pricing, Context Window, and Multimodal. These cover everything from general knowledge and coding ability to mathematical reasoning, speed, and cost.
SWE-bench Verified is a curated subset of 500 real-world GitHub issues drawn from 12 popular open-source Python repositories, where each problem has been manually validated by software engineers. It gives an AI agent the full repository codebase plus the original issue text, then requires the agent to locate the bug, edit the correct files, and produce a patch that passes all tests.
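For a concrete look at what an agent is actually given, the sketch below pulls a few tasks from the publicly released dataset. It assumes the Hugging Face dataset ID princeton-nlp/SWE-bench_Verified and field names such as problem_statement and FAIL_TO_PASS; treat those as assumptions that may change if the dataset is revised.

```python
# Sketch: inspect a SWE-bench Verified task (assumes the public Hugging Face
# dataset "princeton-nlp/SWE-bench_Verified" and its published field names).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # ~500 manually validated GitHub issues

task = ds[0]
print(task["repo"])                      # source repository, e.g. "astropy/astropy"
print(task["problem_statement"][:300])   # the original issue text given to the agent
print(task["FAIL_TO_PASS"])              # tests a correct patch must make pass
```

An evaluation harness then applies the agent's patch to the repository at the task's base commit and runs those tests to decide whether the issue counts as resolved.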
MMLU-Pro is a significantly harder evolution of the original MMLU benchmark featuring over 12,000 rigorously curated multiple-choice questions across 14 academic domains. Unlike the original MMLU's 4-option format, MMLU-Pro expands each question to 10 answer choices, reducing random guessing from 25% to 10%.
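One illustrative way to see why the 10-option format matters is to rescale a raw score against its chance baseline. The helper below is our own illustration of that comparison, not part of the official MMLU-Pro scoring.

```python
# Sketch: why 10 answer choices matter. Chance accuracy is 1/num_choices,
# so the same raw score means more on MMLU-Pro than on 4-option MMLU.
def above_chance(raw_accuracy: float, num_choices: int) -> float:
    """Rescale raw accuracy so random guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / num_choices
    return (raw_accuracy - chance) / (1.0 - chance)

print(above_chance(0.70, 4))   # ~0.60 above chance on 4-option MMLU
print(above_chance(0.70, 10))  # ~0.667 above chance on 10-option MMLU-Pro
```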
Chatbot Arena is a crowdsourced platform where real users submit prompts and receive responses from two anonymous AI models side by side, then vote for the one they prefer. The platform uses the Bradley-Terry model to convert millions of pairwise votes into a ranked leaderboard, and it has collected over 6 million votes across 400+ models.
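As a rough sketch of how pairwise votes become ratings, the toy solver below fits a Bradley-Terry model to a handful of made-up votes and converts the fitted strengths to an Elo-like scale. It is an illustrative fixed-point iteration under our own assumptions, not Chatbot Arena's actual pipeline.

```python
# Toy Bradley-Terry fit over pairwise votes. `votes` is a hypothetical list
# of (winner, loser) pairs; the update rule is the standard MM iteration
# s_i <- W_i / sum_j( n_ij / (s_i + s_j) ).
import math
from collections import Counter

votes = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "B"), ("A", "C")]

models = sorted({m for pair in votes for m in pair})
wins = Counter(votes)                    # wins[(i, j)] = times i beat j
games = Counter()                        # games[{i, j}] = matches between i and j
for w, l in votes:
    games[tuple(sorted((w, l)))] += 1

strength = {m: 1.0 for m in models}
for _ in range(200):                     # fixed-point iterations
    new = {}
    for i in models:
        total_wins = sum(wins[(i, j)] for j in models)
        denom = sum(
            games[tuple(sorted((i, j)))] / (strength[i] + strength[j])
            for j in models
            if j != i and games[tuple(sorted((i, j)))]
        )
        new[i] = total_wins / denom if denom else strength[i]
    norm = sum(new.values())
    strength = {m: s * len(models) / norm for m, s in new.items()}

# Map strengths onto an Elo-like scale anchored at 1000 for readability.
ratings = {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}
print(ratings)
```

The real leaderboard works the same way in spirit: each vote is a pairwise outcome, and the fitted strengths determine the ranking.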
It depends on your use case. For general tasks, Chatbot Arena ELO and MMLU-Pro are key indicators. For software development, prioritize SWE-bench and HumanEval scores. For cost-sensitive applications, compare input and output pricing. For real-time applications, look at output speed and time to first token (TTFT). Use our AI Value Index to weight metrics based on your priorities.
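As a minimal sketch of what priority weighting looks like, the snippet below averages a few example benchmark scores under two different weight profiles. The metric names, scores, and weights are invented for illustration and are not taken from the index.

```python
# Sketch of priority-weighted scoring in the spirit of the AI Value Index;
# all numbers below are made up for illustration.
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Normalize the weights and return the weighted average of 0-100 scores."""
    total = sum(weights.values())
    return sum(scores[m] * w / total for m, w in weights.items())

model_scores = {"swe_bench": 48.0, "mmlu_pro": 75.0, "output_speed": 82.0}

# A software-development profile puts most of its weight on SWE-bench.
dev_weights = {"swe_bench": 0.6, "mmlu_pro": 0.2, "output_speed": 0.2}
print(weighted_score(model_scores, dev_weights))   # ~60.2

# A latency-sensitive profile flips the emphasis toward output speed.
fast_weights = {"swe_bench": 0.2, "mmlu_pro": 0.2, "output_speed": 0.6}
print(weighted_score(model_scores, fast_weights))  # ~73.8
```

The same model can rank quite differently under the two profiles, which is the point of weighting metrics by use case rather than reading any single leaderboard in isolation.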
The AI Value Index tracks 28 benchmarks across 8 categories to give you a comprehensive view of AI model capabilities. Scores are sourced from official benchmark leaderboards, provider announcements, and independent evaluation platforms.
Use the AI Value Index to weight these benchmarks based on your priorities, or compare models side-by-side. Browse all model profiles or check pricing details.