Submit Models
Point to GPT-4o, Claude, Gemini, Llama 3—or your own private endpoint—and specify the subject areas you care about.
We use multi-LLM evaluation to rate LLM quality, cost, and speed accurately and without single-judge bias. Because the question set is regenerated for every run, AutoBench resists gaming.
Our system uses 20+ LLMs to generate granular benchmarks whose scores show 90%+ correlation with AAII and 80%+ with LMArena.
- Average score: combines the domain-specific AutoBench scores; higher is better.
- Cost: USD cents per average answer; lower is better.
- Latency: average latency in seconds; lower is better.
| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 4.48 (#3) | 4.61 (#1) | 4.42 (#6) | 4.52 (#3) | 4.42 (#7) | 4.45 (#2) | 4.57 (#3) | 4.16 (#3) | 4.25 (#2) | 4.63 (#2) | 4.63 (#2) |
| | 4.51 (#1) | 4.58 (#2) | 4.52 (#3) | 4.59 (#1) | 4.62 (#1) | 4.36 (#7) | 4.64 (#1) | 4.21 (#1) | 4.17 (#5) | 4.65 (#1) | 4.66 (#1) |
| | 4.49 (#2) | 4.54 (#3) | 4.54 (#1) | 4.5 (#4) | 4.52 (#3) | 4.44 (#4) | 4.56 (#4) | 4.18 (#2) | 4.25 (#1) | 4.62 (#3) | 4.63 (#3) |
| | 4.42 (#4) | 4.52 (#4) | 4.39 (#7) | 4.42 (#6) | 4.49 (#5) | 4.45 (#1) | 4.48 (#7) | 4.09 (#4) | 4.24 (#3) | 4.52 (#5) | 4.52 (#7) |
| | 4.41 (#5) | 4.43 (#5) | 4.3 (#13) | 4.56 (#2) | 4.5 (#4) | 4.39 (#5) | 4.57 (#2) | 3.96 (#7) | 4.16 (#6) | 4.58 (#4) | 4.61 (#4) |
| | 4.32 (#8) | 4.42 (#6) | 4.17 (#21) | 4.37 (#8) | 4.33 (#14) | 4.38 (#6) | 4.42 (#11) | 4.01 (#6) | 4.23 (#4) | 4.43 (#10) | 4.39 (#15) |
| | 4.33 (#7) | 4.41 (#7) | 4.38 (#8) | 4.35 (#10) | 4.4 (#10) | 4.3 (#11) | 4.45 (#9) | 3.88 (#9) | 4.13 (#7) | 4.37 (#15) | 4.52 (#8) |
| | 4.39 (#6) | 4.31 (#8) | 4.44 (#5) | 4.48 (#5) | 4.56 (#2) | 4.45 (#3) | 4.51 (#5) | 3.84 (#10) | 3.94 (#8) | 4.44 (#9) | 4.54 (#5) |
| | 4.31 (#9) | 4.31 (#9) | 4.35 (#10) | 4.33 (#12) | 4.38 (#11) | 4.34 (#8) | 4.4 (#13) | 4.01 (#5) | 3.85 (#9) | 4.41 (#12) | 4.4 (#12) |
| | 4.27 (#10) | 4.31 (#10) | 4.24 (#18) | 4.36 (#9) | 4.38 (#12) | 4.26 (#12) | 4.35 (#15) | 3.9 (#8) | 3.84 (#10) | 4.48 (#6) | 4.53 (#6) |
| | 4.24 (#11) | 4.29 (#11) | 4.51 (#4) | 4.3 (#13) | 4.43 (#6) | 4.23 (#14) | 4.44 (#10) | 3.57 (#12) | 3.58 (#13) | 4.42 (#11) | 4.48 (#10) |
| | 4.17 (#16) | 4.24 (#12) | 4.32 (#11) | 4.26 (#16) | 4.19 (#19) | 4.21 (#15) | 4.23 (#18) | 3.74 (#11) | 3.79 (#11) | 4.27 (#16) | 4.32 (#16) |
| | 4.17 (#15) | 4.19 (#13) | 4.36 (#9) | 4.3 (#14) | 4.33 (#15) | 4.25 (#13) | 4.35 (#14) | 3.55 (#14) | 3.48 (#17) | 4.4 (#13) | 4.39 (#14) |
| | 4.18 (#13) | 4.12 (#14) | 4.54 (#2) | 4.29 (#15) | 4.35 (#13) | 4.19 (#16) | 4.5 (#6) | 3.4 (#21) | 3.52 (#15) | 4.46 (#8) | 4.41 (#11) |
| | 4.02 (#19) | 4.11 (#15) | 4.15 (#23) | 4.08 (#23) | 4.17 (#21) | 4.05 (#19) | 4.16 (#22) | 3.34 (#23) | 3.55 (#14) | 4.24 (#19) | 4.19 (#21) |
| | 4.06 (#17) | 4.02 (#16) | 4.18 (#20) | 4.16 (#19) | 4.21 (#17) | 4.16 (#17) | 4.21 (#19) | 3.51 (#15) | 3.49 (#16) | 4.26 (#17) | 4.26 (#18) |
| | 4.18 (#12) | 3.95 (#17) | 4.31 (#12) | 4.35 (#11) | 4.41 (#9) | 4.31 (#10) | 4.4 (#12) | 3.56 (#13) | 3.63 (#12) | 4.39 (#14) | 4.4 (#13) |
| | 4.18 (#14) | 3.89 (#18) | 4.26 (#17) | 4.38 (#7) | 4.42 (#8) | 4.32 (#9) | 4.47 (#8) | 3.48 (#17) | 3.47 (#18) | 4.48 (#7) | 4.49 (#9) |
| | 3.95 (#23) | 3.88 (#19) | 4.19 (#19) | 4.07 (#24) | 4.06 (#25) | 4.05 (#20) | 4.09 (#25) | 3.37 (#22) | 3.44 (#19) | 4.14 (#23) | 4.1 (#25) |
| | 3.88 (#24) | 3.83 (#20) | 4.05 (#24) | 4.04 (#25) | 4.12 (#23) | 3.99 (#23) | 4.16 (#21) | 3.04 (#29) | 3.27 (#25) | 4.12 (#25) | 4.13 (#24) |
| | 3.98 (#21) | 3.83 (#21) | 4.28 (#15) | 4.11 (#20) | 4.09 (#24) | 4.02 (#22) | 4.15 (#23) | 3.44 (#19) | 3.34 (#22) | 4.23 (#20) | 4.16 (#22) |
| | 3.98 (#20) | 3.8 (#22) | 3.99 (#25) | 4.19 (#17) | 4.19 (#18) | 4.03 (#21) | 4.29 (#16) | 3.42 (#20) | 3.34 (#23) | 4.25 (#18) | 4.27 (#17) |
| | 3.95 (#22) | 3.8 (#23) | 4.17 (#22) | 4.09 (#22) | 4.16 (#22) | 3.94 (#25) | 4.14 (#24) | 3.49 (#16) | 3.39 (#21) | 4.13 (#24) | 4.15 (#23) |
| | 4.02 (#18) | 3.77 (#24) | 4.27 (#16) | 4.18 (#18) | 4.22 (#16) | 4.09 (#18) | 4.26 (#17) | 3.47 (#18) | 3.43 (#20) | 4.23 (#21) | 4.24 (#19) |
| | 3.71 (#27) | 3.74 (#25) | 3.23 (#33) | 3.92 (#26) | 3.89 (#28) | 3.84 (#27) | 3.97 (#26) | 3.22 (#24) | 3.28 (#24) | 4 (#26) | 3.94 (#27) |
| | 3.64 (#29) | 3.59 (#26) | 3.74 (#32) | 3.7 (#31) | 3.78 (#31) | 3.83 (#28) | 3.82 (#30) | 3.13 (#25) | 3.1 (#28) | 3.83 (#29) | 3.79 (#31) |
| | 3.88 (#25) | 3.57 (#27) | 4.29 (#14) | 4.11 (#21) | 4.17 (#20) | 3.97 (#24) | 4.18 (#20) | 3.04 (#30) | 3.08 (#29) | 4.19 (#22) | 4.2 (#20) |
| | 3.71 (#26) | 3.5 (#28) | 3.97 (#26) | 3.83 (#27) | 3.93 (#27) | 3.8 (#29) | 3.92 (#27) | 3.11 (#27) | 3.11 (#27) | 3.99 (#27) | 3.95 (#26) |
| | 3.66 (#28) | 3.48 (#29) | 3.97 (#27) | 3.73 (#29) | 3.85 (#29) | 3.66 (#31) | 3.82 (#31) | 3.13 (#26) | 3.2 (#26) | 3.87 (#28) | 3.85 (#28) |
| | 3.59 (#31) | 3.47 (#30) | 3.86 (#29) | 3.74 (#28) | 4 (#26) | 3.74 (#30) | 3.87 (#28) | 2.82 (#33) | 2.78 (#33) | 3.74 (#33) | 3.81 (#30) |
| | 3.61 (#30) | 3.37 (#31) | 3.86 (#28) | 3.66 (#32) | 3.83 (#30) | 3.85 (#26) | 3.84 (#29) | 3.05 (#28) | 3 (#30) | 3.82 (#31) | 3.84 (#29) |
| | 3.49 (#33) | 3.36 (#32) | 3.84 (#30) | 3.55 (#33) | 3.72 (#33) | 3.53 (#32) | 3.63 (#33) | 2.96 (#32) | 2.85 (#32) | 3.75 (#32) | 3.61 (#33) |
| | 3.54 (#32) | 3.32 (#33) | 3.78 (#31) | 3.71 (#30) | 3.77 (#32) | 3.51 (#33) | 3.76 (#32) | 2.99 (#31) | 2.95 (#31) | 3.82 (#30) | 3.75 (#32) |
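If you want to sanity-check the Average (All Topics) column, the sketch below recomputes it as an unweighted mean of the ten topic scores for one row. The equal-weighting formula is our reading of the legend rather than a documented rule, and because the published figures are rounded to two decimals the result can differ by a few hundredths.

```python
# Minimal sketch: recomputing "Average (All Topics)" as the unweighted mean
# of the per-topic scores in one leaderboard row. The equal weighting is an
# assumption; published averages may use unrounded scores or topic weights.
topics = ["Coding", "Creative Writing", "Current News", "General Culture",
          "Grammar", "History", "Logics", "Math", "Science", "Technology"]

row = {  # topic scores from the first table row above (model name omitted)
    "Coding": 4.61, "Creative Writing": 4.42, "Current News": 4.52,
    "General Culture": 4.42, "Grammar": 4.45, "History": 4.57,
    "Logics": 4.16, "Math": 4.25, "Science": 4.63, "Technology": 4.63,
}

average = sum(row[t] for t in topics) / len(topics)
print(f"Average (All Topics): {average:.2f}")  # ~4.47 vs. the published 4.48
```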
AutoBench operates through a fully automated, iterative process designed for robustness and statistical significance:

1. Point to GPT-4o, Claude, Gemini, Llama 3, or your own private endpoint, and specify the subject areas you care about.
2. The engine writes difficulty-balanced prompts, solicits answers from each model, and quality-checks every response automatically.
3. Every model anonymously judges its peers; a weighting algorithm refines scores until the leaderboard stabilises (a toy sketch of this loop follows the list).
4. Download a ready-to-share CSV plus an interactive dashboard that plugs into Hugging Face Spaces or your internal BI tools.
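The weighting algorithm itself is not specified here, so the following is only a minimal sketch of the general idea of confidence-weighted peer judging: each model's score is the weighted mean of the grades its peers gave it, a judge's weight tracks its own current score, and the loop repeats until the scores stop moving. The function name, data layout, and convergence rule are illustrative assumptions, not AutoBench's implementation.

```python
# Illustrative sketch of iterative, confidence-weighted peer judging.
# This is NOT AutoBench's actual algorithm; names and the update rule
# are assumptions made for illustration only.
def stabilise(grades: dict[str, dict[str, float]],
              tol: float = 1e-4, max_iter: int = 100) -> dict[str, float]:
    """grades[judge][candidate] = grade that `judge` gave `candidate`."""
    models = list(grades)
    weights = {m: 1.0 for m in models}   # start by trusting every judge equally
    scores = {m: 0.0 for m in models}
    for _ in range(max_iter):
        new_scores = {}
        for candidate in models:
            judges = [j for j in models if j != candidate]  # no self-grading
            total = sum(weights[j] * grades[j][candidate] for j in judges)
            new_scores[candidate] = total / sum(weights[j] for j in judges)
        # Judges that currently score well are weighted more next round.
        weights = {m: max(new_scores[m], 1e-6) for m in models}
        if max(abs(new_scores[m] - scores[m]) for m in models) < tol:
            break
        scores = new_scores
    return new_scores


# Toy usage with three hypothetical models (self-grades are ignored).
grades = {
    "model_a": {"model_a": 0.0, "model_b": 4.2, "model_c": 3.8},
    "model_b": {"model_a": 4.5, "model_b": 0.0, "model_c": 3.9},
    "model_c": {"model_a": 4.4, "model_b": 4.1, "model_c": 0.0},
}
print(stabilise(grades))  # scores settle near 4.45 / 4.15 / 3.85
```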
AutoBench’s effectiveness is not theoretical: the results from its public runs demonstrate both unprecedented scale and exceptionally high correlation with industry-standard benchmarks.
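How such agreement is typically quantified is shown in the short sketch below, which computes a Spearman rank correlation between two leaderboards. The score lists are hypothetical placeholders, and the choice of Spearman (rather than, say, Pearson) is an assumption, since the exact metric behind the quoted figures is not stated here.

```python
# Minimal sketch: checking how well two leaderboards agree using a
# Spearman rank correlation. The numbers are hypothetical placeholders,
# not actual AutoBench, AAII, or LMArena scores.
from scipy.stats import spearmanr

# Scores for the same eight models under two different benchmarks.
autobench_scores = [4.51, 4.49, 4.48, 4.42, 4.41, 4.39, 4.33, 4.32]
reference_scores = [1310, 1301, 1295, 1288, 1275, 1284, 1266, 1259]

rho, p_value = spearmanr(autobench_scores, reference_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.4f})")
```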
We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation. Explore our resources on Hugging Face.
Large corporations project billions in LLM API calls, but relying on a single model for all tasks leads to massive inefficiencies. AutoBench evaluates models on your internal use cases and data, identifying the optimal model for tasks like sentiment analysis, document summarization, or customer support.
Gain immediate visibility into cost-quality trade-offs. By analyzing performance metrics like average answer cost and P99 duration, AutoBench reveals how switching models can save an estimated 20%+ on LLM expenditure without sacrificing quality.
Seamlessly switch to cost-effective models and monitor ongoing performance. Our enterprise-specific benchmarks ensure continuous optimization, preventing overpayments and improving reliability in high-volume AI deployments.
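One simple way to act on those benchmark results is sketched below: for each task, pick the cheapest model whose score clears a quality floor. The task names, model labels, scores, and per-answer costs are hypothetical placeholders rather than AutoBench output, and the threshold rule is just one possible selection policy.

```python
# Hypothetical sketch: per task, choose the cheapest model that still meets a
# minimum quality score. All names and numbers below are placeholders.
candidates = {
    "document summarization": [
        {"model": "large-model",  "score": 4.55, "cost_cents": 1.20},
        {"model": "medium-model", "score": 4.41, "cost_cents": 0.35},
        {"model": "small-model",  "score": 3.90, "cost_cents": 0.08},
    ],
    "sentiment analysis": [
        {"model": "large-model",  "score": 4.60, "cost_cents": 1.10},
        {"model": "small-model",  "score": 4.32, "cost_cents": 0.07},
    ],
}

MIN_SCORE = 4.3  # quality floor; in practice, tune this per task

for task, options in candidates.items():
    eligible = [o for o in options if o["score"] >= MIN_SCORE]
    cheapest = min(eligible, key=lambda o: o["cost_cents"])
    print(f"{task}: {cheapest['model']} at {cheapest['cost_cents']}¢ per answer")
```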
With over 20 major labs competing and a $50M TAM for R&D enablement in 2025, granular evaluation is critical. AutoBench offers private, domain-focused benchmarks that reveal weaknesses in areas like advanced reasoning or specific coding tasks.
Get instant, nuanced views of performance trade-offs through collective LLM judging. Backed by ~300,000 ranks and high correlations (e.g., 86.85% with human preference), it provides actionable data to refine models efficiently.
Monitor progress and switch training strategies with ease. Our scalable framework supports continuous custom runs, helping labs adapt architectures and data for better outcomes in the intensifying AI arms race.
Still have questions? These quick answers clear up the most common concerns about bringing AutoBench into your workflow.