Submit Models
Point to GPT-4o, Claude, Gemini, Llama 3—or your own private endpoint—and specify the subject areas you care about.
We use multi-LLM evaluation to rate LLM quality, cost, and speed accurately and without single-judge bias. Because the question set is regenerated for every run, AutoBench resists gaming.
Our system uses 20+ LLMs to generate granular benchmarks whose scores show 90%+ correlation with AAII and 80%+ with LMArena.
- Average score: combines the domain-specific AutoBench scores; higher is better.
- Cost: USD cents per average answer; lower is better.
- Latency: average latency in seconds; lower is better.
| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 4.48 (#3) | 4.61 (#1) | 4.42 (#6) | 4.52 (#3) | 4.42 (#7) | 4.45 (#2) | 4.57 (#3) | 4.16 (#3) | 4.25 (#2) | 4.63 (#2) | 4.63 (#2) |
| | 4.51 (#1) | 4.58 (#2) | 4.52 (#3) | 4.59 (#1) | 4.62 (#1) | 4.36 (#7) | 4.64 (#1) | 4.21 (#1) | 4.17 (#5) | 4.65 (#1) | 4.66 (#1) |
| | 4.49 (#2) | 4.54 (#3) | 4.54 (#1) | 4.5 (#4) | 4.52 (#3) | 4.44 (#4) | 4.56 (#4) | 4.18 (#2) | 4.25 (#1) | 4.62 (#3) | 4.63 (#3) |
| | 4.42 (#4) | 4.52 (#4) | 4.39 (#7) | 4.42 (#6) | 4.49 (#5) | 4.45 (#1) | 4.48 (#7) | 4.09 (#4) | 4.24 (#3) | 4.52 (#5) | 4.52 (#7) |
| | 4.41 (#5) | 4.43 (#5) | 4.3 (#13) | 4.56 (#2) | 4.5 (#4) | 4.39 (#5) | 4.57 (#2) | 3.96 (#7) | 4.16 (#6) | 4.58 (#4) | 4.61 (#4) |
| | 4.32 (#8) | 4.42 (#6) | 4.17 (#21) | 4.37 (#8) | 4.33 (#14) | 4.38 (#6) | 4.42 (#11) | 4.01 (#6) | 4.23 (#4) | 4.43 (#10) | 4.39 (#15) |
| | 4.33 (#7) | 4.41 (#7) | 4.38 (#8) | 4.35 (#10) | 4.4 (#10) | 4.3 (#11) | 4.45 (#9) | 3.88 (#9) | 4.13 (#7) | 4.37 (#15) | 4.52 (#8) |
| | 4.39 (#6) | 4.31 (#8) | 4.44 (#5) | 4.48 (#5) | 4.56 (#2) | 4.45 (#3) | 4.51 (#5) | 3.84 (#10) | 3.94 (#8) | 4.44 (#9) | 4.54 (#5) |
| | 4.31 (#9) | 4.31 (#9) | 4.35 (#10) | 4.33 (#12) | 4.38 (#11) | 4.34 (#8) | 4.4 (#13) | 4.01 (#5) | 3.85 (#9) | 4.41 (#12) | 4.4 (#12) |
| | 4.27 (#10) | 4.31 (#10) | 4.24 (#18) | 4.36 (#9) | 4.38 (#12) | 4.26 (#12) | 4.35 (#15) | 3.9 (#8) | 3.84 (#10) | 4.48 (#6) | 4.53 (#6) |
| | 4.24 (#11) | 4.29 (#11) | 4.51 (#4) | 4.3 (#13) | 4.43 (#6) | 4.23 (#14) | 4.44 (#10) | 3.57 (#12) | 3.58 (#13) | 4.42 (#11) | 4.48 (#10) |
| | 4.17 (#16) | 4.24 (#12) | 4.32 (#11) | 4.26 (#16) | 4.19 (#19) | 4.21 (#15) | 4.23 (#18) | 3.74 (#11) | 3.79 (#11) | 4.27 (#16) | 4.32 (#16) |
| | 4.17 (#15) | 4.19 (#13) | 4.36 (#9) | 4.3 (#14) | 4.33 (#15) | 4.25 (#13) | 4.35 (#14) | 3.55 (#14) | 3.48 (#17) | 4.4 (#13) | 4.39 (#14) |
| | 4.18 (#13) | 4.12 (#14) | 4.54 (#2) | 4.29 (#15) | 4.35 (#13) | 4.19 (#16) | 4.5 (#6) | 3.4 (#21) | 3.52 (#15) | 4.46 (#8) | 4.41 (#11) |
| | 4.02 (#19) | 4.11 (#15) | 4.15 (#23) | 4.08 (#23) | 4.17 (#21) | 4.05 (#19) | 4.16 (#22) | 3.34 (#23) | 3.55 (#14) | 4.24 (#19) | 4.19 (#21) |
| | 4.06 (#17) | 4.02 (#16) | 4.18 (#20) | 4.16 (#19) | 4.21 (#17) | 4.16 (#17) | 4.21 (#19) | 3.51 (#15) | 3.49 (#16) | 4.26 (#17) | 4.26 (#18) |
| | 4.18 (#12) | 3.95 (#17) | 4.31 (#12) | 4.35 (#11) | 4.41 (#9) | 4.31 (#10) | 4.4 (#12) | 3.56 (#13) | 3.63 (#12) | 4.39 (#14) | 4.4 (#13) |
| | 4.18 (#14) | 3.89 (#18) | 4.26 (#17) | 4.38 (#7) | 4.42 (#8) | 4.32 (#9) | 4.47 (#8) | 3.48 (#17) | 3.47 (#18) | 4.48 (#7) | 4.49 (#9) |
| | 3.95 (#23) | 3.88 (#19) | 4.19 (#19) | 4.07 (#24) | 4.06 (#25) | 4.05 (#20) | 4.09 (#25) | 3.37 (#22) | 3.44 (#19) | 4.14 (#23) | 4.1 (#25) |
| | 3.88 (#24) | 3.83 (#20) | 4.05 (#24) | 4.04 (#25) | 4.12 (#23) | 3.99 (#23) | 4.16 (#21) | 3.04 (#29) | 3.27 (#25) | 4.12 (#25) | 4.13 (#24) |
| | 3.98 (#21) | 3.83 (#21) | 4.28 (#15) | 4.11 (#20) | 4.09 (#24) | 4.02 (#22) | 4.15 (#23) | 3.44 (#19) | 3.34 (#22) | 4.23 (#20) | 4.16 (#22) |
| | 3.98 (#20) | 3.8 (#22) | 3.99 (#25) | 4.19 (#17) | 4.19 (#18) | 4.03 (#21) | 4.29 (#16) | 3.42 (#20) | 3.34 (#23) | 4.25 (#18) | 4.27 (#17) |
| | 3.95 (#22) | 3.8 (#23) | 4.17 (#22) | 4.09 (#22) | 4.16 (#22) | 3.94 (#25) | 4.14 (#24) | 3.49 (#16) | 3.39 (#21) | 4.13 (#24) | 4.15 (#23) |
| | 4.02 (#18) | 3.77 (#24) | 4.27 (#16) | 4.18 (#18) | 4.22 (#16) | 4.09 (#18) | 4.26 (#17) | 3.47 (#18) | 3.43 (#20) | 4.23 (#21) | 4.24 (#19) |
| | 3.71 (#27) | 3.74 (#25) | 3.23 (#33) | 3.92 (#26) | 3.89 (#28) | 3.84 (#27) | 3.97 (#26) | 3.22 (#24) | 3.28 (#24) | 4 (#26) | 3.94 (#27) |
| | 3.64 (#29) | 3.59 (#26) | 3.74 (#32) | 3.7 (#31) | 3.78 (#31) | 3.83 (#28) | 3.82 (#30) | 3.13 (#25) | 3.1 (#28) | 3.83 (#29) | 3.79 (#31) |
| | 3.88 (#25) | 3.57 (#27) | 4.29 (#14) | 4.11 (#21) | 4.17 (#20) | 3.97 (#24) | 4.18 (#20) | 3.04 (#30) | 3.08 (#29) | 4.19 (#22) | 4.2 (#20) |
| | 3.71 (#26) | 3.5 (#28) | 3.97 (#26) | 3.83 (#27) | 3.93 (#27) | 3.8 (#29) | 3.92 (#27) | 3.11 (#27) | 3.11 (#27) | 3.99 (#27) | 3.95 (#26) |
| | 3.66 (#28) | 3.48 (#29) | 3.97 (#27) | 3.73 (#29) | 3.85 (#29) | 3.66 (#31) | 3.82 (#31) | 3.13 (#26) | 3.2 (#26) | 3.87 (#28) | 3.85 (#28) |
| | 3.59 (#31) | 3.47 (#30) | 3.86 (#29) | 3.74 (#28) | 4 (#26) | 3.74 (#30) | 3.87 (#28) | 2.82 (#33) | 2.78 (#33) | 3.74 (#33) | 3.81 (#30) |
| | 3.61 (#30) | 3.37 (#31) | 3.86 (#28) | 3.66 (#32) | 3.83 (#30) | 3.85 (#26) | 3.84 (#29) | 3.05 (#28) | 3 (#30) | 3.82 (#31) | 3.84 (#29) |
| | 3.49 (#33) | 3.36 (#32) | 3.84 (#30) | 3.55 (#33) | 3.72 (#33) | 3.53 (#32) | 3.63 (#33) | 2.96 (#32) | 2.85 (#32) | 3.75 (#32) | 3.61 (#33) |
| | 3.54 (#32) | 3.32 (#33) | 3.78 (#31) | 3.71 (#30) | 3.77 (#32) | 3.51 (#33) | 3.76 (#32) | 2.99 (#31) | 2.95 (#31) | 3.82 (#30) | 3.75 (#32) |
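If you want to sanity-check the Average (All Topics) column, the sketch below recomputes it as an unweighted mean of the ten topic scores for one row. The equal-weighting formula is our reading of the legend rather than a documented rule, and because the published figures are rounded to two decimals the result can differ by a few hundredths.

```python
# Minimal sketch: recomputing "Average (All Topics)" as the unweighted mean
# of the per-topic scores in one leaderboard row. The equal weighting is an
# assumption; published averages may use unrounded scores or topic weights.
topics = ["Coding", "Creative Writing", "Current News", "General Culture",
          "Grammar", "History", "Logics", "Math", "Science", "Technology"]

row = {  # topic scores from the first table row above (model name omitted)
    "Coding": 4.61, "Creative Writing": 4.42, "Current News": 4.52,
    "General Culture": 4.42, "Grammar": 4.45, "History": 4.57,
    "Logics": 4.16, "Math": 4.25, "Science": 4.63, "Technology": 4.63,
}

average = sum(row[t] for t in topics) / len(topics)
print(f"Average (All Topics): {average:.2f}")  # ~4.47 vs. the published 4.48
```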
AutoBench operates through a fully automated, iterative process designed for robustness and statistical significance:

1. Point to GPT-4o, Claude, Gemini, Llama 3, or your own private endpoint, and specify the subject areas you care about.
2. The engine writes difficulty-balanced prompts, solicits answers from each model, and quality-checks every response automatically.
3. Every model anonymously judges its peers; a weighting algorithm refines scores until the leaderboard stabilises (a toy sketch of this loop follows the list).
4. Download a ready-to-share CSV plus an interactive dashboard that plugs into Hugging Face Spaces or your internal BI tools.
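The weighting algorithm itself is not specified here, so the following is only a minimal sketch of the general idea of confidence-weighted peer judging: each model's score is the weighted mean of the grades its peers gave it, a judge's weight tracks its own current score, and the loop repeats until the scores stop moving. The function name, data layout, and convergence rule are illustrative assumptions, not AutoBench's implementation.

```python
# Illustrative sketch of iterative, confidence-weighted peer judging.
# This is NOT AutoBench's actual algorithm; names and the update rule
# are assumptions made for illustration only.
def stabilise(grades: dict[str, dict[str, float]],
              tol: float = 1e-4, max_iter: int = 100) -> dict[str, float]:
    """grades[judge][candidate] = grade that `judge` gave `candidate`."""
    models = list(grades)
    weights = {m: 1.0 for m in models}   # start by trusting every judge equally
    scores = {m: 0.0 for m in models}
    for _ in range(max_iter):
        new_scores = {}
        for candidate in models:
            judges = [j for j in models if j != candidate]  # no self-grading
            total = sum(weights[j] * grades[j][candidate] for j in judges)
            new_scores[candidate] = total / sum(weights[j] for j in judges)
        # Judges that currently score well are weighted more next round.
        weights = {m: max(new_scores[m], 1e-6) for m in models}
        if max(abs(new_scores[m] - scores[m]) for m in models) < tol:
            break
        scores = new_scores
    return new_scores


# Toy usage with three hypothetical models (self-grades are ignored).
grades = {
    "model_a": {"model_a": 0.0, "model_b": 4.2, "model_c": 3.8},
    "model_b": {"model_a": 4.5, "model_b": 0.0, "model_c": 3.9},
    "model_c": {"model_a": 4.4, "model_b": 4.1, "model_c": 0.0},
}
print(stabilise(grades))  # scores settle near 4.45 / 4.15 / 3.85
```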
AutoBench’s effectiveness is not theoretical: the results from its public runs demonstrate both unprecedented scale and exceptionally high correlation with industry-standard benchmarks.
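How such agreement is typically quantified is shown in the short sketch below, which computes a Spearman rank correlation between two leaderboards. The score lists are hypothetical placeholders, and the choice of Spearman (rather than, say, Pearson) is an assumption, since the exact metric behind the quoted figures is not stated here.

```python
# Minimal sketch: checking how well two leaderboards agree using a
# Spearman rank correlation. The numbers are hypothetical placeholders,
# not actual AutoBench, AAII, or LMArena scores.
from scipy.stats import spearmanr

# Scores for the same eight models under two different benchmarks.
autobench_scores = [4.51, 4.49, 4.48, 4.42, 4.41, 4.39, 4.33, 4.32]
reference_scores = [1310, 1301, 1295, 1288, 1275, 1284, 1266, 1259]

rho, p_value = spearmanr(autobench_scores, reference_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.4f})")
```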
We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation. Explore our resources on Hugging Face.
Large corporations project billions in LLM API calls, but relying on a single model for all tasks leads to massive inefficiencies. AutoBench evaluates models on your internal use cases and data, identifying the optimal model for tasks like sentiment analysis, document summarization, or customer support.
Gain immediate visibility into cost-quality trade-offs. By analyzing performance metrics like average answer cost and P99 duration, AutoBench reveals how switching models can save an estimated 20%+ on LLM expenditure without sacrificing quality.
Seamlessly switch to cost-effective models and monitor ongoing performance. Our enterprise-specific benchmarks ensure continuous optimization, preventing overpayments and improving reliability in high-volume AI deployments.
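One simple way to act on those benchmark results is sketched below: for each task, pick the cheapest model whose score clears a quality floor. The task names, model labels, scores, and per-answer costs are hypothetical placeholders rather than AutoBench output, and the threshold rule is just one possible selection policy.

```python
# Hypothetical sketch: per task, choose the cheapest model that still meets a
# minimum quality score. All names and numbers below are placeholders.
candidates = {
    "document summarization": [
        {"model": "large-model",  "score": 4.55, "cost_cents": 1.20},
        {"model": "medium-model", "score": 4.41, "cost_cents": 0.35},
        {"model": "small-model",  "score": 3.90, "cost_cents": 0.08},
    ],
    "sentiment analysis": [
        {"model": "large-model",  "score": 4.60, "cost_cents": 1.10},
        {"model": "small-model",  "score": 4.32, "cost_cents": 0.07},
    ],
}

MIN_SCORE = 4.3  # quality floor; in practice, tune this per task

for task, options in candidates.items():
    eligible = [o for o in options if o["score"] >= MIN_SCORE]
    cheapest = min(eligible, key=lambda o: o["cost_cents"])
    print(f"{task}: {cheapest['model']} at {cheapest['cost_cents']}¢ per answer")
```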
With over 20 major labs competing and a $50M TAM for R&D enablement in 2025, granular evaluation is critical. AutoBench offers private, domain-focused benchmarks that reveal weaknesses in areas like advanced reasoning or specific coding tasks.
Get instant, nuanced views of performance trade-offs through collective LLM judging. Backed by ~300,000 ranks and high correlations (e.g., 86.85% with human preference), it provides actionable data to refine models efficiently.
Monitor progress and switch training strategies with ease. Our scalable framework supports continuous custom runs, helping labs adapt architectures and data for better outcomes in the intensifying AI arms race.
Still have questions? These quick answers clear up the most common concerns about bringing AutoBench into your workflow.