AutoBench Run 3 - August 2025

Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates

Date: August 14, 2025

Version: 2025-08-14

Run data

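Comparison of AutoBench scores with other popular benchmarks. AutoBench shows an 87.08% correlation with the Artificial Analysis Intelligence Index, 77.16% with LMArena (Chatbot Arena), and 80.68% with MMLU-Plus. Rows are listed in descending order of Chatbot Arena score, with models that have no Arena score at the end. The number in parentheses is the model's rank on that benchmark, and a dash marks a missing score.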

Model	AutoBench	Chatbot Arena	AAI Index	MMLU Index
gpt-5	4.51 (#1)	1481 (#1)	68950 (#1)	0.871 (#1)
gemini-2.5-pro	4.42 (#4)	1458 (#2)	64630 (#5)	0.862 (#3)
o3	4.41 (#5)	1451 (#3)	67070 (#3)	0.853 (#4)
claude-opus-4-1	4.24 (#11)	1446 (#4)	58830 (#10)	-
grok-4	4.31 (#9)	1430 (#5)	67520 (#2)	0.866 (#2)
Kimi-K2-Instruct	4.18 (#13)	1420 (#6)	48560 (#17)	0.824 (#14)
deepSeek-R1-0528	4.18 (#12)	1418 (#7)	58740 (#11)	0.849 (#5)
GLM-4.5	4.18 (#14)	1414 (#8)	56080 (#14)	0.835 (#8)
gemini-2.5-flash	4.32 (#8)	1409 (#9)	58430 (#12)	0.759 (#23)
gpt-4.1	4.17 (#16)	1406 (#10)	46770 (#18)	0.806 (#19)
Qwen3-235B-A22B-Thinking-2507	4.39 (#6)	1401 (#11)	63590 (#7)	0.843 (#6)
claude-sonnet-4	4.17 (#15)	1399 (#12)	61000 (#9)	0.842 (#7)
o4-mini	4.27 (#10)	1398 (#13)	65050 (#4)	0.832 (#10)
deepSeek-V3-0324	3.95 (#23)	1390 (#14)	43990 (#22)	0.819 (#15)
Qwen3-30B-A3B	3.95 (#22)	1380 (#15)	42340 (#23)	0.777 (#20)
GLM-4.5-Air	3.98 (#20)	1379 (#16)	49475 (#16)	0.815 (#16)
gemma-3-27b-it	3.88 (#25)	1363 (#17)	25220 (#31)	0.669 (#30)
grok-3-mini	4.06 (#17)	1360 (#18)	58010 (#13)	0.828 (#12)
gpt-oss-120b	4.48 (#3)	1356 (#19)	61340 (#8)	0.808 (#18)
gemini-2.5-flash-lite	4.02 (#19)	1351 (#20)	44348 (#21)	0.832 (#9)
magistral-small-2506	3.71 (#27)	1347 (#21)	35950 (#26)	0.746 (#25)
llama-3_1-Nemotron-Ultra-253B-v1	4.02 (#18)	1345 (#22)	46420 (#19)	0.825 (#13)
llama-4-maverick	3.64 (#29)	1330 (#23)	41730 (#24)	0.809 (#17)
Llama-3_3-Nemotron-Super-49B-v1	3.88 (#24)	1324 (#24)	40473 (#25)	0.698 (#27)
llama-4-Scout-17B-16E-Instruct	3.61 (#30)	1318 (#25)	33060 (#27)	0.752 (#24)
claude-3.5-haiku	3.59 (#31)	1317 (#26)	23326 (#33)	0.634 (#31)
mistral-large-2411	3.71 (#26)	1313 (#27)	27013 (#30)	0.697 (#28)
nova-pro-v1	3.49 (#33)	1289 (#28)	28830 (#28)	0.691 (#29)
nova-lite-v1	3.54 (#32)	1262 (#29)	24540 (#32)	0.59 (#32)
phi-4	3.66 (#28)	1258 (#30)	27950 (#29)	0.714 (#26)
Qwen3-14B	3.98 (#21)	-	45235 (#20)	0.774 (#21)
gpt-5-mini	4.49 (#2)	-	63700 (#6)	0.828 (#11)
gpt-5-nano	4.33 (#7)	-	53780 (#15)	0.772 (#22)
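
The correlation percentages quoted above the table can be explored with a short script. The sketch below is illustrative only: the page does not say which correlation coefficient AutoBench reports or which rows enter the calculation, so Pearson correlation is an assumption here, and the names `ROWS`, `pearson`, and `correlation_with` are hypothetical. It uses a handful of rows copied from the table and skips models with a missing score.

```python
import math

# (model, AutoBench score, Chatbot Arena score) copied from a few table rows.
# None stands for the "-" entries, i.e. no score available.
ROWS = [
    ("gpt-5", 4.51, 1481),
    ("gemini-2.5-pro", 4.42, 1458),
    ("o3", 4.41, 1451),
    ("claude-opus-4-1", 4.24, 1446),
    ("grok-4", 4.31, 1430),
    ("gpt-oss-120b", 4.48, 1356),
    ("phi-4", 3.66, 1258),
    ("Qwen3-14B", 3.98, None),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_with(rows):
    """Correlate AutoBench with the other benchmark, dropping missing scores."""
    pairs = [(a, b) for _, a, b in rows if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


if __name__ == "__main__":
    r = correlation_with(ROWS)
    print(f"Pearson r on the sample rows: {r:.4f}")
```

Because only a sample of rows is used, and the coefficient is assumed, the printed value is not expected to match the 77.16% figure. Feeding in every row of the table, or switching to a rank-based coefficient, would be the natural next steps for a closer comparison.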