AutoBench Run 4 - November 2025

Latest AutoBench run, covering Gemini 3 Pro, GPT-5.1, Grok 4.1, and more.

Date: November 28, 2025

Version: 2025-11-28

Run data

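Comparison of AutoBench scores with other popular benchmarks. AutoBench scores show an 87.08% correlation with the Artificial Analysis Intelligence Index (AAI Index), 77.16% with LMArena (Chatbot Arena), and 80.68% with MMLU-Plus. Each cell gives the score followed by the model's rank on that benchmark in parentheses; a dash marks a missing score. Rows are listed in ascending order of AAI Index.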

Model	AutoBench	Chatbot Arena	AAI Index	MMLU Index
grok-4.1-fast	4.17 (#20)	1462 (#3)	-	-
gemma-3-27b-it	3.7 (#26)	1364 (#21)	22 (#31)	0.67 (#31)
phi-4	3.46 (#31)	1255 (#30)	23 (#30)	0.71 (#28)
nova-premier-v1	3.7 (#27)	-	25 (#29)	0.73 (#26)
llama-3.3-70B-Instruct	3.55 (#30)	1319 (#27)	28 (#28)	0.71 (#27)
mistral-small-3.2-24b-instruct	3.81 (#25)	1354 (#22)	29 (#27)	0.68 (#30)
nova-pro-v1	3.35 (#32)	1288 (#29)	32 (#26)	0.69 (#29)
magistral-medium-2506	3.95 (#24)	1305 (#28)	33 (#25)	0.82 (#17)
llama-4-maverick	3.61 (#28)	1327 (#26)	36 (#24)	0.81 (#21)
Qwen3-30B-A3B-Instruct-2507	4.21 (#19)	1382 (#18)	37 (#22)	0.78 (#22)
nemotron-nano-9b-v2	3.6 (#29)	-	37 (#23)	0.74 (#25)
Qwen3-235B-A22B-2507	4.24 (#15)	1374 (#20)	45 (#20)	0.83 (#15)
llama-3.3-nemotron-super-49b-v1.5	4.1 (#22)	1340 (#24)	45 (#21)	0.81 (#20)
gemini-2.5-flash-lite	4.22 (#17)	1380 (#19)	48 (#19)	0.81 (#18)
gpt-5-nano	4.32 (#8)	1338 (#25)	49 (#18)	0.77 (#23)
Kimi-K2-0905	4.21 (#18)	1416 (#12)	50 (#17)	0.82 (#16)
deepSeek-R1-0528	4.23 (#16)	1395 (#17)	52 (#16)	0.85 (#7)
gemini-2.5-flash	4.3 (#10)	1405 (#14)	54 (#15)	0.84 (#11)
claude-haiku-4.5	4.27 (#13)	1402 (#15)	55 (#14)	0.76 (#24)
GLM-4.6	4.25 (#14)	1426 (#10)	56 (#13)	0.83 (#13)
Qwen3-235B-A22B-Thinking-2507	4.28 (#11)	1397 (#16)	57 (#11)	0.84 (#12)
deepSeek-v3.2-exp	4.16 (#21)	1421 (#11)	57 (#10)	0.85 (#8)
grok-3-mini	4.08 (#23)	1410 (#13)	57 (#12)	0.83 (#14)
claude-opus-4-1	4.27 (#12)	1449 (#7)	59 (#9)	0.88 (#2)
gemini-2.5-pro	4.37 (#4)	1451 (#5)	60 (#8)	0.86 (#6)
gpt-oss-120b	4.37 (#5)	1352 (#23)	61 (#7)	0.81 (#19)
claude-sonnet-4.5	4.31 (#9)	1449 (#6)	63 (#6)	0.88 (#3)
grok-4.1-fast-thinking	4.34 (#6)	1481 (#2)	64 (#5)	0.85 (#9)
Kimi-K2-thinking	4.34 (#7)	1429 (#9)	67 (#4)	0.85 (#10)
gpt-5	4.45 (#2)	1437 (#8)	68 (#3)	0.87 (#4)
gpt-5.1	4.49 (#1)	1454 (#4)	70 (#2)	0.87 (#5)
gemini-3-pro-preview	4.39 (#3)	1495 (#1)	73 (#1)	0.9 (#1)
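
The correlation figures quoted above can be explored with a small calculation. The page does not say which correlation measure or which set of models AutoBench uses, so the following is only a minimal sketch that assumes a plain Pearson coefficient and uses a handful of rows copied from the table. The `pearson` helper and the `rows` list are illustrative names, not part of AutoBench, and the result on this subset should not be expected to match the published 87.08%.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (model, AutoBench score, AAI Index) copied from a few rows of the table.
# None stands for a "-" cell, i.e. a missing score (assumed meaning).
rows = [
    ("grok-4.1-fast", 4.17, None),
    ("gemma-3-27b-it", 3.7, 22),
    ("phi-4", 3.46, 23),
    ("llama-4-maverick", 3.61, 36),
    ("gemini-2.5-flash", 4.3, 54),
    ("gpt-5", 4.45, 68),
    ("gpt-5.1", 4.49, 70),
    ("gemini-3-pro-preview", 4.39, 73),
]

# Keep only models that have a score on both benchmarks.
paired = [(ab, aai) for _, ab, aai in rows if ab is not None and aai is not None]
autobench = [ab for ab, _ in paired]
aai_index = [aai for _, aai in paired]

r = pearson(autobench, aai_index)
print(f"Pearson r on this {len(paired)}-model subset: {r:.4f}")
```

Dropping rows with a missing score before pairing is one reasonable choice. A rank-based measure such as Spearman would be another, and it could give a noticeably different number on the same data.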