AutoBench Run 5 - December 2025

The latest AutoBench run, covering GPT-5.2, Claude Opus 4.5, Gemini 3 Flash, and more.

Date: December 19, 2025

Version: 2025-12-19

Run data

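Comparison of AutoBench scores with other popular benchmarks. AutoBench scores show an 87.08% correlation with the Artificial Analysis Intelligence Index (AAI Index), 77.16% with LMArena (Chatbot Arena), and 80.68% with MMLU-Plus. Each cell gives the score followed by the model's rank on that benchmark in parentheses; a dash means no score is available. Rows appear in descending order of AAI Index, with models lacking an AAI score listed last.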

Model	AutoBench	Chatbot Arena	AAI Index	MMLU Index
Gpt-5.2-pro	4.48 (#1)	-	73 (#2)	0.87 (#6)
gemini-3-pro-preview	4.41 (#3)	1492 (#1)	73 (#1)	0.9 (#2)
Gemini-3-flash-preview	4.3 (#7)	-	71 (#3)	0.89 (#3)
Claude-opus-4.5	4.39 (#4)	1470 (#3)	70 (#4)	0.9 (#1)
gpt-5.1	4.38 (#5)	1457 (#4)	70 (#5)	0.87 (#5)
Kimi-K2-thinking	4.32 (#6)	1429 (#7)	67 (#6)	0.85 (#12)
grok-4	4.2 (#12)	1478 (#2)	65 (#7)	0.87 (#7)
gpt-5-mini	4.29 (#10)	1392 (#18)	64 (#9)	0.84 (#15)
grok-4.1-fast-thinking	4.21 (#11)	-	64 (#8)	0.85 (#11)
claude-sonnet-4.5	4.3 (#8)	1450 (#6)	63 (#10)	0.88 (#4)
Minimax-m2	3.99 (#26)	1345 (#24)	61 (#11)	0.82 (#21)
gpt-oss-120b	4.18 (#14)	1352 (#23)	61 (#12)	0.81 (#24)
gemini-2.5-pro	4.29 (#9)	1451 (#5)	60 (#13)	0.86 (#9)
Deepseek-v3.2-speciale	4.14 (#17)	1418 (#9)	59 (#14)	0.86 (#8)
Qwen3-235B-A22B-Thinking-2507	4.2 (#13)	1397 (#16)	57 (#15)	0.84 (#16)
GLM-4.6	4.13 (#18)	1425 (#8)	56 (#16)	0.83 (#17)
claude-haiku-4.5	4.17 (#16)	1402 (#15)	55 (#17)	0.76 (#30)
Qwen3-next-80b-a3b-thinking	4.03 (#24)	1367 (#22)	54 (#18)	0.82 (#22)
Deepseek-v3.2	4.11 (#20)	1414 (#12)	52 (#20)	0.84 (#13)
Gpt-oss-20b	3.78 (#35)	1318 (#28)	52 (#22)	0.75 (#31)
Nemotron-3-nano-30b-a3b	4.03 (#25)	-	52 (#21)	0.79 (#28)
deepSeek-R1-0528	4.12 (#19)	1395 (#17)	52 (#19)	0.85 (#10)
gemini-2.5-flash	4.17 (#15)	1408 (#14)	51 (#23)	0.84 (#14)
gpt-5-nano	4.06 (#22)	1339 (#26)	51 (#24)	0.77 (#29)
Kimi-K2-0905	4.11 (#21)	1416 (#10)	50 (#25)	0.82 (#20)
GLM-4.5-Air	3.86 (#31)	1370 (#21)	49 (#26)	0.82 (#19)
Nova-2-lite-v1	4.06 (#23)	1334 (#27)	47 (#27)	0.81 (#27)
Qwen3-235B-A22B-2507	3.98 (#27)	1374 (#20)	45 (#28)	0.83 (#18)
llama-3.3-nemotron-super-49b-v1.5	3.78 (#34)	1340 (#25)	45 (#29)	0.81 (#25)
gemini-2.5-flash-lite	3.95 (#28)	1378 (#19)	40 (#30)	0.81 (#23)
Mistral-large-2512	3.94 (#29)	1415 (#11)	38 (#31)	0.81 (#26)
grok-4.1-fast	3.88 (#30)	-	38 (#32)	0.74 (#32)
nemotron-nano-9b-v2	3.5 (#37)	-	37 (#33)	0.74 (#33)
Mistral-medium-3.1	3.81 (#33)	1411 (#13)	35 (#34)	0.68 (#35)
nova-premier-v1	3.47 (#38)	-	32 (#35)	0.73 (#34)
Ministral-8b-2512	3.57 (#36)	-	28 (#36)	0.64 (#36)
Gpt-5.2	4.43 (#2)	-	-	-
Olmo-3.1-32b-think	3.85 (#32)	-	-	-
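
The correlation figures quoted above the table can be illustrated with a short sketch. It assumes a plain Pearson correlation, which the page does not confirm, and uses only a hand-picked subset of (AutoBench, AAI Index) pairs copied from the table, so the result is not expected to reproduce the published 87.08%. The names `rows` and `pearson` are hypothetical and exist only for this illustration.

```python
from math import sqrt

# (AutoBench score, AAI Index) pairs copied from a subset of the table rows.
# This is an illustrative sample, not the full data set used by AutoBench.
rows = {
    "Gpt-5.2-pro": (4.48, 73),
    "gemini-3-pro-preview": (4.41, 73),
    "Claude-opus-4.5": (4.39, 70),
    "gpt-5.1": (4.38, 70),
    "Kimi-K2-thinking": (4.32, 67),
    "grok-4": (4.2, 65),
    "gpt-oss-120b": (4.18, 61),
    "GLM-4.6": (4.13, 56),
    "Gpt-oss-20b": (3.78, 52),
    "Mistral-medium-3.1": (3.81, 35),
    "nova-premier-v1": (3.47, 32),
    "Ministral-8b-2512": (3.57, 28),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and the sums of squared
    # deviations for each variable (used in the denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


autobench = [scores[0] for scores in rows.values()]
aai = [scores[1] for scores in rows.values()]
r = pearson(autobench, aai)
print(f"Pearson r on {len(rows)} sampled models: {r:.4f}")
```

Models with a dash in either column would have to be dropped before calling `pearson`, since a pair with a missing score cannot contribute to the sums.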