AutoBench Agronomy LLM Benchmark - December 2025

The first AutoBench run for the Agronomy domain, covering Gemini 3 Pro, Gpt 5.1, Grok 4.1, Opus 4.5, and more.

Date: December 10, 2025

Version: 2025-12-10

Average latency across 10 specialized domains, measured in seconds. The table lists each model's average over all topics, followed by its rank in parentheses (#1 is the lowest average latency). Use these figures to identify models that respond quickly for your use case while maintaining quality standards.

Model	Average (All Topics)
Gpt 5.1	140.66s (#40)
Gpt 5	112.19s (#39)
Gpt 5 mini	74.34s (#38)
Qwen3 235B A22B Thinking 2507	74.18s (#37)
Deepseek v3.2 exp	71.34s (#36)
Grok 4	70.41s (#35)
Minimax m2	68.36s (#34)
Kimi k2 thinking	68.03s (#33)
Claude opus 4.5	66.00s (#32)
Mistral large 2512	61.60s (#31)
DeepSeek R1 0528	53.70s (#30)
Claude haiku 4.5	52.84s (#29)
GLM 4.5	50.84s (#28)
Gemini 2.5 pro	50.43s (#27)
Gemini 3 pro preview	46.15s (#26)
Grok 4.1 fast thinking	45.41s (#25)
Claude sonnet 4.5	42.23s (#24)
Llama 3.1 nemotron ultra 253b v1	35.68s (#23)
Llama 3.3 nemotron super 49b v1.5	35.56s (#22)
GLM 4.5 Air	35.26s (#21)
Gpt oss 120b	34.63s (#20)
Qwen3 next 80b a3b thinking	32.19s (#19)
Gemma 3 27b it	30.64s (#18)
Deepseek v3.1	29.33s (#17)
DeepSeek V3 0324	26.09s (#16)
Grok 4.1 fast	24.09s (#15)
Grok 3 mini	23.30s (#14)
Qwen3 30b a3b instruct 2507	21.87s (#13)
Kimi K2 Instruct	21.11s (#12)
Phi 3 mini 128k instruct	19.89s (#11)
Nemotron nano 9b v2	17.50s (#10)
Gemini 2.5 flash	16.98s (#9)
Llama 4 scout	15.16s (#8)
Phi 4	14.87s (#7)
Claude 3.5 haiku	12.37s (#6)
Llama 4 Maverick	12.09s (#5)
Gemini 2.5 flash lite	10.98s (#4)
Nova pro v1	7.84s (#3)
Magistral small 2506	7.51s (#2)
Nova lite v1	6.53s (#1)
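
To work with these figures outside the page, the rows can be parsed into numbers. The following is a minimal Python sketch, assuming the rows are available as tab-separated text in the `Model<TAB>140.66s (#40)` layout shown above. The names `ROWS`, `ROW_RE`, `parse_rows`, and `fastest_first` are hypothetical helpers introduced for illustration and are not part of AutoBench. Only a handful of rows from the table are included to keep the example short.

```python
import re

# A few rows copied from the table above, in the same tab-separated layout.
ROWS = """\
Gpt 5.1\t140.66s (#40)
Claude opus 4.5\t66.00s (#32)
Gemini 3 pro preview\t46.15s (#26)
Grok 4.1 fast\t24.09s (#15)
Nova lite v1\t6.53s (#1)
"""

# Assumed row shape: "<model>\t<seconds>s (#<rank>)".
ROW_RE = re.compile(r"^(?P<model>.+)\t(?P<secs>\d+(?:\.\d+)?)s \(#(?P<rank>\d+)\)$")


def parse_rows(text):
    """Return a list of (model, seconds, rank) tuples from tab-separated rows."""
    parsed = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:  # silently skip lines that do not match, such as a header row
            parsed.append((m["model"], float(m["secs"]), int(m["rank"])))
    return parsed


def fastest_first(rows):
    """Sort parsed rows by average latency, lowest first."""
    return sorted(rows, key=lambda r: r[1])


rows = parse_rows(ROWS)
ordered = fastest_first(rows)

# Sanity check: rank numbers should grow as latency grows (#1 is the fastest).
ranks_consistent = all(a[2] < b[2] for a, b in zip(ordered, ordered[1:]))

for model, secs, rank in ordered:
    print(f"#{rank:<3} {model:<24} {secs:.2f}s")
```

Sorting on the parsed seconds rather than on the printed rank makes it easy to confirm that the two agree, and the same tuples can be filtered by a latency budget for a given use case.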