AutoBench Agronomy LLM Benchmark - December 2025

The first AutoBench run for the Agronomy domain, covering Gemini 3 Pro, GPT 5.1, Grok 4.1, Opus 4.5, and more.

Date: December 10, 2025

Version: 2025-12-10

Average response latency per model, in seconds, across 10 specialized domains. The table lists each model's all-topics average, with its rank in parentheses (#1 is fastest). Use these figures to identify models that consistently respond quickly for your use case, and weigh them against quality before choosing.

Model	Average latency, all topics (rank)
Nova lite v1	6.53s (#1)
Magistral small 2506	7.51s (#2)
Nova pro v1	7.84s (#3)
Gemini 2.5 flash lite	10.98s (#4)
Llama 4 Maverick	12.09s (#5)
Claude 3.5 haiku	12.37s (#6)
Phi 4	14.87s (#7)
Llama 4 scout	15.16s (#8)
Gemini 2.5 flash	16.98s (#9)
Nemotron nano 9b v2	17.50s (#10)
Phi 3 mini 128k instruct	19.89s (#11)
Kimi K2 Instruct	21.11s (#12)
Qwen3 30b a3b instruct 2507	21.87s (#13)
Grok 3 mini	23.30s (#14)
Grok 4.1 fast	24.09s (#15)
DeepSeek V3 0324	26.09s (#16)
Deepseek v3.1	29.33s (#17)
Gemma 3 27b it	30.64s (#18)
Qwen3 next 80b a3b thinking	32.19s (#19)
Gpt oss 120b	34.63s (#20)
GLM 4.5 Air	35.26s (#21)
Llama 3.3 nemotron super 49b v1.5	35.56s (#22)
Llama 3.1 nemotron ultra 253b v1	35.68s (#23)
Claude sonnet 4.5	42.23s (#24)
Grok 4.1 fast thinking	45.41s (#25)
Gemini 3 pro preview	46.15s (#26)
Gemini 2.5 pro	50.43s (#27)
GLM 4.5	50.84s (#28)
Claude haiku 4.5	52.84s (#29)
DeepSeek R1 0528	53.70s (#30)
Mistral large 2512	61.60s (#31)
Claude opus 4.5	66.00s (#32)
Kimi k2 thinking	68.03s (#33)
Minimax m2	68.36s (#34)
Grok 4	70.41s (#35)
Deepseek v3.2 exp	71.34s (#36)
Qwen3 235B A22B Thinking 2507	74.18s (#37)
Gpt 5 mini	74.34s (#38)
Gpt 5	112.19s (#39)
Gpt 5.1	140.66s (#40)
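
To apply the table to a concrete latency budget, the rows can be parsed and filtered programmatically. The following is a minimal sketch, assuming the rows are available as tab-separated text in the `name<TAB>12.34s (#N)` form shown above. The helper names `parse_rows` and `within_budget`, and the 50-second budget, are illustrative assumptions and not part of AutoBench. Only five rows are copied here to keep the example short.

```python
import re

# A few rows copied from the table above (tab-separated: model, "latency (rank)").
ROWS = """\
Nova lite v1\t6.53s (#1)
Claude 3.5 haiku\t12.37s (#6)
Gemini 3 pro preview\t46.15s (#26)
Claude opus 4.5\t66.00s (#32)
Gpt 5.1\t140.66s (#40)
"""

# Matches "<model>\t<seconds>s (#<rank>)"; this layout is assumed from the table.
ROW_RE = re.compile(
    r"^(?P<model>.+?)\t(?P<seconds>\d+(?:\.\d+)?)s \(#(?P<rank>\d+)\)$"
)


def parse_rows(text):
    """Return a list of (model, seconds, rank) tuples, skipping unparseable lines."""
    parsed = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            parsed.append(
                (match["model"], float(match["seconds"]), int(match["rank"]))
            )
    return parsed


def within_budget(rows, max_seconds):
    """Return the names of models whose average latency is at most max_seconds."""
    return [model for model, seconds, _rank in rows if seconds <= max_seconds]


rows = parse_rows(ROWS)
print(within_budget(rows, 50.0))
# ['Nova lite v1', 'Claude 3.5 haiku', 'Gemini 3 pro preview']
```

A filter like this narrows the field by speed only. The average hides tail behaviour and says nothing about answer quality, so the shortlisted models still need to be checked against the benchmark's other metrics.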