AutoBench Agronomy LLM Benchmark - December 2025

The first AutoBench run for the Agronomy domain, covering Gemini 3 Pro, GPT 5.1, Grok 4.1, Opus 4.5, and more.

Date: December 10, 2025

Version: 2025-12-10

Compare the cost efficiency of LLMs across 10 specialized domains. Costs are reported in cents per response, averaged over all topics, and each model's rank (lowest cost first) appears in parentheses, so you can identify the most economical model for your use case.

Model	Average cost, all topics (cents per response, with rank)
Phi 4	0.01 (#1)
Nova lite v1	0.02 (#2)
Llama 4 scout	0.02 (#3)
Phi 3 mini 128k instruct	0.02 (#4)
Nemotron nano 9b v2	0.03 (#5)
Gemma 3 27b it	0.03 (#6)
Qwen3 30b a3b instruct 2507	0.03 (#7)
Llama 4 Maverick	0.05 (#8)
Gemini 2.5 flash lite	0.07 (#9)
DeepSeek V3 0324	0.07 (#10)
Gpt oss 120b	0.07 (#11)
Grok 4.1 fast thinking	0.07 (#12)
Grok 4.1 fast	0.08 (#13)
Deepseek v3.2 exp	0.08 (#14)
Deepseek v3.1	0.10 (#15)
Grok 3 mini	0.10 (#16)
Magistral small 2506	0.10 (#17)
Llama 3.3 nemotron super 49b v1.5	0.11 (#18)
Qwen3 235B A22B Thinking 2507	0.13 (#19)
GLM 4.5 Air	0.16 (#20)
Nova pro v1	0.16 (#21)
Llama 3.1 nemotron ultra 253b v1	0.21 (#22)
Kimi K2 Instruct	0.21 (#23)
DeepSeek R1 0528	0.30 (#24)
Mistral large 2512	0.33 (#25)
GLM 4.5	0.34 (#26)
Minimax m2	0.36 (#27)
Qwen3 next 80b a3b thinking	0.40 (#28)
Gemini 2.5 flash	0.43 (#29)
Claude 3.5 haiku	0.67 (#30)
Kimi k2 thinking	0.80 (#31)
Gpt 5 mini	0.81 (#32)
Claude haiku 4.5	1.95 (#33)
Claude sonnet 4.5	2.08 (#34)
Grok 4	3.41 (#35)
Gemini 3 pro preview	3.88 (#36)
Gemini 2.5 pro	3.95 (#37)
Gpt 5	5.43 (#38)
Claude opus 4.5	7.31 (#39)
Gpt 5.1	7.70 (#40)
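
To put the per-response figures in budgeting terms, the following minimal sketch converts a few rows copied from the table above into dollars per 1,000 responses and computes a rough cost ratio between two models. It assumes the values are average cents per response, as stated above. The names `COST_CENTS`, `dollars_per_thousand`, and `cost_ratio` are illustrative and not part of AutoBench. Because the published values are rounded to two decimals, ratios involving the cheapest models are only approximate.

```python
# Average cost in cents per response, copied from a few rows of the table above.
COST_CENTS = {
    "Phi 4": 0.01,
    "Grok 4.1 fast": 0.08,
    "Gemini 3 pro preview": 3.88,
    "Claude opus 4.5": 7.31,
    "Gpt 5.1": 7.70,
}


def dollars_per_thousand(cents_per_response: float) -> float:
    """Convert cents per response into dollars per 1,000 responses."""
    return cents_per_response * 1000 / 100


def cost_ratio(model_a: str, model_b: str) -> float:
    """Return how many times more model_a costs per response than model_b."""
    return COST_CENTS[model_a] / COST_CENTS[model_b]


if __name__ == "__main__":
    # List the sampled models from cheapest to most expensive.
    for model, cents in sorted(COST_CENTS.items(), key=lambda kv: kv[1]):
        print(f"{model}: ${dollars_per_thousand(cents):.2f} per 1,000 responses")
    # Rough ratio only: the table values are rounded to two decimals.
    print(f"Gpt 5.1 vs Phi 4: about {cost_ratio('Gpt 5.1', 'Phi 4'):.0f}x")
```

The same conversion applies to any row of the table: multiply the cents-per-response value by 10 to get dollars per 1,000 responses.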