AutoBench Agronomy LLM Benchmark - December 2025

The first AutoBench run for the Agronomy domain, covering Gemini 3 Pro, GPT 5.1, Grok 4.1, Opus 4.5, and more.

Past
Date: December 10, 2025
Version: 2025-12-10
Models: 40
New Models: 17

Run data

| Model | Score | Avg Cost (cents, USD) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
|  | 3.444 (#38) | 0.01 (#1) | 15s (#7) | 60s (#4) | 205 |
|  | 3.611 (#35) | 0.02 (#3) | 15s (#8) | 60s (#5) | 205 |
|  | 2.9 (#40) | 0.02 (#4) | 20s (#11) | 143s (#17) | 186 |
|  | 3.513 (#36) | 0.02 (#2) | 7s (#1) | 42s (#1) | 205 |
|  | 4.46 (#19) | 0.03 (#6) | 22s (#13) | 174s (#26) | 204 |
|  | 4.339 (#25) | 0.03 (#7) | 31s (#18) | 112s (#15) | 204 |
|  | 3.434 (#39) | 0.03 (#5) | 18s (#10) | 89s (#11) | 194 |
|  | 3.659 (#34) | 0.05 (#8) | 12s (#5) | 65s (#7) | 205 |
|  | 4.183 (#30) | 0.07 (#10) | 26s (#16) | 101s (#14) | 205 |
|  | 4.329 (#26) | 0.07 (#9) | 11s (#4) | 80s (#9) | 200 |
|  | 4.574 (#11) | 0.07 (#11) | 35s (#20) | 153s (#21) | 205 |
|  | 4.64 (#4) | 0.07 (#12) | 45s (#25) | 177s (#27) | 197 |
|  | 4.378 (#23) | 0.08 (#14) | 71s (#36) | 381s (#40) | 194 |
|  | 4.582 (#10) | 0.08 (#13) | 24s (#15) | 65s (#6) | 197 |
|  | 4.377 (#24) | 0.10 (#17) | 29s (#17) | 156s (#22) | 205 |
|  | 4.32 (#27) | 0.10 (#16) | 23s (#14) | 97s (#13) | 204 |
|  | 3.911 (#32) | 0.10 (#15) | 8s (#2) | 56s (#3) | 203 |
|  | 4.269 (#29) | 0.11 (#18) | 36s (#22) | 166s (#25) | 196 |
|  | 4.585 (#9) | 0.13 (#19) | 74s (#37) | 255s (#34) | 193 |
|  | 4.279 (#28) | 0.16 (#21) | 35s (#21) | 144s (#20) | 196 |
|  | 3.476 (#37) | 0.16 (#20) | 8s (#3) | 46s (#2) | 205 |
|  | 4.517 (#17) | 0.21 (#22) | 21s (#12) | 86s (#10) | 205 |
|  | 4.163 (#31) | 0.21 (#23) | 36s (#23) | 162s (#24) | 203 |
|  | 4.536 (#14) | 0.30 (#24) | 54s (#30) | 159s (#23) | 198 |
|  | 4.586 (#8) | 0.33 (#25) | 62s (#31) | 143s (#18) | 175 |
|  | 4.556 (#13) | 0.34 (#26) | 51s (#28) | 201s (#29) | 204 |
|  | 4.524 (#16) | 0.36 (#27) | 68s (#34) | 239s (#33) | 193 |
|  | 4.439 (#22) | 0.40 (#28) | 32s (#19) | 127s (#16) | 204 |
|  | 4.475 (#18) | 0.43 (#29) | 17s (#9) | 90s (#12) | 204 |
|  | 3.676 (#33) | 0.67 (#30) | 12s (#6) | 73s (#8) | 205 |
|  | 4.559 (#12) | 0.80 (#31) | 68s (#33) | 360s (#38) | 192 |
|  | 4.594 (#7) | 0.81 (#32) | 74s (#38) | 224s (#31) | 196 |
|  | 4.445 (#21) | 1.95 (#33) | 53s (#29) | 365s (#39) | 196 |
|  | 4.453 (#20) | 2.08 (#34) | 42s (#24) | 283s (#35) | 203 |
|  | 4.535 (#15) | 3.41 (#35) | 70s (#35) | 220s (#30) | 197 |
|  | 4.642 (#3) | 3.88 (#36) | 46s (#26) | 143s (#19) | 194 |
|  | 4.63 (#5) | 3.95 (#37) | 50s (#27) | 187s (#28) | 205 |
|  | 4.827 (#2) | 5.43 (#38) | 112s (#39) | 312s (#36) | 192 |
|  | 4.6 (#6) | 7.31 (#39) | 66s (#32) | 238s (#32) | 194 |
|  | 4.849 (#1) | 7.70 (#40) | 141s (#40) | 348s (#37) | 195 |
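
For readers who want to work with these figures programmatically, the following is a minimal Python sketch of one way to split a row of the run data into its metrics. It assumes each metric is written as `value (#rank)`, optionally with an `s` suffix on latencies, and that the last number on the line is the iteration count. The names `Cell`, `Row`, and `parse_row` are hypothetical helpers introduced for illustration only; they are not part of AutoBench.

```python
import re
from typing import NamedTuple

# One "value (#rank)" pair; latencies carry an optional trailing "s".
CELL = re.compile(r"(\d+(?:\.\d+)?)s?\s*\(#(\d+)\)")


class Cell(NamedTuple):
    value: float
    rank: int


class Row(NamedTuple):
    score: Cell
    cost_cents: Cell
    avg_latency_s: Cell
    p99_latency_s: Cell
    iterations: int


def parse_row(line: str) -> Row:
    """Parse one run-data row; separators between cells are ignored."""
    cells = [Cell(float(v), int(r)) for v, r in CELL.findall(line)]
    if len(cells) != 4:
        raise ValueError(f"expected 4 ranked cells, found {len(cells)}")
    # The iteration count is whatever integer follows the last ")".
    tail = line.rsplit(")", 1)[-1]
    match = re.search(r"\d+", tail)
    if match is None:
        raise ValueError("missing iteration count")
    return Row(*cells, int(match.group()))


# Three rows copied from the run data above.
rows = [
    "4.849 (#1)7.70 (#40)141s (#40)348s (#37)195",
    "4.64 (#4)0.07 (#12)45s (#25)177s (#27)197",
    "3.513 (#36)0.02 (#2)7s (#1)42s (#1)205",
]

parsed = [parse_row(r) for r in rows]
best = min(parsed, key=lambda r: r.score.rank)
print(best.score.value, best.cost_cents.value)
```

Because the pattern keys on the `(#rank)` suffix rather than on whitespace, the same function should also accept rows whose cells are separated by spaces or table delimiters.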