AutoBench Agronomy LLM Benchmark - December 2025
The first AutoBench run for the Agronomy domain, with models such as Gemini 3 Pro, GPT 5.1, Grok 4.1, and Opus 4.5.
Date: December 10, 2025
Version: 2025-12-10
Models: 40
New Models: 17
Run data
| Model | Score | Avg Cost (US cents) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
| | 3.444 (#38) | 0.01 (#1) | 15s (#7) | 60s (#4) | 205 |
| | 3.611 (#35) | 0.02 (#3) | 15s (#8) | 60s (#5) | 205 |
| | 2.9 (#40) | 0.02 (#4) | 20s (#11) | 143s (#17) | 186 |
| | 3.513 (#36) | 0.02 (#2) | 7s (#1) | 42s (#1) | 205 |
| | 4.46 (#19) | 0.03 (#6) | 22s (#13) | 174s (#26) | 204 |
| | 4.339 (#25) | 0.03 (#7) | 31s (#18) | 112s (#15) | 204 |
| | 3.434 (#39) | 0.03 (#5) | 18s (#10) | 89s (#11) | 194 |
| | 3.659 (#34) | 0.05 (#8) | 12s (#5) | 65s (#7) | 205 |
| | 4.183 (#30) | 0.07 (#10) | 26s (#16) | 101s (#14) | 205 |
| | 4.329 (#26) | 0.07 (#9) | 11s (#4) | 80s (#9) | 200 |
| | 4.574 (#11) | 0.07 (#11) | 35s (#20) | 153s (#21) | 205 |
| | 4.64 (#4) | 0.07 (#12) | 45s (#25) | 177s (#27) | 197 |
| | 4.378 (#23) | 0.08 (#14) | 71s (#36) | 381s (#40) | 194 |
| | 4.582 (#10) | 0.08 (#13) | 24s (#15) | 65s (#6) | 197 |
| | 4.377 (#24) | 0.10 (#17) | 29s (#17) | 156s (#22) | 205 |
| | 4.32 (#27) | 0.10 (#16) | 23s (#14) | 97s (#13) | 204 |
| | 3.911 (#32) | 0.10 (#15) | 8s (#2) | 56s (#3) | 203 |
| | 4.269 (#29) | 0.11 (#18) | 36s (#22) | 166s (#25) | 196 |
| | 4.585 (#9) | 0.13 (#19) | 74s (#37) | 255s (#34) | 193 |
| | 4.279 (#28) | 0.16 (#21) | 35s (#21) | 144s (#20) | 196 |
| | 3.476 (#37) | 0.16 (#20) | 8s (#3) | 46s (#2) | 205 |
| | 4.517 (#17) | 0.21 (#22) | 21s (#12) | 86s (#10) | 205 |
| | 4.163 (#31) | 0.21 (#23) | 36s (#23) | 162s (#24) | 203 |
| | 4.536 (#14) | 0.30 (#24) | 54s (#30) | 159s (#23) | 198 |
| | 4.586 (#8) | 0.33 (#25) | 62s (#31) | 143s (#18) | 175 |
| | 4.556 (#13) | 0.34 (#26) | 51s (#28) | 201s (#29) | 204 |
| | 4.524 (#16) | 0.36 (#27) | 68s (#34) | 239s (#33) | 193 |
| | 4.439 (#22) | 0.40 (#28) | 32s (#19) | 127s (#16) | 204 |
| | 4.475 (#18) | 0.43 (#29) | 17s (#9) | 90s (#12) | 204 |
| | 3.676 (#33) | 0.67 (#30) | 12s (#6) | 73s (#8) | 205 |
| | 4.559 (#12) | 0.80 (#31) | 68s (#33) | 360s (#38) | 192 |
| | 4.594 (#7) | 0.81 (#32) | 74s (#38) | 224s (#31) | 196 |
| | 4.445 (#21) | 1.95 (#33) | 53s (#29) | 365s (#39) | 196 |
| | 4.453 (#20) | 2.08 (#34) | 42s (#24) | 283s (#35) | 203 |
| | 4.535 (#15) | 3.41 (#35) | 70s (#35) | 220s (#30) | 197 |
| | 4.642 (#3) | 3.88 (#36) | 46s (#26) | 143s (#19) | 194 |
| | 4.63 (#5) | 3.95 (#37) | 50s (#27) | 187s (#28) | 205 |
| | 4.827 (#2) | 5.43 (#38) | 112s (#39) | 312s (#36) | 192 |
| | 4.6 (#6) | 7.31 (#39) | 66s (#32) | 238s (#32) | 194 |
| | 4.849 (#1) | 7.70 (#40) | 141s (#40) | 348s (#37) | 195 |
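
Each metric cell in the run data pairs a value with its rank, as in `4.849 (#1)` or `141s (#40)`. For readers who want to work with the numbers, the following is a minimal Python sketch of how such a row might be split into values and ranks. The names `parse_cell` and `parse_row` and the dictionary keys are hypothetical, chosen for illustration; they are not part of AutoBench. The sketch assumes every row carries exactly five populated cells (score, cost, average latency, P99 latency, iterations) and ignores empty cells.

```python
import re
from typing import Dict, Tuple

# Matches cells such as "4.849 (#1)" or "141s (#40)": a number, an optional
# "s" unit suffix, then the rank in parentheses.
CELL = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)s?\s*\(#(\d+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, int]:
    """Split a 'value (#rank)' cell into (value, rank)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), int(match.group(2))


def parse_row(row: str) -> Dict[str, float]:
    """Parse one table row; empty cells are skipped."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    cells = [c for c in cells if c]  # drop empty cells
    score, cost, avg_latency, p99_latency, iterations = cells
    parsed: Dict[str, float] = {"iterations": int(iterations)}
    for key, cell in [
        ("score", score),
        ("cost_cents", cost),
        ("avg_latency_s", avg_latency),
        ("p99_latency_s", p99_latency),
    ]:
        value, rank = parse_cell(cell)
        parsed[key] = value
        parsed[key + "_rank"] = rank
    return parsed
```

Calling `parse_row` on the last row of the table yields a dictionary whose `score` is `4.849` with `score_rank` equal to `1`, which makes it straightforward to sort or filter rows by any metric.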