AutoBench Run 2 - April 2025
Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24
Run data
| Model | Score | Avg Cost (USD cents) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
|  | 4.26 (#7) | 0.52 (#17) | 84.77s (#25) | 223.47s (#25) | - |
|  | 4.39 (#3) | - (#25) | 45.80s (#24) | 82.60s (#20) | - |
|  | 4.26 (#6) | 0.32 (#16) | 43.84s (#23) | 94.45s (#21) | - |
|  | 4.16 (#14) | 0.10 (#12) | 42.28s (#22) | 140.54s (#24) | - |
|  | 4.46 (#2) | 1.23 (#23) | 36.57s (#21) | 64.18s (#15) | - |
|  | 4.17 (#12) | 0.10 (#11) | 34.73s (#20) | 66.70s (#16) | - |
|  | 4.09 (#16) | 0.09 (#10) | 34.57s (#19) | 106.53s (#23) | - |
|  | 4.34 (#5) | 1.70 (#24) | 33.94s (#18) | 69.79s (#17) | - |
|  | 4.02 (#18) | 0.04 (#5) | 31.03s (#17) | 73.70s (#18) | - |
|  | 4.2 (#9) | 0.03 (#3) | 30.03s (#16) | 79.12s (#19) | - |
|  | 4.05 (#17) | 0.53 (#18) | 29.18s (#15) | 96.77s (#22) | - |
|  | 4.18 (#11) | 0.04 (#7) | 25.04s (#14) | 48.74s (#13) | - |
|  | 4.57 (#1) | 0.79 (#20) | 19.10s (#13) | 52.30s (#14) | - |
|  | 4.2 (#10) | 1.13 (#22) | 15.53s (#12) | 32.86s (#12) | - |
|  | 4.34 (#4) | 0.14 (#14) | 15.38s (#11) | 29.19s (#10) | - |
|  | 3.88 (#24) | 0.01 (#1) | 13.99s (#10) | 29.62s (#11) | - |
|  | 4 (#19) | 0.04 (#6) | 12.17s (#9) | 21.75s (#6) | - |
|  | 4.1 (#15) | 0.85 (#21) | 11.74s (#8) | 23.32s (#8) | - |
|  | 3.99 (#22) | 0.18 (#15) | 10.80s (#7) | 17.98s (#5) | - |
|  | 4.26 (#8) | 0.61 (#19) | 10.69s (#6) | 23.67s (#9) | - |
|  | 4 (#21) | 0.07 (#9) | 9.76s (#5) | 23.11s (#7) | - |
|  | 4 (#20) | 0.05 (#8) | 8.49s (#4) | 13.82s (#4) | - |
|  | 4.16 (#13) | 0.04 (#4) | 5.76s (#3) | 8.82s (#1) | - |
|  | 3.83 (#25) | 0.14 (#13) | 5.65s (#2) | 9.93s (#2) | - |
|  | 3.89 (#23) | 0.02 (#2) | 5.22s (#1) | 12.47s (#3) | - |
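
Each populated cell in the table pairs a value with a rank in the form `value (#rank)`, with `-` standing in for a missing value. For readers who want to work with the numbers programmatically, the following is a minimal sketch of one way to split such cells. The names `Cell`, `parse_cell`, and `parse_row` are hypothetical helpers written for this illustration, not part of AutoBench, and the sketch assumes the model-name cell is empty or absent, as it is in the rows above.

```python
import re
from typing import NamedTuple, Optional

# Matches cells such as "4.26 (#7)", "84.77s (#25)", "- (#25)", or "-".
# Assumption: an optional trailing "s" marks seconds and can be dropped.
_CELL = re.compile(
    r"^\s*(?P<value>-|\d+(?:\.\d+)?)s?\s*(?:\(#(?P<rank>\d+)\))?\s*$"
)


class Cell(NamedTuple):
    value: Optional[float]  # None when the table shows "-"
    rank: Optional[int]     # None when no "(#n)" annotation is present


def parse_cell(text: str) -> Cell:
    """Split one table cell into its numeric value and its rank."""
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell: {text!r}")
    raw_value, raw_rank = match.group("value"), match.group("rank")
    value = None if raw_value == "-" else float(raw_value)
    rank = None if raw_rank is None else int(raw_rank)
    return Cell(value, rank)


def parse_row(line: str) -> list[Cell]:
    """Parse one pipe-delimited data row, skipping empty cells."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return [parse_cell(c) for c in cells if c]


if __name__ == "__main__":
    row = "| 4.39 (#3) | - (#25) | 45.80s (#24) | 82.60s (#20) | - |"
    for cell in parse_row(row):
        print(cell)
```

Returning `None` for `-` keeps a missing cost or iteration count distinct from a genuine zero, which matters if the parsed values are later averaged or re-ranked.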