AutoBench Run 2 - April 2025
Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.
Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24
Run data
| Model | Score | Avg Cost ($ Cents) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
| | 3.89 (#23) | 0.02 (#2) | 5s (#1) | 12s (#3) | - |
| | 3.83 (#25) | 0.14 (#13) | 6s (#2) | 10s (#2) | - |
| | 4.16 (#13) | 0.04 (#4) | 6s (#3) | 9s (#1) | - |
| | 4 (#20) | 0.05 (#8) | 8s (#4) | 14s (#4) | - |
| | 4 (#21) | 0.07 (#9) | 10s (#5) | 23s (#7) | - |
| | 4.26 (#8) | 0.61 (#19) | 11s (#6) | 24s (#9) | - |
| | 3.99 (#22) | 0.18 (#15) | 11s (#7) | 18s (#5) | - |
| | 4.1 (#15) | 0.85 (#21) | 12s (#8) | 23s (#8) | - |
| | 4 (#19) | 0.04 (#6) | 12s (#9) | 22s (#6) | - |
| | 3.88 (#24) | 0.01 (#1) | 14s (#10) | 30s (#11) | - |
| | 4.34 (#4) | 0.14 (#14) | 15s (#11) | 29s (#10) | - |
| | 4.2 (#10) | 1.13 (#22) | 16s (#12) | 33s (#12) | - |
| | 4.57 (#1) | 0.79 (#20) | 19s (#13) | 52s (#14) | - |
| | 4.18 (#11) | 0.04 (#7) | 25s (#14) | 49s (#13) | - |
| | 4.05 (#17) | 0.53 (#18) | 29s (#15) | 97s (#22) | - |
| | 4.2 (#9) | 0.03 (#3) | 30s (#16) | 79s (#19) | - |
| | 4.02 (#18) | 0.04 (#5) | 31s (#17) | 74s (#18) | - |
| | 4.34 (#5) | 1.70 (#24) | 34s (#18) | 70s (#17) | - |
| | 4.09 (#16) | 0.09 (#10) | 35s (#19) | 107s (#23) | - |
| | 4.17 (#12) | 0.10 (#11) | 35s (#20) | 67s (#16) | - |
| | 4.46 (#2) | 1.23 (#23) | 37s (#21) | 64s (#15) | - |
| | 4.16 (#14) | 0.10 (#12) | 42s (#22) | 141s (#24) | - |
| | 4.26 (#6) | 0.32 (#16) | 44s (#23) | 94s (#21) | - |
| | 4.39 (#3) | - (#25) | 46s (#24) | 83s (#20) | - |
| | 4.26 (#7) | 0.52 (#17) | 85s (#25) | 223s (#25) | - |