AutoBench Run 2 - April 2025
Second major AutoBench run, covering o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, and other models.
Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24
Run data
| Model | AutoBench | Chatbot Arena | AAI Index | MMLU Index |
|---|---|---|---|---|
|  | 4.57 (#1) | - | 69830 (#1) | 0.832 (#4) |
|  | 4.46 (#2) | 1439 (#1) | 67840 (#2) | 0.858 (#1) |
|  | 4.39 (#3) | 1303 (#10) | 57390 (#5) | 0.837 (#3) |
|  | 4.34 (#5) | - | 52860 (#7) | 0.781 (#10) |
|  | 4.34 (#4) | 1402 (#2) | 50630 (#8) | 0.799 (#8) |
|  | 4.26 (#6) | 1358 (#4) | 60220 (#4) | 0.844 (#2) |
|  | 4.26 (#8) | - | - | 0.69 (#18) |
|  | 4.26 (#7) | 1305 (#9) | 62860 (#3) | 0.791 (#9) |
|  | 4.2 (#10) | 1293 (#11) | 48150 (#10) | 0.803 (#7) |
|  | 4.2 (#9) | 1342 (#6) | 37620 (#17) | 0.669 (#19) |
|  | 4.18 (#11) | 1269 (#15) | 37280 (#18) | - |
|  | 4.17 (#12) | 1310 (#8) | - | - |
|  | 4.16 (#13) | 1372 (#3) | 53240 (#6) | 0.819 (#5) |
|  | 4.16 (#14) | 1356 (#5) | 48090 (#11) | 0.779 (#11) |
|  | 4.1 (#15) | 1288 (#12) | 39230 (#15) | 0.709 (#15) |
|  | 4.09 (#16) | 1318 (#7) | 45580 (#12) | 0.752 (#12) |
|  | 4.05 (#17) | 1249 (#17) | 38270 (#16) | 0.697 (#16) |
|  | 4.02 (#18) | 1257 (#16) | 41110 (#14) | 0.713 (#14) |
|  | 4 (#19) | 1272 (#13) | 35680 (#20) | 0.648 (#21) |
|  | 4 (#20) | 1271 (#14) | 50530 (#9) | 0.809 (#6) |
|  | 4 (#21) | - | 42990 (#13) | 0.752 (#13) |
|  | 3.99 (#22) | 1237 (#19) | 34740 (#22) | 0.634 (#22) |
|  | 3.89 (#23) | 1217 (#21) | 32530 (#23) | 0.59 (#23) |
|  | 3.88 (#24) | 1217 (#20) | 35280 (#21) | 0.652 (#20) |
|  | 3.83 (#25) | 1245 (#18) | 37080 (#19) | 0.691 (#17) |
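
Each cell in the table above pairs a score with its rank in the form `score (#rank)`, and `-` marks a benchmark with no value for that model. For readers who want to load the rows programmatically, the following is a minimal Python sketch of one way to parse them. The names `CELL_PATTERN`, `parse_cell`, and `parse_row` are hypothetical helpers introduced for illustration and are not part of AutoBench.

```python
import re
from typing import List, Optional, Tuple

# Assumed cell format, inferred from the table: "4.57 (#1)", "1439 (#1)", or "-".
CELL_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(#(\d+)\)\s*$")


def parse_cell(cell: str) -> Optional[Tuple[float, int]]:
    """Return (score, rank) for a 'score (#rank)' cell, or None when missing."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None  # "-" or any other unparseable cell
    return float(match.group(1)), int(match.group(2))


def parse_row(row: str) -> List[Optional[Tuple[float, int]]]:
    """Split a pipe-delimited row and parse every non-empty cell in order."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    # Empty cells (such as the blank Model column) are skipped.
    return [parse_cell(c) for c in cells if c]


if __name__ == "__main__":
    print(parse_row("| 4.46 (#2) | 1439 (#1) | 67840 (#2) | 0.858 (#1) |"))
```

The parsed values come back in column order (AutoBench, Chatbot Arena, AAI Index, MMLU Index), with `None` wherever the source shows `-`.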