AutoBench Run 3 - August 2025
Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates
Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26
Run data
| Model | AutoBench | Chatbot Arena | AAI Index | MMLU Index |
|---|---|---|---|---|
| | 4.24 (#11) | 1446 (#4) | 58830 (#10) | - |
| | 3.54 (#32) | 1262 (#29) | 24540 (#32) | 0.59 (#32) |
| | 3.59 (#31) | 1317 (#26) | 23326 (#33) | 0.634 (#31) |
| | 3.88 (#25) | 1363 (#17) | 25220 (#31) | 0.669 (#30) |
| | 3.49 (#33) | 1289 (#28) | 28830 (#28) | 0.691 (#29) |
| | 3.71 (#26) | 1313 (#27) | 27013 (#30) | 0.697 (#28) |
| | 3.88 (#24) | 1324 (#24) | 40473 (#25) | 0.698 (#27) |
| | 3.66 (#28) | 1258 (#30) | 27950 (#29) | 0.714 (#26) |
| | 3.71 (#27) | 1347 (#21) | 35950 (#26) | 0.746 (#25) |
| | 3.61 (#30) | 1318 (#25) | 33060 (#27) | 0.752 (#24) |
| | 4.32 (#8) | 1409 (#9) | 58430 (#12) | 0.759 (#23) |
| | 4.33 (#7) | - | 53780 (#15) | 0.772 (#22) |
| | 3.98 (#21) | - | 45235 (#20) | 0.774 (#21) |
| | 3.95 (#22) | 1380 (#15) | 42340 (#23) | 0.777 (#20) |
| | 4.17 (#16) | 1406 (#10) | 46770 (#18) | 0.806 (#19) |
| | 4.48 (#3) | 1356 (#19) | 61340 (#8) | 0.808 (#18) |
| | 3.64 (#29) | 1330 (#23) | 41730 (#24) | 0.809 (#17) |
| | 3.98 (#20) | 1379 (#16) | 49475 (#16) | 0.815 (#16) |
| | 3.95 (#23) | 1390 (#14) | 43990 (#22) | 0.819 (#15) |
| | 4.18 (#13) | 1420 (#6) | 48560 (#17) | 0.824 (#14) |
| | 4.02 (#18) | 1345 (#22) | 46420 (#19) | 0.825 (#13) |
| | 4.49 (#2) | - | 63700 (#6) | 0.828 (#11) |
| | 4.06 (#17) | 1360 (#18) | 58010 (#13) | 0.828 (#12) |
| | 4.02 (#19) | 1351 (#20) | 44348 (#21) | 0.832 (#9) |
| | 4.27 (#10) | 1398 (#13) | 65050 (#4) | 0.832 (#10) |
| | 4.18 (#14) | 1414 (#8) | 56080 (#14) | 0.835 (#8) |
| | 4.17 (#15) | 1399 (#12) | 61000 (#9) | 0.842 (#7) |
| | 4.39 (#6) | 1401 (#11) | 63590 (#7) | 0.843 (#6) |
| | 4.18 (#12) | 1418 (#7) | 58740 (#11) | 0.849 (#5) |
| | 4.41 (#5) | 1451 (#3) | 67070 (#3) | 0.853 (#4) |
| | 4.42 (#4) | 1458 (#2) | 64630 (#5) | 0.862 (#3) |
| | 4.31 (#9) | 1430 (#5) | 67520 (#2) | 0.866 (#2) |
| | 4.51 (#1) | 1481 (#1) | 68950 (#1) | 0.871 (#1) |
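
Each score cell in the run data pairs a value with its rank in the form `value (#rank)`, and a lone `-` marks a score that is not available. As a minimal sketch for reading such cells programmatically, the snippet below splits one cell into a number and a rank. The function name `parse_cell` and the regular expression are illustrative assumptions, not part of AutoBench.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for cells such as "4.24 (#11)" or "0.871 (#1)".
CELL_PATTERN = re.compile(r"^(?P<value>\d+(?:\.\d+)?)\s*\(#(?P<rank>\d+)\)$")


def parse_cell(cell: str) -> Optional[Tuple[float, int]]:
    """Return (value, rank) for a 'value (#rank)' cell, or None if the score is missing."""
    text = cell.strip()
    if text in ("", "-"):
        # An empty cell or "-" means no score was reported for this benchmark.
        return None
    match = CELL_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    return float(match.group("value")), int(match.group("rank"))


if __name__ == "__main__":
    # Sample cells copied from the run data.
    for sample in ["4.51 (#1)", "58830 (#10)", "0.59 (#32)", "-"]:
        print(repr(sample), "->", parse_cell(sample))
```

Returning `None` for missing scores keeps them distinct from a real value of zero, which matters if the parsed columns are later averaged or compared across benchmarks.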