
AutoBench Run 3 - August 2025

Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates

Latest
Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26

Run data

Model   AutoBench    Chatbot Arena   AAI Index      MMLU Index
        4.51 (#1)    1481 (#1)       68950 (#1)     0.871 (#1)
        4.42 (#4)    1458 (#2)       64630 (#5)     0.862 (#3)
        4.41 (#5)    1451 (#3)       67070 (#3)     0.853 (#4)
        4.24 (#11)   1446 (#4)       58830 (#10)    -
        4.31 (#9)    1430 (#5)       67520 (#2)     0.866 (#2)
        4.18 (#13)   1420 (#6)       48560 (#17)    0.824 (#14)
        4.18 (#12)   1418 (#7)       58740 (#11)    0.849 (#5)
        4.18 (#14)   1414 (#8)       56080 (#14)    0.835 (#8)
        4.32 (#8)    1409 (#9)       58430 (#12)    0.759 (#23)
        4.17 (#16)   1406 (#10)      46770 (#18)    0.806 (#19)
        4.39 (#6)    1401 (#11)      63590 (#7)     0.843 (#6)
        4.17 (#15)   1399 (#12)      61000 (#9)     0.842 (#7)
        4.27 (#10)   1398 (#13)      65050 (#4)     0.832 (#10)
        3.95 (#23)   1390 (#14)      43990 (#22)    0.819 (#15)
        3.95 (#22)   1380 (#15)      42340 (#23)    0.777 (#20)
        3.98 (#20)   1379 (#16)      49475 (#16)    0.815 (#16)
        3.88 (#25)   1363 (#17)      25220 (#31)    0.669 (#30)
        4.06 (#17)   1360 (#18)      58010 (#13)    0.828 (#12)
        4.48 (#3)    1356 (#19)      61340 (#8)     0.808 (#18)
        4.02 (#19)   1351 (#20)      44348 (#21)    0.832 (#9)
        3.71 (#27)   1347 (#21)      35950 (#26)    0.746 (#25)
        4.02 (#18)   1345 (#22)      46420 (#19)    0.825 (#13)
        3.64 (#29)   1330 (#23)      41730 (#24)    0.809 (#17)
        3.88 (#24)   1324 (#24)      40473 (#25)    0.698 (#27)
        3.61 (#30)   1318 (#25)      33060 (#27)    0.752 (#24)
        3.59 (#31)   1317 (#26)      23326 (#33)    0.634 (#31)
        3.71 (#26)   1313 (#27)      27013 (#30)    0.697 (#28)
        3.49 (#33)   1289 (#28)      28830 (#28)    0.691 (#29)
        3.54 (#32)   1262 (#29)      24540 (#32)    0.59 (#32)
        3.66 (#28)   1258 (#30)      27950 (#29)    0.714 (#26)
        3.98 (#21)   -               45235 (#20)    0.774 (#21)
        4.49 (#2)    -               63700 (#6)     0.828 (#11)
        4.33 (#7)    -               53780 (#15)    0.772 (#22)