
AutoBench Run 2 - April 2025

Second major AutoBench run, covering o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, and other models.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

Model
AutoBench     Chatbot Arena  AAI Index     MMLU Index
4.57 (#1)     -              69830 (#1)    0.832 (#4)
4.46 (#2)     1439 (#1)      67840 (#2)    0.858 (#1)
4.39 (#3)     1303 (#10)     57390 (#5)    0.837 (#3)
4.34 (#5)     -              52860 (#7)    0.781 (#10)
4.34 (#4)     1402 (#2)      50630 (#8)    0.799 (#8)
4.26 (#6)     1358 (#4)      60220 (#4)    0.844 (#2)
4.26 (#8)     -              -             0.69 (#18)
4.26 (#7)     1305 (#9)      62860 (#3)    0.791 (#9)
4.2 (#10)     1293 (#11)     48150 (#10)   0.803 (#7)
4.2 (#9)      1342 (#6)      37620 (#17)   0.669 (#19)
4.18 (#11)    1269 (#15)     37280 (#18)   -
4.17 (#12)    1310 (#8)      -             -
4.16 (#13)    1372 (#3)      53240 (#6)    0.819 (#5)
4.16 (#14)    1356 (#5)      48090 (#11)   0.779 (#11)
4.1 (#15)     1288 (#12)     39230 (#15)   0.709 (#15)
4.09 (#16)    1318 (#7)      45580 (#12)   0.752 (#12)
4.05 (#17)    1249 (#17)     38270 (#16)   0.697 (#16)
4.02 (#18)    1257 (#16)     41110 (#14)   0.713 (#14)
4 (#19)       1272 (#13)     35680 (#20)   0.648 (#21)
4 (#20)       1271 (#14)     50530 (#9)    0.809 (#6)
4 (#21)       -              42990 (#13)   0.752 (#13)
3.99 (#22)    1237 (#19)     34740 (#22)   0.634 (#22)
3.89 (#23)    1217 (#21)     32530 (#23)   0.59 (#23)
3.88 (#24)    1217 (#20)     35280 (#21)   0.652 (#20)
3.83 (#25)    1245 (#18)     37080 (#19)   0.691 (#17)