AutoBench Run 2 - April 2025

The second major AutoBench run, covering o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, and other models.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

Model
Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology
0.01 (#1) | 0.02 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#1) | 0.01 (#1) | 0.01 (#1)
0.02 (#2) | 0.02 (#2) | 0.01 (#2) | 0.01 (#2) | 0.01 (#2) | 0.01 (#2) | 0.01 (#2) | 0.02 (#2) | 0.03 (#2) | 0.01 (#2) | 0.01 (#2)
0.03 (#3) | 0.04 (#3) | 0.02 (#4) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.04 (#3) | 0.02 (#3) | 0.02 (#3)
0.04 (#4) | 0.08 (#7) | 0.02 (#3) | 0.03 (#4) | 0.03 (#4) | 0.03 (#4) | 0.03 (#4) | 0.04 (#4) | 0.06 (#5) | 0.03 (#5) | 0.03 (#4)
0.04 (#5) | 0.05 (#4) | 0.02 (#5) | 0.03 (#6) | 0.03 (#6) | 0.03 (#6) | 0.03 (#6) | 0.04 (#5) | 0.06 (#4) | 0.03 (#6) | 0.03 (#5)
0.04 (#6) | 0.05 (#5) | 0.03 (#6) | 0.03 (#7) | 0.03 (#7) | 0.03 (#7) | 0.04 (#7) | 0.04 (#7) | 0.06 (#6) | 0.03 (#7) | 0.03 (#6)
0.04 (#7) | 0.06 (#6) | 0.04 (#8) | 0.03 (#5) | 0.03 (#5) | 0.03 (#5) | 0.03 (#5) | 0.04 (#6) | 0.06 (#7) | 0.03 (#4) | 0.03 (#7)
0.05 (#8) | 0.08 (#8) | 0.03 (#7) | 0.04 (#8) | 0.04 (#8) | 0.04 (#8) | 0.04 (#8) | 0.05 (#8) | 0.07 (#8) | 0.04 (#8) | 0.04 (#8)
0.07 (#9) | 0.11 (#9) | 0.04 (#9) | 0.05 (#9) | 0.05 (#9) | 0.05 (#9) | 0.05 (#9) | 0.09 (#9) | 0.11 (#9) | 0.05 (#9) | 0.05 (#9)
0.09 (#10) | 0.14 (#10) | 0.09 (#11) | 0.08 (#12) | 0.08 (#12) | 0.07 (#10) | 0.07 (#11) | 0.11 (#11) | 0.15 (#10) | 0.07 (#12) | 0.08 (#10)
0.09 (#11) | 0.15 (#11) | 0.11 (#13) | 0.07 (#10) | 0.07 (#11) | 0.07 (#11) | 0.07 (#12) | 0.10 (#10) | 0.15 (#11) | 0.07 (#10) | 0.08 (#11)
0.10 (#12) | 0.16 (#12) | 0.06 (#10) | 0.08 (#11) | 0.07 (#10) | 0.07 (#12) | 0.07 (#10) | 0.15 (#13) | 0.17 (#12) | 0.07 (#11) | 0.12 (#14)
0.14 (#13) | 0.25 (#13) | 0.15 (#14) | 0.11 (#14) | 0.11 (#14) | 0.11 (#13) | 0.10 (#14) | 0.14 (#12) | 0.21 (#14) | 0.10 (#14) | 0.10 (#12)
0.15 (#14) | 0.25 (#14) | 0.09 (#12) | 0.10 (#13) | 0.09 (#13) | 0.12 (#14) | 0.10 (#13) | 0.19 (#15) | 0.31 (#15) | 0.09 (#13) | 0.11 (#13)
0.18 (#15) | 0.33 (#15) | 0.15 (#15) | 0.16 (#15) | 0.16 (#15) | 0.15 (#15) | 0.18 (#16) | 0.17 (#14) | 0.21 (#13) | 0.15 (#15) | 0.16 (#16)
0.32 (#16) | 0.51 (#16) | 0.17 (#16) | 0.17 (#16) | 0.17 (#16) | 0.22 (#16) | 0.15 (#15) | 0.60 (#17) | 0.85 (#17) | 0.17 (#16) | 0.15 (#15)
0.52 (#17) | 0.82 (#17) | 0.32 (#17) | 0.29 (#17) | 0.28 (#17) | 0.37 (#17) | 0.28 (#17) | 0.86 (#20) | 1.36 (#21) | 0.29 (#17) | 0.28 (#17)
0.52 (#18) | 0.83 (#18) | 0.46 (#19) | 0.47 (#19) | 0.46 (#19) | 0.42 (#19) | 0.46 (#19) | 0.47 (#16) | 0.72 (#16) | 0.47 (#19) | 0.50 (#19)
0.61 (#19) | 0.93 (#19) | 0.43 (#18) | 0.35 (#18) | 0.38 (#18) | 0.41 (#18) | 0.38 (#18) | 0.96 (#22) | 1.56 (#22) | 0.37 (#18) | 0.36 (#18)
0.79 (#20) | 1.20 (#20) | 0.60 (#21) | 0.63 (#20) | 0.63 (#20) | 0.80 (#21) | 0.61 (#20) | 0.99 (#23) | 1.30 (#19) | 0.61 (#20) | 0.56 (#20)
0.85 (#21) | 1.30 (#21) | 0.55 (#20) | 0.66 (#21) | 0.66 (#21) | 0.65 (#20) | 0.69 (#21) | 0.95 (#21) | 1.72 (#23) | 0.61 (#21) | 0.67 (#21)
1.13 (#22) | 2.26 (#22) | 0.88 (#23) | 0.90 (#22) | 1.13 (#22) | 0.85 (#22) | 1.14 (#22) | 0.84 (#19) | 1.34 (#20) | 1.00 (#22) | 1.00 (#22)
1.23 (#23) | 2.95 (#24) | 0.64 (#22) | 0.98 (#23) | 1.22 (#23) | 1.21 (#23) | 1.16 (#23) | 0.78 (#18) | 1.08 (#18) | 1.13 (#23) | 1.11 (#23)
1.69 (#24) | 2.64 (#23) | 1.03 (#24) | 1.24 (#24) | 1.30 (#24) | 1.41 (#24) | 1.32 (#24) | 2.14 (#24) | 2.83 (#24) | 1.22 (#24) | 1.83 (#24)
4.32 (#25) | 7.97 (#25) | 2.55 (#25) | 2.26 (#25) | 2.74 (#25) | 2.58 (#25) | 2.94 (#25) | 6.54 (#25) | 10.23 (#25) | 2.79 (#25) | 2.59 (#25)