
AutoBench Run 2 - April 2025

The second major AutoBench run, covering o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, and other models.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 4.57 (#1) | 4.55 (#1) | 4.51 (#1) | 4.57 (#1) | 4.59 (#1) | 4.6 (#1) | 4.61 (#1) | 4.48 (#1) | 4.57 (#1) | 4.67 (#1) | 4.61 (#1) |
| | 4.46 (#2) | 4.5 (#2) | 4.42 (#5) | 4.48 (#2) | 4.59 (#2) | 4.53 (#3) | 4.6 (#2) | 4.17 (#5) | 4.17 (#4) | 4.56 (#2) | 4.59 (#2) |
| | 4.39 (#3) | 4.48 (#3) | 4.48 (#2) | 4.32 (#5) | 4.48 (#3) | 4.4 (#5) | 4.54 (#3) | 4.18 (#4) | 4.06 (#6) | 4.45 (#3) | 4.48 (#3) |
| | 4.34 (#4) | 4.42 (#5) | 4.41 (#6) | 4.22 (#8) | 4.3 (#10) | 4.44 (#4) | 4.32 (#9) | 4.3 (#3) | 4.34 (#3) | 4.3 (#8) | 4.4 (#4) |
| | 4.34 (#5) | 4.33 (#6) | 4.47 (#3) | 4.36 (#3) | 4.43 (#4) | 4.54 (#2) | 4.45 (#4) | 4.05 (#9) | 4.07 (#5) | 4.42 (#4) | 4.36 (#6) |
| | 4.26 (#8) | 4.05 (#14) | 4.46 (#4) | 4.29 (#7) | 4.35 (#6) | 4.32 (#8) | 4.39 (#5) | 3.97 (#12) | 3.95 (#9) | 4.35 (#5) | 4.39 (#5) |
| | 4.26 (#7) | 4.17 (#11) | 4.38 (#8) | 4.33 (#4) | 4.33 (#7) | 4.36 (#6) | 4.34 (#7) | 4.06 (#8) | 3.91 (#10) | 4.31 (#7) | 4.36 (#7) |
| | 4.26 (#6) | 4.44 (#4) | 4.35 (#10) | 4.09 (#13) | 4.2 (#12) | 4.23 (#11) | 4.21 (#12) | 4.32 (#2) | 4.41 (#2) | 4.21 (#13) | 4.25 (#12) |
| | 4.2 (#9) | 4.27 (#7) | 4.41 (#7) | 4.15 (#12) | 4.31 (#9) | 4.14 (#15) | 4.34 (#8) | 3.96 (#14) | 3.87 (#13) | 4.29 (#10) | 4.3 (#9) |
| | 4.2 (#10) | 3.98 (#17) | 4.35 (#9) | 4.29 (#6) | 4.36 (#5) | 4.33 (#7) | 4.38 (#6) | 3.9 (#17) | 3.7 (#17) | 4.33 (#6) | 4.34 (#8) |
| | 4.18 (#11) | 4.1 (#13) | 4.3 (#13) | 4.2 (#9) | 4.32 (#8) | 4.27 (#9) | 4.32 (#10) | 3.99 (#11) | 3.68 (#18) | 4.3 (#9) | 4.29 (#11) |
| | 4.17 (#12) | 4.23 (#9) | 4.3 (#14) | 4.06 (#18) | 4.17 (#14) | 4.19 (#13) | 4.21 (#13) | 4.1 (#6) | 4.03 (#7) | 4.22 (#12) | 4.24 (#13) |
| | 4.16 (#13) | 4.25 (#8) | 4.33 (#11) | 4.17 (#11) | 4.17 (#13) | 4.22 (#12) | 4.18 (#14) | 4.07 (#7) | 3.97 (#8) | 4.11 (#19) | 4.13 (#17) |
| | 4.16 (#14) | 4.18 (#10) | 3.99 (#23) | 4.18 (#10) | 4.28 (#11) | 4.24 (#10) | 4.3 (#11) | 3.97 (#13) | 3.85 (#15) | 4.25 (#11) | 4.29 (#10) |
| | 4.1 (#15) | 4.12 (#12) | 4.17 (#18) | 4.08 (#14) | 4.17 (#15) | 4.16 (#14) | 4.16 (#15) | 3.92 (#16) | 3.87 (#14) | 4.19 (#14) | 4.14 (#16) |
| | 4.09 (#16) | 4.01 (#15) | 4.32 (#12) | 4.08 (#15) | 4.14 (#17) | 4.11 (#16) | 4.06 (#21) | 4.04 (#10) | 3.91 (#11) | 4.13 (#18) | 4.12 (#18) |
| | 4.05 (#17) | 3.98 (#18) | 4.19 (#17) | 4.07 (#17) | 4.08 (#21) | 4.05 (#20) | 4.09 (#20) | 3.87 (#19) | 3.88 (#12) | 4.17 (#15) | 4.18 (#15) |
| | 4.02 (#18) | 3.83 (#23) | 4.02 (#22) | 4.07 (#16) | 4.17 (#16) | 4.1 (#17) | 4.13 (#17) | 3.93 (#15) | 3.52 (#24) | 4.15 (#16) | 4.21 (#14) |
| | 4 (#20) | 3.97 (#20) | 4.2 (#16) | 4 (#22) | 4.1 (#20) | 3.97 (#22) | 4.03 (#22) | 3.82 (#22) | 3.79 (#16) | 4.07 (#22) | 4.07 (#23) |
| | 4 (#19) | 3.98 (#19) | 4.04 (#21) | 3.99 (#23) | 4.05 (#22) | 4.1 (#18) | 4.1 (#19) | 3.86 (#20) | 3.64 (#19) | 4.1 (#20) | 4.1 (#19) |
| | 4 (#21) | 3.88 (#21) | 4.04 (#20) | 4.04 (#20) | 4.1 (#19) | 4.09 (#19) | 4.11 (#18) | 3.89 (#18) | 3.53 (#23) | 4.14 (#17) | 4.09 (#20) |
| | 3.99 (#22) | 4 (#16) | 4.2 (#15) | 4.04 (#19) | 4.11 (#18) | 3.98 (#21) | 4.15 (#16) | 3.85 (#21) | 3.44 (#25) | 4.05 (#23) | 4.07 (#22) |
| | 3.89 (#23) | 3.73 (#25) | 3.86 (#24) | 3.86 (#24) | 4.04 (#24) | 3.9 (#25) | 4.02 (#23) | 3.77 (#23) | 3.56 (#21) | 4.02 (#24) | 4.05 (#24) |
| | 3.88 (#24) | 3.86 (#22) | 3.42 (#25) | 4.01 (#21) | 4.05 (#23) | 3.94 (#23) | 4.02 (#24) | 3.66 (#25) | 3.59 (#20) | 4.09 (#21) | 4.08 (#21) |
| | 3.83 (#25) | 3.81 (#24) | 4.06 (#19) | 3.78 (#25) | 3.9 (#25) | 3.91 (#24) | 3.82 (#25) | 3.74 (#24) | 3.56 (#22) | 3.86 (#25) | 3.86 (#25) |