
AutoBench Run 3 - August 2025

Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates

Latest
Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26

Run data

| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 4.51 (#1) | 4.58 (#2) | 4.52 (#3) | 4.59 (#1) | 4.62 (#1) | 4.36 (#7) | 4.64 (#1) | 4.21 (#1) | 4.17 (#5) | 4.65 (#1) | 4.66 (#1) |
| | 4.41 (#5) | 4.43 (#5) | 4.3 (#13) | 4.56 (#2) | 4.5 (#4) | 4.39 (#5) | 4.57 (#2) | 3.96 (#7) | 4.16 (#6) | 4.58 (#4) | 4.61 (#4) |
| | 4.48 (#3) | 4.61 (#1) | 4.42 (#6) | 4.52 (#3) | 4.42 (#7) | 4.45 (#2) | 4.57 (#3) | 4.16 (#3) | 4.25 (#2) | 4.63 (#2) | 4.63 (#2) |
| | 4.49 (#2) | 4.54 (#3) | 4.54 (#1) | 4.5 (#4) | 4.52 (#3) | 4.44 (#4) | 4.56 (#4) | 4.18 (#2) | 4.25 (#1) | 4.62 (#3) | 4.63 (#3) |
| | 4.39 (#6) | 4.31 (#8) | 4.44 (#5) | 4.48 (#5) | 4.56 (#2) | 4.45 (#3) | 4.51 (#5) | 3.84 (#10) | 3.94 (#8) | 4.44 (#9) | 4.54 (#5) |
| | 4.42 (#4) | 4.52 (#4) | 4.39 (#7) | 4.42 (#6) | 4.49 (#5) | 4.45 (#1) | 4.48 (#7) | 4.09 (#4) | 4.24 (#3) | 4.52 (#5) | 4.52 (#7) |
| | 4.18 (#14) | 3.89 (#18) | 4.26 (#17) | 4.38 (#7) | 4.42 (#8) | 4.32 (#9) | 4.47 (#8) | 3.48 (#17) | 3.47 (#18) | 4.48 (#7) | 4.49 (#9) |
| | 4.32 (#8) | 4.42 (#6) | 4.17 (#21) | 4.37 (#8) | 4.33 (#14) | 4.38 (#6) | 4.42 (#11) | 4.01 (#6) | 4.23 (#4) | 4.43 (#10) | 4.39 (#15) |
| | 4.27 (#10) | 4.31 (#10) | 4.24 (#18) | 4.36 (#9) | 4.38 (#12) | 4.26 (#12) | 4.35 (#15) | 3.9 (#8) | 3.84 (#10) | 4.48 (#6) | 4.53 (#6) |
| | 4.33 (#7) | 4.41 (#7) | 4.38 (#8) | 4.35 (#10) | 4.4 (#10) | 4.3 (#11) | 4.45 (#9) | 3.88 (#9) | 4.13 (#7) | 4.37 (#15) | 4.52 (#8) |
| | 4.18 (#12) | 3.95 (#17) | 4.31 (#12) | 4.35 (#11) | 4.41 (#9) | 4.31 (#10) | 4.4 (#12) | 3.56 (#13) | 3.63 (#12) | 4.39 (#14) | 4.4 (#13) |
| | 4.31 (#9) | 4.31 (#9) | 4.35 (#10) | 4.33 (#12) | 4.38 (#11) | 4.34 (#8) | 4.4 (#13) | 4.01 (#5) | 3.85 (#9) | 4.41 (#12) | 4.4 (#12) |
| | 4.24 (#11) | 4.29 (#11) | 4.51 (#4) | 4.3 (#13) | 4.43 (#6) | 4.23 (#14) | 4.44 (#10) | 3.57 (#12) | 3.58 (#13) | 4.42 (#11) | 4.48 (#10) |
| | 4.17 (#15) | 4.19 (#13) | 4.36 (#9) | 4.3 (#14) | 4.33 (#15) | 4.25 (#13) | 4.35 (#14) | 3.55 (#14) | 3.48 (#17) | 4.4 (#13) | 4.39 (#14) |
| | 4.18 (#13) | 4.12 (#14) | 4.54 (#2) | 4.29 (#15) | 4.35 (#13) | 4.19 (#16) | 4.5 (#6) | 3.4 (#21) | 3.52 (#15) | 4.46 (#8) | 4.41 (#11) |
| | 4.17 (#16) | 4.24 (#12) | 4.32 (#11) | 4.26 (#16) | 4.19 (#19) | 4.21 (#15) | 4.23 (#18) | 3.74 (#11) | 3.79 (#11) | 4.27 (#16) | 4.32 (#16) |
| | 3.98 (#20) | 3.8 (#22) | 3.99 (#25) | 4.19 (#17) | 4.19 (#18) | 4.03 (#21) | 4.29 (#16) | 3.42 (#20) | 3.34 (#23) | 4.25 (#18) | 4.27 (#17) |
| | 4.02 (#18) | 3.77 (#24) | 4.27 (#16) | 4.18 (#18) | 4.22 (#16) | 4.09 (#18) | 4.26 (#17) | 3.47 (#18) | 3.43 (#20) | 4.23 (#21) | 4.24 (#19) |
| | 4.06 (#17) | 4.02 (#16) | 4.18 (#20) | 4.16 (#19) | 4.21 (#17) | 4.16 (#17) | 4.21 (#19) | 3.51 (#15) | 3.49 (#16) | 4.26 (#17) | 4.26 (#18) |
| | 3.98 (#21) | 3.83 (#21) | 4.28 (#15) | 4.11 (#20) | 4.09 (#24) | 4.02 (#22) | 4.15 (#23) | 3.44 (#19) | 3.34 (#22) | 4.23 (#20) | 4.16 (#22) |
| | 3.88 (#25) | 3.57 (#27) | 4.29 (#14) | 4.11 (#21) | 4.17 (#20) | 3.97 (#24) | 4.18 (#20) | 3.04 (#30) | 3.08 (#29) | 4.19 (#22) | 4.2 (#20) |
| | 3.95 (#22) | 3.8 (#23) | 4.17 (#22) | 4.09 (#22) | 4.16 (#22) | 3.94 (#25) | 4.14 (#24) | 3.49 (#16) | 3.39 (#21) | 4.13 (#24) | 4.15 (#23) |
| | 4.02 (#19) | 4.11 (#15) | 4.15 (#23) | 4.08 (#23) | 4.17 (#21) | 4.05 (#19) | 4.16 (#22) | 3.34 (#23) | 3.55 (#14) | 4.24 (#19) | 4.19 (#21) |
| | 3.95 (#23) | 3.88 (#19) | 4.19 (#19) | 4.07 (#24) | 4.06 (#25) | 4.05 (#20) | 4.09 (#25) | 3.37 (#22) | 3.44 (#19) | 4.14 (#23) | 4.1 (#25) |
| | 3.88 (#24) | 3.83 (#20) | 4.05 (#24) | 4.04 (#25) | 4.12 (#23) | 3.99 (#23) | 4.16 (#21) | 3.04 (#29) | 3.27 (#25) | 4.12 (#25) | 4.13 (#24) |
| | 3.71 (#27) | 3.74 (#25) | 3.23 (#33) | 3.92 (#26) | 3.89 (#28) | 3.84 (#27) | 3.97 (#26) | 3.22 (#24) | 3.28 (#24) | 4 (#26) | 3.94 (#27) |
| | 3.71 (#26) | 3.5 (#28) | 3.97 (#26) | 3.83 (#27) | 3.93 (#27) | 3.8 (#29) | 3.92 (#27) | 3.11 (#27) | 3.11 (#27) | 3.99 (#27) | 3.95 (#26) |
| | 3.59 (#31) | 3.47 (#30) | 3.86 (#29) | 3.74 (#28) | 4 (#26) | 3.74 (#30) | 3.87 (#28) | 2.82 (#33) | 2.78 (#33) | 3.74 (#33) | 3.81 (#30) |
| | 3.66 (#28) | 3.48 (#29) | 3.97 (#27) | 3.73 (#29) | 3.85 (#29) | 3.66 (#31) | 3.82 (#31) | 3.13 (#26) | 3.2 (#26) | 3.87 (#28) | 3.85 (#28) |
| | 3.54 (#32) | 3.32 (#33) | 3.78 (#31) | 3.71 (#30) | 3.77 (#32) | 3.51 (#33) | 3.76 (#32) | 2.99 (#31) | 2.95 (#31) | 3.82 (#30) | 3.75 (#32) |
| | 3.64 (#29) | 3.59 (#26) | 3.74 (#32) | 3.7 (#31) | 3.78 (#31) | 3.83 (#28) | 3.82 (#30) | 3.13 (#25) | 3.1 (#28) | 3.83 (#29) | 3.79 (#31) |
| | 3.61 (#30) | 3.37 (#31) | 3.86 (#28) | 3.66 (#32) | 3.83 (#30) | 3.85 (#26) | 3.84 (#29) | 3.05 (#28) | 3 (#30) | 3.82 (#31) | 3.84 (#29) |
| | 3.49 (#33) | 3.36 (#32) | 3.84 (#30) | 3.55 (#33) | 3.72 (#33) | 3.53 (#32) | 3.63 (#33) | 2.96 (#32) | 2.85 (#32) | 3.75 (#32) | 3.61 (#33) |