
AutoBench Run 3 - August 2025

Latest AutoBench run with enhanced metrics, including evaluation iterations and fail rates.

Latest
Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26

Run data

| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 9.13 (#33) | 18.54 (#33) | 5.81 (#33) | 7.95 (#33) | 7.09 (#33) | 6.34 (#33) | 9.15 (#33) | 7.76 (#33) | 8.97 (#33) | 8.11 (#33) | 8.62 (#33) |
| | 4.37 (#32) | 6.01 (#32) | 2.75 (#32) | 3.79 (#32) | 3.07 (#32) | 3.33 (#32) | 3.68 (#32) | 6.20 (#32) | 7.59 (#32) | 3.76 (#32) | 3.82 (#32) |
| | 2.92 (#31) | 5.10 (#31) | 1.49 (#31) | 2.30 (#31) | 1.62 (#31) | 2.76 (#31) | 2.20 (#31) | 5.12 (#31) | 5.97 (#31) | 2.11 (#31) | 2.26 (#31) |
| | 1.85 (#30) | 1.83 (#28) | 0.94 (#29) | 1.52 (#30) | 1.01 (#28) | 1.05 (#28) | 1.34 (#28) | 3.76 (#30) | 5.16 (#30) | 1.26 (#28) | 1.36 (#28) |
| | 1.59 (#28) | 2.77 (#29) | 0.73 (#28) | 1.51 (#29) | 1.12 (#29) | 1.22 (#30) | 1.49 (#29) | 1.52 (#27) | 2.21 (#29) | 1.61 (#30) | 1.52 (#29) |
| | 1.71 (#29) | 3.74 (#30) | 0.99 (#30) | 1.47 (#28) | 1.16 (#30) | 1.13 (#29) | 1.59 (#30) | 1.52 (#28) | 1.81 (#28) | 1.52 (#29) | 1.55 (#30) |
| | 0.91 (#27) | 1.28 (#25) | 0.48 (#25) | 0.80 (#27) | 0.53 (#25) | 0.72 (#26) | 0.75 (#26) | 1.40 (#26) | 1.69 (#27) | 0.76 (#26) | 0.75 (#26) |
| | 0.64 (#24) | 1.31 (#26) | 0.24 (#21) | 0.36 (#19) | 0.29 (#20) | 0.32 (#19) | 0.35 (#19) | 1.36 (#25) | 1.57 (#26) | 0.34 (#19) | 0.35 (#19) |
| | 0.63 (#22) | 1.22 (#24) | 0.22 (#20) | 0.37 (#20) | 0.30 (#21) | 0.37 (#21) | 0.38 (#21) | 1.21 (#24) | 1.44 (#25) | 0.38 (#21) | 0.41 (#21) |
| | 0.87 (#26) | 1.03 (#23) | 0.63 (#27) | 0.71 (#25) | 0.61 (#26) | 0.73 (#27) | 0.69 (#25) | 1.60 (#29) | 1.25 (#24) | 0.67 (#25) | 0.74 (#25) |
| | 0.63 (#23) | 0.84 (#20) | 0.36 (#23) | 0.56 (#23) | 0.42 (#23) | 0.50 (#24) | 0.52 (#23) | 0.89 (#23) | 1.10 (#23) | 0.56 (#23) | 0.58 (#23) |
| | 0.83 (#25) | 1.43 (#27) | 0.60 (#26) | 0.78 (#26) | 0.69 (#27) | 0.70 (#25) | 0.83 (#27) | 0.67 (#20) | 0.87 (#22) | 0.76 (#27) | 0.77 (#27) |
| | 0.35 (#17) | 0.52 (#18) | 0.21 (#19) | 0.17 (#14) | 0.14 (#16) | 0.23 (#16) | 0.21 (#16) | 0.77 (#22) | 0.84 (#21) | 0.21 (#16) | 0.19 (#15) |
| | 0.36 (#18) | 0.65 (#19) | 0.12 (#14) | 0.20 (#17) | 0.14 (#15) | 0.31 (#18) | 0.19 (#15) | 0.67 (#19) | 0.81 (#20) | 0.24 (#18) | 0.27 (#18) |
| | 0.61 (#21) | 0.96 (#21) | 0.46 (#24) | 0.61 (#24) | 0.46 (#24) | 0.46 (#23) | 0.61 (#24) | 0.57 (#17) | 0.73 (#19) | 0.58 (#24) | 0.59 (#24) |
| | 0.42 (#19) | 0.45 (#17) | 0.28 (#22) | 0.44 (#22) | 0.36 (#22) | 0.41 (#22) | 0.38 (#22) | 0.67 (#21) | 0.66 (#18) | 0.37 (#20) | 0.37 (#20) |
| | 0.45 (#20) | 1.01 (#22) | 0.15 (#16) | 0.39 (#21) | 0.24 (#19) | 0.33 (#20) | 0.37 (#20) | 0.40 (#16) | 0.65 (#17) | 0.42 (#22) | 0.43 (#22) |
| | 0.20 (#14) | 0.21 (#13) | 0.10 (#13) | 0.10 (#11) | 0.07 (#11) | 0.11 (#13) | 0.10 (#11) | 0.61 (#18) | 0.52 (#16) | 0.10 (#11) | 0.10 (#11) |
| | 0.24 (#16) | 0.32 (#16) | 0.19 (#18) | 0.18 (#16) | 0.16 (#17) | 0.26 (#17) | 0.22 (#17) | 0.35 (#15) | 0.38 (#15) | 0.17 (#15) | 0.19 (#16) |
| | 0.11 (#10) | 0.16 (#10) | 0.02 (#4) | 0.04 (#6) | 0.03 (#4) | 0.04 (#4) | 0.07 (#10) | 0.28 (#14) | 0.30 (#14) | 0.05 (#8) | 0.08 (#10) |
| | 0.18 (#13) | 0.29 (#14) | 0.15 (#15) | 0.17 (#15) | 0.13 (#14) | 0.18 (#15) | 0.16 (#14) | 0.17 (#10) | 0.25 (#13) | 0.15 (#14) | 0.15 (#14) |
| | 0.24 (#15) | 0.29 (#15) | 0.17 (#17) | 0.24 (#18) | 0.22 (#18) | 0.17 (#14) | 0.29 (#18) | 0.28 (#13) | 0.22 (#12) | 0.24 (#17) | 0.23 (#17) |
| | 0.14 (#12) | 0.20 (#12) | 0.08 (#11) | 0.14 (#13) | 0.10 (#13) | 0.09 (#11) | 0.12 (#13) | 0.16 (#9) | 0.21 (#11) | 0.12 (#13) | 0.14 (#13) |
| | 0.08 (#8) | 0.10 (#7) | 0.04 (#8) | 0.05 (#8) | 0.04 (#9) | 0.06 (#9) | 0.05 (#8) | 0.19 (#12) | 0.19 (#10) | 0.05 (#9) | 0.05 (#8) |
| | 0.08 (#7) | 0.12 (#8) | 0.04 (#9) | 0.05 (#9) | 0.04 (#8) | 0.05 (#8) | 0.05 (#7) | 0.15 (#8) | 0.19 (#9) | 0.05 (#7) | 0.05 (#7) |
| | 0.12 (#11) | 0.18 (#11) | 0.08 (#12) | 0.11 (#12) | 0.08 (#12) | 0.10 (#12) | 0.10 (#12) | 0.18 (#11) | 0.16 (#8) | 0.10 (#12) | 0.10 (#12) |
| | 0.09 (#9) | 0.14 (#9) | 0.05 (#10) | 0.07 (#10) | 0.05 (#10) | 0.07 (#10) | 0.07 (#9) | 0.15 (#7) | 0.13 (#7) | 0.07 (#10) | 0.07 (#9) |
| | 0.05 (#6) | 0.07 (#6) | 0.03 (#7) | 0.04 (#7) | 0.03 (#5) | 0.04 (#7) | 0.05 (#6) | 0.07 (#5) | 0.07 (#6) | 0.04 (#6) | 0.04 (#6) |
| | 0.05 (#5) | 0.06 (#5) | 0.03 (#6) | 0.04 (#4) | 0.04 (#6) | 0.04 (#6) | 0.04 (#4) | 0.07 (#6) | 0.06 (#5) | 0.04 (#5) | 0.04 (#5) |
| | 0.04 (#4) | 0.05 (#4) | 0.03 (#5) | 0.04 (#5) | 0.04 (#7) | 0.04 (#5) | 0.04 (#5) | 0.04 (#4) | 0.05 (#4) | 0.04 (#4) | 0.04 (#4) |
| | 0.03 (#3) | 0.04 (#3) | 0.02 (#2) | 0.03 (#3) | 0.02 (#3) | 0.02 (#3) | 0.03 (#3) | 0.03 (#3) | 0.04 (#3) | 0.03 (#3) | 0.03 (#3) |
| | 0.02 (#2) | 0.03 (#2) | 0.02 (#3) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.04 (#2) | 0.02 (#2) | 0.02 (#2) |
| | 0.02 (#1) | 0.03 (#1) | 0.01 (#1) | 0.02 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) |
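
Each cell in the run data pairs a value with its rank among the 33 models, written as `value (#rank)`. For readers who want to work with the numbers programmatically, the following is a minimal Python sketch of one way to split a row into (value, rank) pairs. The helper name `parse_row` and the pattern `CELL` are hypothetical, introduced only for illustration; they are not part of AutoBench.

```python
import re

# Hypothetical pattern: a decimal value, optional whitespace, then "(#rank)".
CELL = re.compile(r"(\d+(?:\.\d+)?)\s*\(#(\d+)\)")


def parse_row(row: str) -> list[tuple[float, int]]:
    """Return (value, rank) pairs for every 'value (#rank)' cell found in a row.

    The pattern does not rely on a column delimiter, so it handles cells that
    run together as well as cells separated by spaces or pipes.
    """
    return [(float(value), int(rank)) for value, rank in CELL.findall(row)]


# First three cells of the first data row, copied from the table above.
sample = "9.13 (#33)18.54 (#33)5.81 (#33)"
cells = parse_row(sample)
print(cells)
```

Because the sketch only extracts the numeric pairs, mapping them to column names (Average, Coding, and so on) is left to the caller, for example by zipping the result with the header list.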