AutoBench Run 3 - August 2025
Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates
Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26
Run data
| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 0.02 (#1) | 0.03 (#1) | 0.01 (#1) | 0.02 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) | 0.02 (#1) |
|  | 0.03 (#3) | 0.04 (#3) | 0.02 (#2) | 0.03 (#3) | 0.02 (#3) | 0.02 (#3) | 0.03 (#3) | 0.03 (#3) | 0.04 (#3) | 0.03 (#3) | 0.03 (#3) |
|  | 0.02 (#2) | 0.03 (#2) | 0.02 (#3) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.04 (#2) | 0.02 (#2) | 0.02 (#2) |
|  | 0.11 (#10) | 0.16 (#10) | 0.02 (#4) | 0.04 (#6) | 0.03 (#4) | 0.04 (#4) | 0.07 (#10) | 0.28 (#14) | 0.30 (#14) | 0.05 (#8) | 0.08 (#10) |
|  | 0.04 (#4) | 0.05 (#4) | 0.03 (#5) | 0.04 (#5) | 0.04 (#7) | 0.04 (#5) | 0.04 (#5) | 0.04 (#4) | 0.05 (#4) | 0.04 (#4) | 0.04 (#4) |
|  | 0.05 (#5) | 0.06 (#5) | 0.03 (#6) | 0.04 (#4) | 0.04 (#6) | 0.04 (#6) | 0.04 (#4) | 0.07 (#6) | 0.06 (#5) | 0.04 (#5) | 0.04 (#5) |
|  | 0.05 (#6) | 0.07 (#6) | 0.03 (#7) | 0.04 (#7) | 0.03 (#5) | 0.04 (#7) | 0.05 (#6) | 0.07 (#5) | 0.07 (#6) | 0.04 (#6) | 0.04 (#6) |
|  | 0.08 (#8) | 0.10 (#7) | 0.04 (#8) | 0.05 (#8) | 0.04 (#9) | 0.06 (#9) | 0.05 (#8) | 0.19 (#12) | 0.19 (#10) | 0.05 (#9) | 0.05 (#8) |
|  | 0.08 (#7) | 0.12 (#8) | 0.04 (#9) | 0.05 (#9) | 0.04 (#8) | 0.05 (#8) | 0.05 (#7) | 0.15 (#8) | 0.19 (#9) | 0.05 (#7) | 0.05 (#7) |
|  | 0.09 (#9) | 0.14 (#9) | 0.05 (#10) | 0.07 (#10) | 0.05 (#10) | 0.07 (#10) | 0.07 (#9) | 0.15 (#7) | 0.13 (#7) | 0.07 (#10) | 0.07 (#9) |
|  | 0.14 (#12) | 0.20 (#12) | 0.08 (#11) | 0.14 (#13) | 0.10 (#13) | 0.09 (#11) | 0.12 (#13) | 0.16 (#9) | 0.21 (#11) | 0.12 (#13) | 0.14 (#13) |
|  | 0.12 (#11) | 0.18 (#11) | 0.08 (#12) | 0.11 (#12) | 0.08 (#12) | 0.10 (#12) | 0.10 (#12) | 0.18 (#11) | 0.16 (#8) | 0.10 (#12) | 0.10 (#12) |
|  | 0.20 (#14) | 0.21 (#13) | 0.10 (#13) | 0.10 (#11) | 0.07 (#11) | 0.11 (#13) | 0.10 (#11) | 0.61 (#18) | 0.52 (#16) | 0.10 (#11) | 0.10 (#11) |
|  | 0.36 (#18) | 0.65 (#19) | 0.12 (#14) | 0.20 (#17) | 0.14 (#15) | 0.31 (#18) | 0.19 (#15) | 0.67 (#19) | 0.81 (#20) | 0.24 (#18) | 0.27 (#18) |
|  | 0.18 (#13) | 0.29 (#14) | 0.15 (#15) | 0.17 (#15) | 0.13 (#14) | 0.18 (#15) | 0.16 (#14) | 0.17 (#10) | 0.25 (#13) | 0.15 (#14) | 0.15 (#14) |
|  | 0.45 (#20) | 1.01 (#22) | 0.15 (#16) | 0.39 (#21) | 0.24 (#19) | 0.33 (#20) | 0.37 (#20) | 0.40 (#16) | 0.65 (#17) | 0.42 (#22) | 0.43 (#22) |
|  | 0.24 (#15) | 0.29 (#15) | 0.17 (#17) | 0.24 (#18) | 0.22 (#18) | 0.17 (#14) | 0.29 (#18) | 0.28 (#13) | 0.22 (#12) | 0.24 (#17) | 0.23 (#17) |
|  | 0.24 (#16) | 0.32 (#16) | 0.19 (#18) | 0.18 (#16) | 0.16 (#17) | 0.26 (#17) | 0.22 (#17) | 0.35 (#15) | 0.38 (#15) | 0.17 (#15) | 0.19 (#16) |
|  | 0.35 (#17) | 0.52 (#18) | 0.21 (#19) | 0.17 (#14) | 0.14 (#16) | 0.23 (#16) | 0.21 (#16) | 0.77 (#22) | 0.84 (#21) | 0.21 (#16) | 0.19 (#15) |
|  | 0.63 (#22) | 1.22 (#24) | 0.22 (#20) | 0.37 (#20) | 0.30 (#21) | 0.37 (#21) | 0.38 (#21) | 1.21 (#24) | 1.44 (#25) | 0.38 (#21) | 0.41 (#21) |
|  | 0.64 (#24) | 1.31 (#26) | 0.24 (#21) | 0.36 (#19) | 0.29 (#20) | 0.32 (#19) | 0.35 (#19) | 1.36 (#25) | 1.57 (#26) | 0.34 (#19) | 0.35 (#19) |
|  | 0.42 (#19) | 0.45 (#17) | 0.28 (#22) | 0.44 (#22) | 0.36 (#22) | 0.41 (#22) | 0.38 (#22) | 0.67 (#21) | 0.66 (#18) | 0.37 (#20) | 0.37 (#20) |
|  | 0.63 (#23) | 0.84 (#20) | 0.36 (#23) | 0.56 (#23) | 0.42 (#23) | 0.50 (#24) | 0.52 (#23) | 0.89 (#23) | 1.10 (#23) | 0.56 (#23) | 0.58 (#23) |
|  | 0.61 (#21) | 0.96 (#21) | 0.46 (#24) | 0.61 (#24) | 0.46 (#24) | 0.46 (#23) | 0.61 (#24) | 0.57 (#17) | 0.73 (#19) | 0.58 (#24) | 0.59 (#24) |
|  | 0.91 (#27) | 1.28 (#25) | 0.48 (#25) | 0.80 (#27) | 0.53 (#25) | 0.72 (#26) | 0.75 (#26) | 1.40 (#26) | 1.69 (#27) | 0.76 (#26) | 0.75 (#26) |
|  | 0.83 (#25) | 1.43 (#27) | 0.60 (#26) | 0.78 (#26) | 0.69 (#27) | 0.70 (#25) | 0.83 (#27) | 0.67 (#20) | 0.87 (#22) | 0.76 (#27) | 0.77 (#27) |
|  | 0.87 (#26) | 1.03 (#23) | 0.63 (#27) | 0.71 (#25) | 0.61 (#26) | 0.73 (#27) | 0.69 (#25) | 1.60 (#29) | 1.25 (#24) | 0.67 (#25) | 0.74 (#25) |
|  | 1.59 (#28) | 2.77 (#29) | 0.73 (#28) | 1.51 (#29) | 1.12 (#29) | 1.22 (#30) | 1.49 (#29) | 1.52 (#27) | 2.21 (#29) | 1.61 (#30) | 1.52 (#29) |
|  | 1.85 (#30) | 1.83 (#28) | 0.94 (#29) | 1.52 (#30) | 1.01 (#28) | 1.05 (#28) | 1.34 (#28) | 3.76 (#30) | 5.16 (#30) | 1.26 (#28) | 1.36 (#28) |
|  | 1.71 (#29) | 3.74 (#30) | 0.99 (#30) | 1.47 (#28) | 1.16 (#30) | 1.13 (#29) | 1.59 (#30) | 1.52 (#28) | 1.81 (#28) | 1.52 (#29) | 1.55 (#30) |
|  | 2.92 (#31) | 5.10 (#31) | 1.49 (#31) | 2.30 (#31) | 1.62 (#31) | 2.76 (#31) | 2.20 (#31) | 5.12 (#31) | 5.97 (#31) | 2.11 (#31) | 2.26 (#31) |
|  | 4.37 (#32) | 6.01 (#32) | 2.75 (#32) | 3.79 (#32) | 3.07 (#32) | 3.33 (#32) | 3.68 (#32) | 6.20 (#32) | 7.59 (#32) | 3.76 (#32) | 3.82 (#32) |
|  | 9.13 (#33) | 18.54 (#33) | 5.81 (#33) | 7.95 (#33) | 7.09 (#33) | 6.34 (#33) | 9.15 (#33) | 7.76 (#33) | 8.97 (#33) | 8.11 (#33) | 8.62 (#33) |