
AutoBench Run 3 - August 2025

Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates

Date: August 14, 2025
Version: 2025-08-14
Models: 33
New Models: 26

Run data

Model  Score       Avg Cost (cents)  Avg Latency (sec)  P99 Latency (sec)  Iterations
-      4.51 (#1)   4.37 (#32)        90.00 (#32)        277.70 (#31)       385
-      4.49 (#2)   0.63 (#22)        65.90 (#26)        231.40 (#21)       392
-      4.48 (#3)   0.14 (#12)        27.01 (#11)        119.20 (#10)       388
-      4.42 (#4)   1.59 (#28)        65.03 (#25)        199.30 (#18)       388
-      4.41 (#5)   1.85 (#30)        63.90 (#23)        276.70 (#30)       391
-      4.39 (#6)   0.42 (#19)        78.79 (#30)        283.80 (#32)       331
-      4.33 (#7)   0.24 (#16)        66.50 (#27)        231.90 (#22)       390
-      4.32 (#8)   0.45 (#20)        48.71 (#19)        244.10 (#26)       387
-      4.31 (#9)   2.92 (#31)        60.96 (#20)        262.50 (#28)       360
-      4.27 (#10)  0.87 (#26)        39.05 (#16)        185.50 (#17)       393
-      4.24 (#11)  9.13 (#33)        48.62 (#18)        155.10 (#15)       387
-      4.18 (#13)  0.63 (#23)        80.74 (#31)        246.00 (#27)       389
-      4.18 (#12)  0.24 (#15)        65.02 (#24)        390.50 (#33)       325
-      4.18 (#14)  0.64 (#24)        119.17 (#33)       265.60 (#29)       385
-      4.17 (#16)  1.71 (#29)        33.67 (#15)        119.60 (#11)       393
-      4.17 (#15)  0.91 (#27)        32.86 (#14)        180.70 (#16)       392
-      4.06 (#17)  0.09 (#9)         26.12 (#10)        116.10 (#9)        391
-      4.02 (#18)  0.11 (#10)        19.16 (#8)         127.40 (#12)       389
-      4.02 (#19)  0.35 (#17)        61.54 (#22)        202.00 (#20)       391
-      3.98 (#21)  0.36 (#18)        68.34 (#28)        240.50 (#24)       392
-      3.98 (#20)  0.08 (#7)         61.12 (#21)        239.20 (#23)       392
-      3.95 (#22)  0.08 (#8)         72.64 (#29)        243.10 (#25)       390
-      3.95 (#23)  0.12 (#11)        40.30 (#17)        199.70 (#19)       392
-      3.88 (#25)  0.05 (#6)         32.64 (#13)        151.40 (#14)       392
-      3.88 (#24)  0.03 (#3)         29.72 (#12)        134.50 (#13)       393
-      3.71 (#26)  0.20 (#14)        17.54 (#7)         89.50 (#7)         390
-      3.71 (#27)  0.61 (#21)        24.36 (#9)         96.90 (#8)         392
-      3.66 (#28)  0.02 (#2)         7.74 (#3)          19.20 (#2)         392
-      3.64 (#29)  0.05 (#5)         10.65 (#4)         71.10 (#6)         388
-      3.61 (#30)  0.04 (#4)         10.87 (#5)         39.60 (#5)         393
-      3.59 (#31)  0.83 (#25)        11.52 (#6)         25.30 (#4)         393
-      3.54 (#32)  0.02 (#1)         5.29 (#1)          10.30 (#1)         393
-      3.49 (#33)  0.18 (#13)        7.53 (#2)          20.20 (#3)         389
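
The (#n) annotations in the run data read as per-column ranks: the highest score is #1, while the lowest cost and the lowest latency are #1. The page does not state how the ranks are computed, so the Python sketch below is only an assumption about that convention. The helper name rank_column is hypothetical, and the sample values are copied from the first three and last two rows of the run data. Ranks within this five-row subset naturally differ from the published ranks, which cover all 33 models.

```python
from typing import List


def rank_column(values: List[float], higher_is_better: bool) -> List[int]:
    """Return 1-based ranks for one column.

    Assumption: ties keep their original row order. The published table
    appears to break ties differently (probably on unrounded values),
    which cannot be reproduced from the rounded figures shown.
    """
    # Sort row indices by value; Python's sort is stable, including with reverse=True.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, row_index in enumerate(order, start=1):
        ranks[row_index] = position
    return ranks


# Sample rows from the run data: first three and last two rows.
scores = [4.51, 4.49, 4.48, 3.54, 3.49]
avg_cost_cents = [4.37, 0.63, 0.14, 0.02, 0.18]
avg_latency_sec = [90.00, 65.90, 27.01, 5.29, 7.53]

score_ranks = rank_column(scores, higher_is_better=True)
cost_ranks = rank_column(avg_cost_cents, higher_is_better=False)
latency_ranks = rank_column(avg_latency_sec, higher_is_better=False)

for row in zip(scores, score_ranks, avg_cost_cents, cost_ranks, avg_latency_sec, latency_ranks):
    print("%.2f (#%d)  %.2f (#%d)  %.2f (#%d)" % row)
```

In this subset the top-scoring row is also the most expensive and the slowest, which mirrors the pattern visible in the full table.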