Announcing AutoBench Agentic: The Next-Generation Agentic Benchmark.
Built on LLM-generated virtual agents, it handles countless agentic tasks to deliver unbiased, granular LLM evaluation.
We are also announcing our latest benchmark run (Run 5), made possible by new platform features for more powerful and efficient benchmarking: Random Score Pooling, Nonlinear Weighting, and Parallel Iteration.
We teamed up with leading agritech company EVJA to drop the first-ever LLM benchmark dedicated to the agricultural sector. 40 models, 4 professional personas, and one major open-source surprise.
This run evaluated 33 models across more than 300 iterations (generated questions), using 21 ranking models and producing over 220,000 individual rankings.
We're thrilled to announce that AutoBench has moved from a promising open-source project to a scientifically validated framework, with our first paper published in collaboration with Sapienza University of Rome.