For more data, visit: autobench.org or our Hugging Face Leaderboard.
AutoBench has already proven to be the most unbiased, granular, and versatile framework in the LLM evaluation space. Based on the “Collective-LLM-as-a-Judge” paradigm, AutoBench uses pools of LLMs to generate prompts, responses, and evaluations. But it is with agentic benchmarking that AutoBench truly shines, demonstrating a generational leap from traditional static, task-limited, saturating, and easily gameable benchmarks.
Today we announce AutoBench Agentic, releasing the first agentic benchmark capable of covering hundreds of dynamically generated business cases across 10 diverse operator roles, 10 different business domains, and 10 distinct types of agentic tasks. As usual, we provide not only performance metrics but also average response cost, latency, and P99, split across these 10 types of agentic calls. (Spoiler: the newly released Claude Opus 4.7 dominates). This dataset, accessible via autobench.org or our Hugging Face Leaderboard, delivers highly valuable, real-time intelligence for agentic developers, AI labs, and enterprise architects who need to know exactly which models can orchestrate complex workflows in production right now.

We are bringing all the core benefits of AutoBench to the agentic era: limited bias, extreme granularity, immense versatility, and complete resistance to benchmark data overfitting. But where AutoBench Agentic really shines is in the way it builds Virtual Environments for every single agentic task. This dynamic generation provides incredible variability, enabling us to achieve strong correlations with standard agentic benchmarks while remaining strictly un-gameable.
The Problem: The Agentic Evaluation Crisis
We had to build this because the industry is flying blind. As enterprises race to deploy autonomous agents, they are relying on evaluation frameworks that are fundamentally broken.
Current agentic benchmarks suffer from two fatal flaws.
- Narrow scope: they are often restricted to isolated niches like telecom routing or rigid system instructions, ignoring the vast, messy landscape of real enterprise work.
- Failure to reproduce real agentic interaction: this is the more critical issue. They rely on static prompts and pre-baked datasets, which allows AI labs to simply "train to the test." Models memorize benchmark data rather than demonstrating genuine, dynamic reasoning, tool orchestration, and failure recovery. When these models hit the unpredictable realities of production, they crash (limitations made evident by our first publicly available benchmark; more below).
The Solution: Mapping the Enterprise via Virtual Environments
To solve this, AutoBench Agentic completely abandons the static text blob. Instead, our multi-stage prompt-generation process produces highly varied, richly articulated tasks: we build a highly reliable, real-time virtual environment for LLMs to interact with.
Our tasks are built starting from high-level agentic instructions—mirroring the exact structures you would build in a ReAct, XML-tagged, OpenClaw, or Manus agentic infrastructure—all the way down to the granular minutiae of individual tool invocations and complex, adversarial "troll" mock responses.
This infrastructure enables us to map a vast range of business contexts, operator roles, task types, and agentic frameworks dynamically. This huge variety of cases is a true differentiator for AutoBench. Here is how we map the enterprise landscape on every single run:
- Operator Roles and Business Personas: Agentic behavior changes drastically depending on the assigned role. Our generation pipeline deterministically injects business-flavored personas into the Universal Intermediate Representation (UIR). In one iteration, the model might be instantiated as a "Senior Cloud DevOps Engineer" tasked with migrating server infrastructure; in the next, it is a "Cybersecurity Analyst" triaging firewall logs, or a "Junior Financial Analyst" resolving a customer billing dispute. The model must adapt its tool selection and communication style to the exact operator role required.
- Stateful "Memory Lines": Real agents don't operate in a vacuum—they operate mid-workflow. Our virtual environments inject context-aware "Memory Lines" representing the state of the world before the agent was invoked. A model might receive a prompt where the memory indicates: “The previous database query failed with a syntax error, and the customer is waiting on the line.” This maps to the realities of error recovery and adaptive replanning in production.
- Native Frameworks and Complex JSON Schemas: By shifting to a native UIR, we decouple the tools from the prose. We dynamically construct standard native tools[] JSON arrays that mirror the exact payload structures used by frameworks like LangChain, AutoGen, or native provider APIs. To ensure models aren't just guessing, we also inject randomized "distractor" tools, testing whether an agent can filter through noise to find the correct API.
- Real-Time Complications and Multi-Turn Execution: Using our new mock tool generator, the harness forces the model into multi-turn execution loops. We intentionally inject error conditions, API timeout mocks, and missing parameters. This evaluates the true hallmark of agentic intelligence: Gap Handling—rewarding honest deferral when inputs are missing rather than hallucinating an execution. A sketch of one such generated task payload follows this list.
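To make this concrete, here is a minimal sketch of what a single dynamically generated task payload could look like. The field names, persona, tools, and mock response below are illustrative assumptions rather than AutoBench's actual UIR schema; the tools[] entries simply follow the common JSON-schema function-calling format used by most provider APIs.

```python
# Illustrative sketch of one dynamically generated agentic task.
# All names and values are hypothetical; AutoBench's real UIR schema is not shown here.

task = {
    "persona": "Senior Cloud DevOps Engineer",            # operator role injected per run
    "domain": "Cloud Infrastructure",                      # one of the 10 business domains
    "memory_line": (                                       # stateful context before the agent is invoked
        "The previous database query failed with a syntax error, "
        "and the customer is waiting on the line."
    ),
    "tools": [                                             # native tools[] array, JSON-schema style
        {
            "name": "run_sql_query",                       # the tool the task actually needs
            "description": "Execute a read-only SQL query against the billing database.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "timeout_s": {"type": "integer"},
                },
                "required": ["query"],
            },
        },
        {
            "name": "restart_mail_server",                 # randomized distractor: irrelevant to the task
            "description": "Restart the corporate mail server.",
            "parameters": {"type": "object", "properties": {}},
        },
    ],
    "mock_responses": {                                    # harness-injected complication for the multi-turn loop
        "run_sql_query": {"error": "API timeout after 30s", "retryable": True},
    },
}
```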
Think about it: it takes 5 distinct phases to assemble an agentic task, 4 of which involve LLM calls. This is far more than automated text prompting; in effect, we built a complex agentic orchestrator of our own to deliver fully realistic and diverse tasks in a faithfully mimicked virtual agentic environment.
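As a rough, purely illustrative sketch of what such a multi-phase assembly could look like (the phase breakdown, the call_llm helper, and every payload below are hypothetical, not AutoBench's actual orchestrator):

```python
# Hypothetical sketch of a 5-phase task-assembly flow; names are illustrative only.
# Four of the five phases call an LLM, mirroring the ratio described above.
import json
import random

def call_llm(instruction: str) -> str:
    """Stand-in for a real LLM call; a production harness would hit a provider API here."""
    return f"<generated from: {instruction[:40]}...>"

def assemble_task(seed: int) -> dict:
    rng = random.Random(seed)
    # Phase 1 (no LLM): deterministically sample role, domain, and task type.
    context = {
        "role": rng.choice(["Senior Cloud DevOps Engineer", "Cybersecurity Analyst"]),
        "domain": rng.choice(["Cloud Infrastructure", "Finance"]),
        "task_type": rng.choice(["Tool Selection", "Parameter Complexity"]),
    }
    # Phases 2-5 (LLM-driven): scenario, tool schemas, adversarial mocks, final payload.
    scenario = call_llm(f"Write a business scenario for {json.dumps(context)}")
    tools = call_llm(f"Emit a tools[] JSON array (plus distractors) for: {scenario}")
    mocks = call_llm(f"Emit adversarial mock responses for these tools: {tools}")
    payload = call_llm(f"Compile the final task payload from: {scenario} {tools} {mocks}")
    return {"context": context, "scenario": scenario, "tools": tools, "mocks": mocks, "payload": payload}

print(json.dumps(assemble_task(seed=42), indent=2))
```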
Once the model navigates the environment, our collective-LLM-as-a-judge system evaluates the execution trace across 8 granular criteria, including Tool Fidelity, Multi-Step Orchestration, and Parameter Complexity.
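As a simplified illustration of how a pool of judges and several criteria can be combined into one score (the three criteria shown are taken from the list above, but the judge scores, equal weighting, and plain averaging scheme are assumptions, not necessarily AutoBench's actual aggregation):

```python
# Minimal sketch of collective-LLM-as-a-judge aggregation on a 1-to-5 scale.
# Criterion weighting and the plain mean are illustrative assumptions.
from statistics import mean

CRITERIA = ["Tool Fidelity", "Multi-Step Orchestration", "Parameter Complexity"]  # 3 of the 8 criteria

def aggregate(judge_scores: list[dict[str, float]]) -> float:
    """Average each criterion across the judge pool, then average across criteria."""
    per_criterion = {c: mean(j[c] for j in judge_scores) for c in CRITERIA}
    return mean(per_criterion.values())

# Example: three judge models scoring one execution trace.
judges = [
    {"Tool Fidelity": 4, "Multi-Step Orchestration": 3, "Parameter Complexity": 2},
    {"Tool Fidelity": 4, "Multi-Step Orchestration": 3, "Parameter Complexity": 3},
    {"Tool Fidelity": 3, "Multi-Step Orchestration": 3, "Parameter Complexity": 2},
]
print(round(aggregate(judges), 3))  # 3.0 overall on the 1-to-5 scale
```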
Scientific Validation: Un-gameable Yet Highly Correlated
Because our virtual environments change on every run, they are strictly un-gameable: the test set is generated at runtime. Yet our April 2026 run data confirms that this dynamic methodology correlates strongly with the industry's static standards:

We have achieved the precision of rigorous static benchmarks with the scalability and un-gameability of dynamic generation.
The April 2026 Agentic Run: Results & Analysis
With the methodology validated, here is how the top models performed when dropped into these dynamic, multi-turn virtual business environments.
The Saturation Myth: We Have a Long Way to Go
In standard benchmarks, we often see scores clustered near 90-95%, giving the illusion that agentic AI is a solved problem. AutoBench Agentic shatters this illusion. In our April run, all models scored in the 2.2–3.3 range (on a 1-to-5 scale).

A score of 3 signifies a "good" solution—it gets the job done, but it is far from truly robust and efficient (a score of 4), and even further away from true excellence (a score of 5). This proves we are very far away from saturating this benchmark. And if we ever do, our dynamic generation allows us to easily expand the complexity of the cases. For now, it clearly shows that current frontier models have a long way to go before they truly master agentic tasks at an excellent level.
- The King of the Hill: Anthropic Dominates Orchestration
Anthropic currently has a stranglehold on complex agentic orchestration. Claude-opus-4.7 is the undisputed King of the Hill, achieving the highest overall score of 3.295. Its ability to navigate multi-step workflows, filter out distractor tools, and map complex parameters to JSON schemas under pressure is unmatched. It is closely followed by its predecessor, Claude-opus-4.6, and Google's incredibly capable Gemini-3.1-pro-preview.
- The Smart Shopper: GLM-5.1 Breaks the Cost Curve
For the cost-conscious enterprise architect, GLM-5.1 is the clear "Smart Shopper" choice. Achieving an AutoBench score of 3.148, it comfortably runs with the heaviest proprietary models but at a deeply disruptive cost of just $0.005 per run—nearly 5x cheaper than Claude-opus-4.6. For scalable, high-volume agentic pipelines, this open-weight model is a massive contender.

- The Shopper's View: Cost vs. Performance
When evaluating models for production, absolute performance must be weighed against operational cost. As shown in our "Shopper" scatter plot mapping cost against AutoBench scores, the landscape is highly segmented. While Anthropic’s Claude 4.7 dominates the high-performance spectrum, its cost per run is significantly higher. Conversely, models like GLM-5.1, Mimo V2 Pro, and Qwen 3.6 Pro offer an incredible efficiency frontier, delivering top-tier agentic reasoning at a fraction of the cost. This visualization is crucial for architects deciding between a high-cost frontier model for complex edge cases versus an efficient workhorse for high-volume pipelines.
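For readers who want to reproduce this kind of view from the published leaderboard data, here is a minimal sketch of such a cost-vs-performance scatter plot; the model names and numbers in it are placeholders, not actual AutoBench results.

```python
# Minimal sketch of a cost-vs-performance "Shopper" scatter plot.
# Model names and values are placeholders; swap in real leaderboard data.
import matplotlib.pyplot as plt

models = {                      # {name: (cost_per_run_usd, autobench_score)}
    "model_a": (0.030, 3.30),
    "model_b": (0.005, 3.15),
    "model_c": (0.012, 3.05),
}

fig, ax = plt.subplots()
for name, (cost, score) in models.items():
    ax.scatter(cost, score)
    ax.annotate(name, (cost, score), textcoords="offset points", xytext=(5, 3))

ax.set_xscale("log")            # costs span orders of magnitude, so a log axis reads better
ax.set_xlabel("Cost per run (USD)")
ax.set_ylabel("AutoBench Agentic score (1-5)")
ax.set_title("Cost vs. performance")
plt.show()
```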
- Top 10 Models vs. Agentic Task Types
Because AutoBench Agentic generates tasks across 10 distinct agentic categories, we can pinpoint exactly where models excel and where they break down.

A closer look at this matrix reveals some sobering truths about the current state of agentic AI. Note how, in Parameter Complexity, no model reaches a score of 3 on average. Furthermore, even in foundational interactions like Single Tool Call and Tool Selection, frontier models are still struggling to achieve consistent excellence. This deep granularity proves that while models can orchestrate high-level plans, the minutiae of execution often derail them.
A Disclaimer on OpenAI Models:
You may notice anomalies regarding OpenAI's performance in this specific run. Due to the aggressive anti-distillation filters applied by OpenAI's API, a high number of their responses (ranging from 27% to 47%) returned the standard refusal: "Sorry, I can't answer this...". Because of this, OpenAI model performance in this iteration may be artificially depressed, as the models tripped safety and distillation filters rather than organically failing the reasoning loops.
Support & Acknowledgements
We extend our sincere gratitude to Translated for their generous support of the AutoBench project through the provision of valuable LLM compute credits. This support was instrumental in enabling the extensive evaluations conducted in this run.
We also want to express our deep appreciation to the following individuals for their extremely valuable support and insightful feedback throughout the development and execution of AutoBench:
- Translated and Marco Trombetti: For their continued support in compute resources and strategic insight.
- DIAG, University of Rome La Sapienza: The team led by Prof. Fabrizio Silvestri continues to provide the scientific rigor that validates our methodology.
- eZecute: For enabling the industrialization of this platform.
Join the Community
AutoBench is a step towards more robust, scalable, and future-proof LLM evaluation. While our first release, AutoBench 1.0, is fully open-source, please note that the current AutoBench 2.0 implementation—and its Agentic evolution powering this run—is proprietary and closed-source.
We invite you to explore the data, read our validation paper, and join the discussion:
- Explore the Leaderboard: Hugging Face Space
- Download the Data: AutoBench.org Archive
- Read the Validation Paper: arXiv:2510.22593
- Try our 1.0 Demo on Spaces: AutoBench 1.0 Demo
We strongly encourage the AI community to engage with the interactive leaderboard, explore the released data, and share feedback.