EnterpriseBench: CoreCraft Benchmark Reveals Frontier AI Agents Struggle in Complex Enterprise Scenarios
Key Takeaways
- Frontier AI models (GPT-5.2, Claude Opus 4.6) solve only 30-40% of realistic enterprise tasks in the CoreCraft benchmark, with nearly 60% of problems remaining unsolved even at maximum reasoning effort
- CoreCraft simulates a complex enterprise environment with 2,500+ entities, 23 tools, and 14 entity types, forcing agents to actively discover information rather than receiving it in context
- Current AI agents frequently fail in dangerous ways, including processing hallucinated refunds, entering infinite loops, and leaking PII into the wrong channels
Summary
Workforce has introduced EnterpriseBench, a new suite of reinforcement learning environment benchmarks designed to test AI agents beyond conversational tasks and measure their ability to function in realistic enterprise settings. The first release, CoreCraft, simulates a high-growth e-commerce hardware startup with over 2,500 interconnected entities, 14 entity types, and 23 tools that agents must navigate to complete complex customer support tasks.
The benchmark reveals significant limitations in current frontier AI models. Even the most advanced systems, such as GPT-5.2 and Claude Opus 4.6, struggle: they solve only around 30% of CoreCraft tasks at standard settings, and GPT-5.2 at maximum reasoning effort barely exceeds a 40% success rate. Models frequently failed in concerning ways, including processing hallucinated refunds, entering infinite loops, and accidentally leaking personally identifiable information.
Unlike traditional benchmarks that provide all necessary context upfront or operate on static datasets, CoreCraft forces agents to actively discover information from messy databases, maintain persistence through errors, and adhere to strict enterprise policies. The environment includes Slack messages, incomplete records, customer histories, and various enterprise tools that must be used correctly and in proper sequence. However, the research shows that training on CoreCraft data significantly improves agentic reasoning capabilities, boosting performance on both in-distribution tasks and external benchmarks, suggesting a path forward for developing more capable enterprise AI agents.
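The discovery-first pattern described above can be illustrated with a minimal toy sketch: the agent receives only a task, must look up state through tools (which can fail), and must act within policy rather than inventing outcomes. All class, tool, and record names below are hypothetical illustrations, not CoreCraft's actual interface.

```python
# Toy sketch of a discovery-first agent task, in the spirit of CoreCraft-style
# benchmarks. Everything here (Environment, lookup_order, issue_refund, the
# record IDs) is an illustrative assumption, not the benchmark's real API.

from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy enterprise environment: records must be looked up, not given."""
    orders: dict = field(default_factory=lambda: {
        "ord_1001": {"customer": "cust_42", "status": "delivered", "refundable": False},
    })
    audit_log: list = field(default_factory=list)

    def lookup_order(self, order_id: str) -> dict:
        # Messy reality: lookups can fail, and the agent must handle that.
        if order_id not in self.orders:
            raise KeyError(f"no such order: {order_id}")
        return self.orders[order_id]

    def issue_refund(self, order_id: str) -> str:
        # Policy gate: refunding a non-refundable order is a hard failure,
        # mirroring the "hallucinated refund" failure mode the article flags.
        order = self.lookup_order(order_id)
        if not order["refundable"]:
            self.audit_log.append(("refund_denied", order_id))
            return "denied: order is outside the refund policy"
        self.audit_log.append(("refund_issued", order_id))
        return "refund issued"


def handle_refund_request(env: Environment, order_id: str) -> str:
    """A compliant agent discovers state first, then acts within policy."""
    try:
        order = env.lookup_order(order_id)
    except KeyError:
        # Persist through errors rather than inventing data.
        return "escalate: order not found"
    if not order["refundable"]:
        # Act only on verified state; do not fabricate a refund.
        return "explain policy to customer"
    return env.issue_refund(order_id)


env = Environment()
print(handle_refund_request(env, "ord_1001"))  # explain policy to customer
print(handle_refund_request(env, "ord_9999"))  # escalate: order not found
```

The point of the sketch is the ordering: verify, then act. The failure modes the benchmark reports (hallucinated refunds, loops, leaks) correspond to agents skipping the verification step or ignoring the policy gate.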
Editorial Opinion
EnterpriseBench: CoreCraft represents a critical reality check for the AI agent industry. While vendors increasingly promise AI systems that can autonomously handle complex business operations, this benchmark reveals that even the most advanced frontier models are nowhere near ready for unsupervised enterprise deployment. The fact that GPT-5.2 at maximum reasoning effort solves less than half of realistic customer support scenarios—and does so while occasionally hallucinating refunds or leaking sensitive data—should give pause to any organization considering deploying AI agents in production without extensive human oversight. This benchmark arrives at a crucial moment when the gap between marketing promises and actual capabilities threatens to undermine trust in practical AI applications.