CHI-Bench: New Research Reveals Major Gaps in AI Agents' Healthcare Automation Capabilities
Key Takeaways
- CHI-Bench tests AI agents on realistic healthcare workflows with high-fidelity simulations of 20 healthcare applications and 87 MCP-based tools
- Best-performing agents resolved only 28% of tasks, revealing a substantial gap between current AI capabilities and production-ready automation
- AI agents struggle significantly with policy-dense, multi-role workflows requiring adherence to hundreds of operational rules and role transitions
Summary
Researchers have introduced CHI-Bench, a benchmark that tests AI agents' ability to automate complex, end-to-end healthcare workflows. It evaluates agents across three healthcare domains: provider prior authorization, payer utilization management, and care management. Tasks require agents to navigate a high-fidelity simulator featuring 20 healthcare applications and 87 specialized tools while adhering to an operations handbook of more than 1,290 documents.
The results reveal significant limitations in current AI agent capabilities. Across the 30 agent and model configurations tested, the best-performing agent resolved only 28% of tasks, and no agent reached a 20% success rate under the strict criteria. When agents had to handle multiple tasks sequentially in a single session, performance collapsed to just 3.8%, a steep drop that raises questions about production readiness.
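To build intuition for why chaining tasks can depress results so sharply, the sketch below computes how a per-task success rate compounds over a session. It rests on assumptions that do not come from the research: tasks succeed independently at the same rate, and a session counts as a success only if every task in it succeeds. The function name `all_tasks_succeed_rate` and the session lengths are hypothetical, and this is not CHI-Bench's scoring method.

```python
def all_tasks_succeed_rate(per_task_rate: float, tasks_per_session: int) -> float:
    """Chance that every task in a session succeeds.

    Assumes (for illustration only) that tasks are independent and share
    the same success rate; real sessions may violate both assumptions.
    """
    if not 0.0 <= per_task_rate <= 1.0:
        raise ValueError("per_task_rate must be between 0 and 1")
    if tasks_per_session < 1:
        raise ValueError("tasks_per_session must be at least 1")
    return per_task_rate ** tasks_per_session


if __name__ == "__main__":
    # 0.28 is the best single-task resolution rate reported in the article.
    # The session lengths 1..4 are hypothetical.
    for k in range(1, 5):
        print(f"{k} task(s): {all_tasks_succeed_rate(0.28, k):.1%}")
```

Under these assumptions, a 28% per-task rate falls below 10% once a session holds two tasks. The article does not say how many tasks a session contains or how sequential runs are scored, so this illustrates compounding failure in general rather than explaining the 3.8% figure.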
The research highlights three challenges that are underrepresented in AI benchmarking: policy density (agents must ground decisions in massive rule libraries), multi-role composition (tasks require agents to assume different roles with handoffs between them), and multilateral interaction (workflow steps involve complex multi-turn dialogs). The researchers hypothesize that similar capability gaps exist in other policy-dense, role-composed enterprise domains that require irreversible decisions.
- Sequential task performance collapsed to 3.8%, indicating fundamental limitations in maintaining context and consistency across complex workflows
- Findings suggest similar capability gaps likely exist in other complex enterprise domains requiring policy adherence and irreversible decision-making
Editorial Opinion
CHI-Bench's results deliver a necessary reality check to AI-in-healthcare enthusiasm. A 28% resolution rate on realistic workflows shows that we remain far from autonomous healthcare automation, and the collapse to 3.8% in sequential scenarios is particularly telling. Current AI agents lack the contextual consistency and policy adherence needed for production deployment. Healthcare systems considering AI-driven workflow automation should treat these findings as essential due diligence.



