CHI-Bench: New Healthcare Benchmark Shows AI Agents Fail 72% of Real-World Clinical Workflows
Key Takeaways
- Even the top-performing AI agent fails 72% of healthcare workflows: Claude Code with Opus 4.6 reached a 28% pass rate, and Codex with GPT-5.5 reached 21%
- No AI agent was reliably consistent: none maintained its performance when the same case was run three times, and in endurance testing on 25 consecutive cases the best system completed fewer than 4%
- End-to-end multi-agent workflows failed completely, with zero successful cases when different agents played different healthcare roles
Summary
actAVA.ai released CHI-Bench, the first open-source benchmark specifically designed to evaluate long-horizon AI agents on healthcare workflows. The benchmark tested 30 frontier AI agents from Anthropic, OpenAI, Google, xAI, DeepSeek, and Z.ai across 75 complex clinical workflows, including prior authorization requests, utilization reviews, and care management tasks. The results revealed a critical performance gap: the best performer, Anthropic's Claude Code with Opus 4.6, achieved only a 28% pass rate, meaning it failed roughly seven out of ten cases.
The CHI-Bench suite simulates real healthcare environments by routing agents through 21 healthcare applications using 200+ MCP tools and a 1,279-document operations handbook. Each trial runs an agent for 60-80 sequential steps across four to six clinical stages. Results showed severe limitations across the board: OpenAI's Codex with GPT-5.5 achieved a 21% pass rate, and domain-specific performance ranged from 29% (prior authorization) to 41% (utilization review). Most critically, reliability broke down in three ways. No agent maintained consistency when the same case was run three times. Under endurance testing with 25 consecutive cases, the best system completed fewer than 4%. And when different AI agents played different healthcare roles in end-to-end scenarios, zero tasks passed.
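The difference between a headline pass rate and consistency across repeated runs can be illustrated with a minimal Python sketch. The article does not describe how CHI-Bench actually scores trials, so everything here is an assumption made for illustration: the `summarize` function, the metric names `pass_any` and `pass_all`, the case IDs, and the pass/fail outcomes are all hypothetical, not benchmark data.

```python
from collections import defaultdict


def summarize(trials):
    """Summarize (case_id, passed) pairs where each case is run several times.

    Hypothetical helper for illustration; not CHI-Bench's scoring code.
    """
    by_case = defaultdict(list)
    for case_id, passed in trials:
        by_case[case_id].append(bool(passed))

    total_trials = sum(len(runs) for runs in by_case.values())
    total_passes = sum(sum(runs) for runs in by_case.values())
    n_cases = len(by_case)

    return {
        # Fraction of individual trials that passed.
        "pass_rate": total_passes / total_trials,
        # Fraction of cases that passed in at least one run.
        "pass_any": sum(any(runs) for runs in by_case.values()) / n_cases,
        # Fraction of cases that passed in every run (consistency).
        "pass_all": sum(all(runs) for runs in by_case.values()) / n_cases,
    }


# Made-up outcomes: four cases, three runs each.
trials = [
    ("pa-001", True), ("pa-001", False), ("pa-001", True),
    ("ur-002", False), ("ur-002", False), ("ur-002", False),
    ("cm-003", True), ("cm-003", True), ("cm-003", True),
    ("pa-004", False), ("pa-004", True), ("pa-004", False),
]

summary = summarize(trials)
print(summary)
```

With these made-up outcomes, half of all trials pass, yet only one case in four passes on every run. A gap of that kind is what a repeated-run consistency check is meant to expose.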
CHI-Bench was built by actAVA.ai with 20+ partner institutions, including Johns Hopkins, Stanford, and CMU. It is now open-source, with a live leaderboard at actava.ai/benchmarks.
Editorial Opinion
CHI-Bench delivers a sobering reality check for the AI industry's healthcare ambitions. While frontier models showcase impressive capabilities on general benchmarks, this healthcare-focused research reveals a stark gap between performance claims and real-world reliability. The failure rates, and particularly the inconsistency across repeated cases, show that deploying AI agents in healthcare requires far more than raw intelligence. Healthcare organizations should demand rigorous, domain-specific benchmarks like this one before considering AI deployment in clinical operations.


