Claude Code Performance Degraded Before Opus 4.8 Release; Root Cause Traced to CLI Harness
Key Takeaways
- Daily production monitoring of Claude Code caught a statistically significant five-day performance drop that release-time benchmarks would have missed
- Root cause analysis traced the issue to Claude Code CLI versions 2.1.150 and 2.1.151, indicating a harness bug rather than a model regression
- The degradation showed up as a roughly 60% spike in tool calls per task combined with reduced input token usage, suggesting the agent was compensating for a change in harness behavior
Summary
Anthropic's continuous monitoring of Claude Code against SWE-Bench-Pro detected a significant performance degradation in the week leading up to the Opus 4.8 release. Claude Code's pass rate on the benchmark dropped well below its 65% baseline for five consecutive days in late May, then recovered immediately after Opus 4.8 shipped on May 28. Investigation indicated that the issue was likely caused by changes introduced in Claude Code CLI versions 2.1.150 and 2.1.151, not by a regression in the underlying model. During the degradation window, tool invocations per task rose by roughly 60% while input token usage dropped; both patterns reversed when Claude Code 2.1.153 was deployed alongside Opus 4.8. The discovery highlights the importance of continuous production monitoring for catching performance issues that standard release-time benchmarks cannot detect.
- Performance metrics returned to baseline immediately with the release of Claude Code 2.1.153 and Opus 4.8 on May 28, confirming the fix
- The incident underscores the value of continuous, production-grade monitoring for AI agents to detect silent performance regressions that launch-time benchmarks cannot capture
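
As an illustration of how a daily monitor might flag a sustained drop against a fixed baseline, here is a minimal sketch. It is not Anthropic's method, which the article does not describe. It assumes a one-sided exact binomial test per day and a simple consecutive-day rule. The task count, the daily pass counts, the significance threshold, and the three-day alert rule are all hypothetical values chosen for the example; only the 65% baseline comes from the article.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def flag_degraded_days(daily_passes, n_tasks, baseline, alpha=0.01):
    """Mark each day whose pass count is improbably low under the baseline rate."""
    return [binom_cdf(passes, n_tasks, baseline) < alpha for passes in daily_passes]


def longest_run(flags):
    """Length of the longest streak of consecutive True values."""
    best = current = 0
    for flagged in flags:
        current = current + 1 if flagged else 0
        best = max(best, current)
    return best


# Hypothetical inputs, NOT real measurements: 100 tasks per daily run and
# made-up pass counts. Only the 65% baseline is taken from the article.
N_TASKS = 100
BASELINE = 0.65
daily_passes = [66, 63, 49, 47, 50, 48, 46, 65]

flags = flag_degraded_days(daily_passes, N_TASKS, BASELINE)

# Hypothetical alert rule: three or more consecutive flagged days.
if longest_run(flags) >= 3:
    print(f"sustained drop: {longest_run(flags)} consecutive days below baseline")
```

Requiring several consecutive flagged days, rather than alerting on a single bad day, trades detection speed for fewer false alarms from ordinary run-to-run noise. A one-off launch benchmark has no such time series to compare against, which is the gap the article attributes to release-time evaluation.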


