Claude Opus Outperforms on OpenCode: Artificial Analysis Benchmark Data Reveals Performance Disparities Across Coding Harnesses
Key Takeaways
- Claude Opus 4.7 earns a higher composite score in the OpenCode harness than in the Claude Code harness across the SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA benchmarks
- Harness architecture affects results: the same underlying model shows measurable performance differences across coding-agent implementations (Cursor, Claude Code, OpenCode)
- Token efficiency and cost per task vary substantially across harnesses, which matters for production deployment decisions and API cost optimization
Summary
New benchmarking data from Artificial Analysis shows Claude Opus 4.7 scoring higher on the Artificial Analysis Coding Agent Index when run in OpenCode than in Claude Code. The composite index combines results from three coding benchmarks: SWE-Bench-Pro-Hard-AA (code generation, 150 tasks), Terminal-Bench v2 (agentic terminal use, 84 tasks), and SWE-Atlas-QnA (technical Q&A, 124 tasks). Reported metrics include pass@1 rates across three benchmark runs.
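To make the aggregation concrete, here is a minimal sketch of how a composite like this could be computed. The task counts come from the summary above. Everything else is an assumption for illustration: the pass counts are invented, and the unweighted mean across benchmarks is a guess, since the weighting Artificial Analysis uses is not stated here.

```python
from statistics import mean

# Task counts as reported in the summary.
TASK_COUNTS = {
    "SWE-Bench-Pro-Hard-AA": 150,
    "Terminal-Bench v2": 84,
    "SWE-Atlas-QnA": 124,
}


def pass_at_1(passes_per_run, n_tasks):
    """Average single-attempt pass rate over repeated benchmark runs."""
    return mean(p / n_tasks for p in passes_per_run)


def composite_index(passes_by_benchmark):
    """Unweighted mean of per-benchmark pass@1 (assumed aggregation)."""
    return mean(
        pass_at_1(runs, TASK_COUNTS[name])
        for name, runs in passes_by_benchmark.items()
    )


# Hypothetical pass counts for three runs per benchmark (not real results).
hypothetical_passes = {
    "SWE-Bench-Pro-Hard-AA": [75, 78, 72],
    "Terminal-Bench v2": [42, 42, 42],
    "SWE-Atlas-QnA": [62, 62, 62],
}

score = composite_index(hypothetical_passes)
print(f"composite: {score:.3f}")
```

Averaging pass rates per benchmark first, rather than pooling all 358 tasks, keeps the largest benchmark from dominating the composite. A task-weighted index would rank harnesses differently whenever their strengths differ by benchmark.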
The same underlying model performs differently depending on the coding-agent harness it runs in, including Cursor, Claude Code, and OpenCode. Beyond benchmark scores, the report tracks token usage: cached input, non-cached input, and output tokens all vary across implementations. A cost analysis based on current API pricing shows the tradeoff between benchmark performance and token efficiency, with some harnesses achieving better results while keeping token overhead low.
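The arithmetic behind a per-task cost comparison can be sketched as follows. The `TokenUsage` and `Pricing` types, the prices, and the token counts are all hypothetical placeholders, not figures from the report or from any provider's price list. The sketch also ignores cache-write surcharges, which some APIs bill separately.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    uncached_input: int
    cached_input: int
    output: int


@dataclass
class Pricing:
    # USD per million tokens; hypothetical values only.
    uncached_input: float
    cached_input: float
    output: float


def cost_usd(usage: TokenUsage, price: Pricing) -> float:
    """Cost of one task: each token class billed at its own rate."""
    total = (
        usage.uncached_input * price.uncached_input
        + usage.cached_input * price.cached_input
        + usage.output * price.output
    )
    return total / 1_000_000


def cache_hit_rate(usage: TokenUsage) -> float:
    """Share of input tokens that were served from the prompt cache."""
    total_input = usage.uncached_input + usage.cached_input
    return usage.cached_input / total_input if total_input else 0.0


PRICE = Pricing(uncached_input=10.0, cached_input=1.0, output=50.0)

# Same total input and output; only the cached share differs.
warm = TokenUsage(uncached_input=200_000, cached_input=1_800_000, output=40_000)
cold = TokenUsage(uncached_input=1_000_000, cached_input=1_000_000, output=40_000)

for label, usage in (("warm cache", warm), ("cold cache", cold)):
    print(f"{label}: hit rate {cache_hit_rate(usage):.0%}, "
          f"cost ${cost_usd(usage, PRICE):.2f}")
```

With these placeholder numbers, the two runs process identical token volumes, yet the run with the lower cached share costs more than twice as much. That is why a harness's caching behavior can matter as much as its raw token count.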
The data highlights that prompt caching behavior varies materially depending on provider routing and backend replica consistency, affecting both real-world costs and effective performance. These findings underscore the importance of evaluating AI coding agents not just on raw benchmark scores but also on practical metrics like token consumption, cost per task, and harness-specific optimizations.
Editorial Opinion
These benchmarks highlight an often-overlooked reality in AI tooling: raw model capability is only part of the equation. Two coding agents using the same underlying model can deliver markedly different performance and efficiency depending on harness architecture, prompt caching strategy, and token management. For teams evaluating Anthropic's Claude for coding tasks, this data suggests that choice of deployment harness (OpenCode, Claude Code, or others) can be as consequential as model selection itself. The variance in prompt cache hit rates also reveals a practical constraint: even with identical workflows, infrastructure choices and provider routing can shift costs and performance in ways that third-party benchmarks may not fully capture.

