Anthropic's Claude Opus 4.7 Shows Marginal Improvements Over 4.6 in Code Generation Benchmark
Key Takeaways
- Opus 4.7 maintains identical test pass rates to earlier versions but demonstrates superior code quality across multiple dimensions including equivalence, footprint risk, and discipline scoring
- The new model is more cost-efficient than March Opus 4.6 ($8.11 vs $8.93 per task) while using fewer tokens and completing tasks faster, contradicting claims that 4.7 is more expensive
- Opus 4.7 produces zero high-footprint patches and more human-aligned code changes, positioning it as a more disciplined coder rather than a fundamentally smarter one
Summary
A comprehensive evaluation of Anthropic's Claude Opus 4.7 against two earlier snapshots of Opus 4.6 on 28 real-world code generation tasks from the Zod repository reveals that while the new model maintains the same test pass rate (12/28), it demonstrates meaningful improvements in code quality and efficiency metrics. The benchmark, conducted using a custom evaluation framework called Stet that scores patches on equivalence, code-review readiness, footprint risk, and discipline, found that Opus 4.7 produces more maintainable code with lower divergence from human-authored solutions, even though the March 19 version of Opus 4.6 cleared the code-review bar slightly more often.
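The article names Stet's four scoring dimensions but does not publish the framework's data model or weights. As a purely illustrative sketch (all names and value ranges here are assumptions, not Stet's actual API), the rubric might be represented like this:

```python
from dataclasses import dataclass

# Hypothetical model of a Stet-style patch score. The four dimensions come
# from the article; the types, ranges, and field names are invented for
# illustration only.
@dataclass
class PatchScore:
    equivalence: float      # 0-1: how closely the patch matches the human solution
    review_ready: bool      # would the patch clear a code-review bar as-is?
    footprint_risk: str     # "low" | "medium" | "high" diff footprint
    discipline: float       # 0-1: scope discipline (no drive-by changes)

def is_high_footprint(score: PatchScore) -> bool:
    """Flag the signal the article calls clearest: high-footprint patches."""
    return score.footprint_risk == "high"

# A disciplined, low-footprint patch of the kind attributed to Opus 4.7
# (which the benchmark found produced zero high-footprint patches).
patch = PatchScore(equivalence=0.9, review_ready=True,
                   footprint_risk="low", discipline=0.95)
print(is_high_footprint(patch))  # False
```

Separating footprint risk from equivalence, as this sketch does, matches the article's framing: a patch can pass tests while still touching far more of the codebase than a human reviewer would accept.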
On practical metrics, Opus 4.7 delivers better economics and speed than its predecessor: costing $8.11 per task versus $8.93 for March Opus 4.6, consuming 44.0M tokens versus 49.1M, and completing the full 28-task run in 1 hour 30 minutes versus 1 hour 36 minutes. The newer model demonstrates particularly strong performance on footprint risk—the clearest signal in the evaluation—with zero high-risk patches compared to multiple instances in both previous versions. Notably, the fresh Opus 4.6 proved cheapest per task ($6.65) but produced looser, less equivalent patches and required 2.3x longer to complete, suggesting that raw cost savings came at the expense of output quality.
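The per-task figures above imply the overall savings directly. A quick check (per-task cost and token figures taken from the article; the full-run total is derived, not quoted):

```python
# Economics quoted for Opus 4.7 vs the March 19 snapshot of Opus 4.6.
TASKS = 28
cost_47, cost_46_march = 8.11, 8.93      # USD per task
tokens_47, tokens_46_march = 44.0, 49.1  # millions of tokens, full run

cost_saving_pct = (cost_46_march - cost_47) / cost_46_march * 100
token_saving_pct = (tokens_46_march - tokens_47) / tokens_46_march * 100
run_cost_47 = TASKS * cost_47

print(f"{cost_saving_pct:.1f}% cheaper per task")       # 9.2% cheaper per task
print(f"{token_saving_pct:.1f}% fewer tokens")          # 10.4% fewer tokens
print(f"${run_cost_47:.2f} for the full 28-task run")   # $227.08 for the full 28-task run
```

So the headline numbers amount to roughly 9% lower cost and 10% fewer tokens per run, with a six-minute wall-clock advantage on top.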
Fresh Opus 4.6, by contrast, represents a clear regression from the March version: it used 28% fewer input tokens yet took 2.3x longer and produced lower-quality patches, pointing to underlying model degradation.
Editorial Opinion
This rigorous, real-world evaluation provides important nuance to the often-polarized discourse around Claude model quality. Rather than declaring categorical superiority or inferiority, the benchmark demonstrates that Opus 4.7's value lies in improved consistency and maintainability—practical virtues that matter more in production settings than raw benchmark scores. The finding that fresh Opus 4.6 underperforms its March predecessor while using fewer compute resources raises concerning questions about whether recent versions may have undergone unintended degradation, and underscores the importance of longitudinal evaluations beyond standard benchmarks.