GPT-5.5 Outperforms Opus 4.7 in Real-World Coding Benchmark, Though Design Trade-offs Persist
Key Takeaways
- GPT-5.5 is the quality leader across 56 real coding tasks, passing more tests and achieving roughly 3x the code review acceptance rate of Opus 4.7
- Opus 4.7 excels at producing smaller, more disciplined patches but sometimes leaves implementation work incomplete, a trade-off whose value depends on a team's workflow
- GPT-5.5 also leads on efficiency, using fewer tokens and less wall-clock time than competing models across both repos
Summary
An independent benchmark of 56 real coding tasks from two open-source repositories (Zod and graphql-go-tools) shows GPT-5.5 as the clear quality leader, passing more tests and surviving code review approximately three times as often as Anthropic's Opus 4.7. The evaluation, conducted by researcher bisonbear using a custom framework called Stet, reveals important nuances: while GPT-5.5 excels at producing production-ready patches, Opus 4.7 consistently generates smaller diffs—a trade-off that matters significantly depending on a team's workflow priorities.
The benchmark evaluated models in their native environments: OpenAI models in OpenAI's CLI and Anthropic models in Claude Code. Across the 56 scored tasks, GPT-5.5 demonstrated superior correctness, lower risk of introducing bugs, and better scope discipline. However, the two repositories showed different patterns: on Zod, GPT-5.5 and Opus 4.7 tied on raw test passes but diverged in code review acceptance; on graphql-go-tools, GPT-5.5 won decisively, with Opus 4.7's "small patch" strategy sometimes failing to implement required integration work.
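To make the scoring concrete, here is a minimal sketch of how per-task outcomes could be rolled up into the headline metrics above (test pass rate, review acceptance rate, diff size). The field names and structure are illustrative assumptions, not the actual Stet framework.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class TaskResult:
    """One model's outcome on one benchmark task (illustrative fields only)."""
    task_id: str
    repo: str               # e.g. "zod" or "graphql-go-tools"
    tests_passed: bool      # did the repo's test suite pass after the patch?
    review_accepted: bool   # did the patch survive human code review?
    diff_lines: int         # added + removed lines in the patch

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Roll per-task outcomes up into the metrics discussed above."""
    n = len(results)
    return {
        "test_pass_rate": sum(r.tests_passed for r in results) / n,
        "review_acceptance_rate": sum(r.review_accepted for r in results) / n,
        "median_diff_lines": median(r.diff_lines for r in results),
    }

def summarize_by_repo(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Per-repo breakdown, mirroring the Zod vs. graphql-go-tools split."""
    repos = {r.repo for r in results}
    return {repo: summarize([r for r in results if r.repo == repo]) for repo in repos}
```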
GPT-5.5 also led in efficiency metrics, using fewer input and output tokens and shorter execution times than both Opus 4.7 and the older GPT-5.4. While GPT-5.4 remains the cost leader due to lower pricing, the benchmark suggests that per-dollar value depends on whether teams prioritize patch minimalism or task completion. The research underscores a broader principle: public benchmarks often obscure the real-world trade-offs that matter for specific codebases, making repository-specific evaluations essential for teams choosing between competing AI coding agents.
- Repository-specific benchmarking surfaces context that aggregate benchmarks obscure: Zod showed genuine trade-offs between the models, while graphql-go-tools showed GPT-5.5 winning because its tasks required integration work
- The choice between models depends on whether a team's bottleneck is code review throughput or the risk that comes with larger patch footprints
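One way to put the per-dollar question in concrete terms is cost per review-accepted patch rather than raw cost per task: a cheaper model that completes fewer tasks can end up more expensive per usable patch. The figures below are invented purely to show the arithmetic; none of them come from the benchmark.

```python
def cost_per_accepted_patch(
    input_tokens: int,
    output_tokens: int,
    accepted_patches: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Effective cost of one review-accepted patch, given total token usage.

    All arguments are placeholders: real prices come from the provider's
    rate card and real token counts from the benchmark logs.
    """
    total_cost = (input_tokens / 1e6) * usd_per_1m_input + (output_tokens / 1e6) * usd_per_1m_output
    return total_cost / accepted_patches

# Hypothetical numbers only: the cheaper model wins on raw spend but loses on
# cost per accepted patch because fewer of its patches survive review.
cheap = cost_per_accepted_patch(5_000_000, 1_000_000, 20, usd_per_1m_input=1.0, usd_per_1m_output=4.0)
strong = cost_per_accepted_patch(4_000_000, 800_000, 45, usd_per_1m_input=2.5, usd_per_1m_output=10.0)
print(f"cheap: ${cheap:.2f}/patch, strong: ${strong:.2f}/patch")  # cheap: $0.45, strong: $0.40
```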
Editorial Opinion
This benchmark is a valuable counterweight to abstract model comparisons that flatten performance into single numbers. The finding that GPT-5.5 and Opus 4.7 represent genuinely different trade-offs—not a simple quality hierarchy—is the most important takeaway. For developers evaluating coding agents, this suggests that real-world performance depends heavily on codebase characteristics (TypeScript schema libraries vs. Go federation engines) and team standards (review speed vs. diff minimalism). The work also validates the importance of building custom evaluation frameworks for AI tools, a practice that should become standard as these tools become more consequential in production workflows.
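In that spirit, the core of a repository-specific evaluation can be quite small: reset a checkout, apply the agent's patch, and run the project's own test suite. The sketch below is a hypothetical minimal harness, not the Stet framework; the test commands are assumptions about how each repo is typically tested.

```python
import subprocess
from pathlib import Path

def patch_passes_tests(repo_dir: Path, patch_file: Path, test_command: list[str]) -> bool:
    """Apply a model-generated patch to a clean checkout and run the repo's tests.

    Deliberately minimal: a real harness would also record diff size, review
    outcomes, token usage, and wall-clock time. Example test commands might be
    ["npm", "test"] for Zod or ["go", "test", "./..."] for graphql-go-tools.
    """
    subprocess.run(["git", "-C", str(repo_dir), "reset", "--hard"], check=True)
    subprocess.run(["git", "-C", str(repo_dir), "apply", str(patch_file)], check=True)
    return subprocess.run(test_command, cwd=repo_dir).returncode == 0
```

Scoring against the project's own test suite rather than a synthetic rubric is what makes this kind of evaluation repository-specific, and it is cheap enough to rerun whenever a new model version ships.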



