Anthropic's Test-Time Scaling Framework Dramatically Boosts Claude-4.5-Opus on Coding Benchmarks
Key Takeaways
- Anthropic demonstrates a new test-time scaling framework tailored for long-horizon agentic coding tasks, departing from traditional methods designed for bounded outputs
- Claude-4.5-Opus achieves 77.6% accuracy on SWE-Bench Verified, a 6.7 percentage-point improvement when using the framework
- The approach converts agent rollouts into compact trajectory summaries that preserve salient hypotheses, progress, and failure modes while discarding noise
- Two complementary scaling methods (Recursive Tournament Voting for parallel scaling and adapted PDR for sequential scaling) enable agents to reuse and build on prior attempts
- The research frames test-time scaling for agents as fundamentally a problem of representation, selection, and experience reuse
Summary
Anthropic researchers have published a novel framework for test-time scaling specifically designed for agentic coding tasks. Unlike traditional test-time scaling methods optimized for bounded outputs, the framework handles the long-horizon trajectories of coding agents by converting each rollout into a structured summary that preserves key insights while discarding low-signal details. This representation-centric approach enables effective selection and reuse of prior agent experiences.
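The paper's exact summary schema is not reproduced here, but as a rough Python sketch, a structured trajectory summary of the kind described might look like the record below. All names (`TrajectorySummary`, `summarize_rollout`, the field names, and the `summarizer` callable) are illustrative assumptions, not Anthropic's implementation; the fields simply mirror the hypotheses, progress, and failure modes the article says the summaries preserve.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrajectorySummary:
    """Hypothetical compact record distilled from one long agent rollout."""
    task_id: str
    hypotheses: list[str] = field(default_factory=list)     # candidate root causes the agent explored
    progress: list[str] = field(default_factory=list)       # steps that verifiably worked (tests passing, files patched)
    failure_modes: list[str] = field(default_factory=list)  # dead ends to avoid on later attempts
    final_patch: str | None = None                          # the proposed diff, if the rollout produced one


def summarize_rollout(rollout_log: str, summarizer) -> TrajectorySummary:
    """Compress a raw rollout transcript into a TrajectorySummary.

    `summarizer` stands in for an LLM call that extracts the salient
    fields and drops low-signal detail such as tool spam and build logs.
    """
    return summarizer(rollout_log)
```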
The framework introduces two complementary inference-time scaling methods: Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons, and a sequential scaling approach adapted from Parallel-Distill-Refine (PDR) that conditions new rollouts on summaries from prior attempts. Together, these methods let agents learn from and build on previous attempts more effectively than independent multi-attempt sampling.
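A minimal sketch of both methods, reusing the hypothetical `TrajectorySummary` record from the sketch above: `recursive_tournament_vote` narrows a population of summaries via small-group comparisons, and `sequential_refine` conditions each new rollout on summaries of prior attempts. The `judge` and `run_agent` callables, the group size of 4, and all other interfaces are assumptions for illustration, not details from the paper.

```python
from typing import Callable, Sequence


def recursive_tournament_vote(
    candidates: Sequence[TrajectorySummary],
    judge: Callable[[Sequence[TrajectorySummary]], TrajectorySummary],
    group_size: int = 4,
) -> TrajectorySummary:
    """RTV-style parallel scaling: recursively narrow a population of
    rollout summaries through small-group comparisons until one remains."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one rollout summary")
    while len(pool) > 1:
        # Each judge call sees only a small group of compact summaries,
        # so individual comparisons stay cheap even for large populations.
        pool = [judge(pool[i : i + group_size]) for i in range(0, len(pool), group_size)]
    return pool[0]


def sequential_refine(task: str, run_agent, summarize, n_attempts: int = 3) -> TrajectorySummary:
    """PDR-style sequential scaling: condition each new rollout on the
    summaries of all prior attempts, then summarize the new rollout."""
    history: list[TrajectorySummary] = []
    for _ in range(n_attempts):
        rollout_log = run_agent(task, prior_summaries=history)
        history.append(summarize(rollout_log))
    return history[-1]
```

In this framing, both methods operate on compact summaries rather than raw trajectories, which is what keeps selection and reuse tractable over long-horizon rollouts.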
The results demonstrate the framework's practical impact: Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). These significant performance gains suggest that test-time scaling for long-horizon agents fundamentally hinges on better representing, selecting from, and reusing past experiences.
Editorial Opinion
This research represents a meaningful inflection point for AI agents. By recognizing that agentic coding requires a fundamentally different approach to test-time scaling, one centered on trajectory summarization and experience reuse, Anthropic has identified a powerful lever for improving long-horizon task performance. The sizable benchmark improvements suggest that this representation-first philosophy could become central to advances in autonomous agents well beyond code generation.