Anthropic Researchers Propose Test-Time Scaling Framework for Agentic Coding, Boosting Claude Performance on SWE-Bench
Key Takeaways
- Anthropic introduces a test-time scaling framework specifically designed for long-horizon agentic coding tasks, moving beyond existing methods optimized for short, bounded outputs
- The approach converts agent rollout trajectories into compact structured summaries that preserve critical information about hypotheses, progress, and failure modes for effective reuse
- Two inference-time scaling methods—Recursive Tournament Voting for parallel scaling and adapted Parallel-Distill-Refine for sequential scaling—enable substantial performance improvements
Summary
Anthropic researchers have published a new research paper on scaling test-time compute for agentic coding systems, introducing methods to improve long-horizon AI agents that perform complex software engineering tasks. The paper addresses a key limitation of existing test-time scaling approaches: while they work well for short, bounded outputs, they struggle with extended agent trajectories involving multiple actions, observations, and error states. The researchers propose a framework that converts each agent rollout into compact, structured summaries that preserve important information while discarding low-signal details, enabling more effective reuse of prior experience.
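The idea of distilling a rollout into a reusable summary can be sketched as a simple schema. The field names and the rule-based distillation below are purely illustrative assumptions, not the paper's actual format; the paper presumably uses a model-generated summary rather than fixed heuristics, but the shape of the information preserved (hypotheses, progress, failure modes) follows the description above.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutSummary:
    # Hypothetical schema: keep the high-signal state of a rollout,
    # drop raw actions/observations.
    task_id: str
    hypotheses: list = field(default_factory=list)    # what the agent believed
    progress: list = field(default_factory=list)      # edits made, tests passing
    failure_modes: list = field(default_factory=list) # errors hit, dead ends

def summarize(trajectory: dict) -> RolloutSummary:
    """Distill a raw step-by-step trajectory into a compact summary.

    Illustrative heuristic only: routes each step into one of the three
    buckets based on which key it carries.
    """
    summary = RolloutSummary(task_id=trajectory["task_id"])
    for step in trajectory["steps"]:
        if step.get("error"):
            summary.failure_modes.append(step["error"])
        elif step.get("edit"):
            summary.progress.append(step["edit"])
        elif step.get("note"):
            summary.hypotheses.append(step["note"])
    return summary
```

A summary like this is what later attempts would be conditioned on, rather than the full multi-thousand-token trajectory.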
The approach introduces two complementary scaling techniques: Recursive Tournament Voting (RTV) for parallel scaling, which narrows a pool of candidate solutions through iterative pairwise comparisons, and an adapted Parallel-Distill-Refine (PDR) method for sequential scaling that conditions new attempts on distilled summaries from previous rollouts. Tested on the SWE-Bench Verified and Terminal-Bench v2.0 benchmark suites, the framework substantially improves Claude models' coding performance. Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (with the mini-SWE-agent scaffold) and from 46.9% to 59.1% on Terminal-Bench v2.0 (with the Terminus 1 scaffold), demonstrating the effectiveness of representation-based scaling for agentic systems.
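The parallel-scaling side of the framework, RTV, can be sketched as a generic tournament reduction. The `judge` callback is a placeholder assumption standing in for whatever comparison the paper's method uses (e.g. a model judging two candidate patches); the recursion structure is the only thing this sketch commits to.

```python
def tournament_round(candidates, judge):
    """One round: compare candidates pairwise, keep each pair's winner."""
    winners = []
    for i in range(0, len(candidates) - 1, 2):
        a, b = candidates[i], candidates[i + 1]
        winners.append(a if judge(a, b) else b)
    if len(candidates) % 2 == 1:
        winners.append(candidates[-1])  # odd one out advances by bye
    return winners

def recursive_tournament_vote(candidates, judge):
    """Narrow a pool of candidate solutions to one via repeated rounds.

    judge(a, b) -> True if a should beat b. Placeholder for a pairwise
    comparison (e.g. an LLM judging two candidate patches).
    """
    while len(candidates) > 1:
        candidates = tournament_round(candidates, judge)
    return candidates[0]
```

Compared with a single all-at-once vote over N candidates, the tournament keeps each comparison small and bounded, which matters when the candidates are long agent trajectories rather than short answers.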
- Claude-4.5-Opus achieves 77.6% on SWE-Bench Verified and 59.1% on Terminal-Bench v2.0 using the framework, showing 6-12 percentage point improvements over baseline performance
Editorial Opinion
This research represents an important advancement in making AI coding agents more capable at scale. By reframing test-time scaling as fundamentally a problem of representation and reuse rather than just generating more attempts, Anthropic addresses a practical bottleneck in long-horizon agentic systems. The substantial performance gains—particularly the 12-point improvement on Terminal-Bench—suggest this approach could meaningfully accelerate AI's utility in real-world software development tasks.


