Anthropic Researchers Propose Test-Time Scaling Framework for Agentic Coding, Boosting Claude Performance on SWE-Bench
Key Takeaways
- Anthropic introduces a test-time scaling framework specifically designed for long-horizon agentic coding tasks, moving beyond existing methods optimized for short, bounded outputs
- The approach converts agent rollout trajectories into compact structured summaries that preserve critical information about hypotheses, progress, and failure modes for effective reuse
- Two inference-time scaling methods—Recursive Tournament Voting for parallel scaling and adapted Parallel-Distill-Refine for sequential scaling—enable substantial performance improvements
Summary
Anthropic researchers have published a new research paper on scaling test-time compute for agentic coding systems, introducing methods to improve long-horizon AI agents that perform complex software engineering tasks. The paper addresses a key limitation of existing test-time scaling approaches: while they work well for short, bounded outputs, they struggle with extended agent trajectories involving multiple actions, observations, and error states. The researchers propose a framework that converts each agent rollout into compact, structured summaries that preserve important information while discarding low-signal details, enabling more effective reuse of prior experience.
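The idea of distilling a rollout into a reusable summary can be sketched as a simple schema. The field names and the rule-based distillation below are purely illustrative assumptions, not the paper's actual format; the paper presumably uses a model-generated summary rather than fixed heuristics, but the shape of the information preserved (hypotheses, progress, failure modes) follows the description above.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutSummary:
    # Hypothetical schema: keep the high-signal state of a rollout,
    # drop raw actions/observations.
    task_id: str
    hypotheses: list = field(default_factory=list)    # what the agent believed
    progress: list = field(default_factory=list)      # edits made, tests passing
    failure_modes: list = field(default_factory=list) # errors hit, dead ends

def summarize(trajectory: dict) -> RolloutSummary:
    """Distill a raw step-by-step trajectory into a compact summary.

    Illustrative heuristic only: routes each step into one of the three
    buckets based on which key it carries.
    """
    summary = RolloutSummary(task_id=trajectory["task_id"])
    for step in trajectory["steps"]:
        if step.get("error"):
            summary.failure_modes.append(step["error"])
        elif step.get("edit"):
            summary.progress.append(step["edit"])
        elif step.get("note"):
            summary.hypotheses.append(step["note"])
    return summary
```

A summary like this is what later attempts would be conditioned on, rather than the full multi-thousand-token trajectory.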
The approach introduces two complementary scaling techniques: Recursive Tournament Voting (RTV) for parallel scaling, which narrows a pool of candidate solutions through iterative pairwise comparisons, and an adapted Parallel-Distill-Refine (PDR) method for sequential scaling that conditions new attempts on distilled summaries from previous rollouts. Tested on the SWE-Bench Verified and Terminal-Bench v2.0 benchmark suites, the framework substantially improves Claude models' coding performance. Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (with the mini-SWE-agent scaffold) and from 46.9% to 59.1% on Terminal-Bench v2.0 (with the Terminus 1 scaffold), demonstrating the effectiveness of representation-based scaling for agentic systems.
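The parallel-scaling side of the framework, RTV, can be sketched as a generic tournament reduction. The `judge` callback is a placeholder assumption standing in for whatever comparison the paper's method uses (e.g. a model judging two candidate patches); the recursion structure is the only thing this sketch commits to.

```python
def tournament_round(candidates, judge):
    """One round: compare candidates pairwise, keep each pair's winner."""
    winners = []
    for i in range(0, len(candidates) - 1, 2):
        a, b = candidates[i], candidates[i + 1]
        winners.append(a if judge(a, b) else b)
    if len(candidates) % 2 == 1:
        winners.append(candidates[-1])  # odd one out advances by bye
    return winners

def recursive_tournament_vote(candidates, judge):
    """Narrow a pool of candidate solutions to one via repeated rounds.

    judge(a, b) -> True if a should beat b. Placeholder for a pairwise
    comparison (e.g. an LLM judging two candidate patches).
    """
    while len(candidates) > 1:
        candidates = tournament_round(candidates, judge)
    return candidates[0]
```

Compared with a single all-at-once vote over N candidates, the tournament keeps each comparison small and bounded, which matters when the candidates are long agent trajectories rather than short answers.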
- Claude-4.5-Opus achieves 77.6% on SWE-Bench Verified and 59.1% on Terminal-Bench v2.0 using the framework, showing 6-12 percentage point improvements over baseline performance
Editorial Opinion
This research represents an important advancement in making AI coding agents more capable at scale. By reframing test-time scaling as fundamentally a problem of representation and reuse rather than just generating more attempts, Anthropic addresses a practical bottleneck in long-horizon agentic systems. The substantial performance gains—particularly the 12-point improvement on Terminal-Bench—suggest this approach could meaningfully accelerate AI's utility in real-world software development tasks.


