Anthropic's Test-Time Scaling Framework Dramatically Boosts Claude-4.5-Opus on Coding Benchmarks
Key Takeaways
- Anthropic demonstrates a new test-time scaling framework tailored for long-horizon agentic coding tasks, departing from traditional methods designed for bounded outputs
- Claude-4.5-Opus achieves 77.6% accuracy on SWE-Bench Verified, a 6.7 percentage-point improvement when using the framework
- The approach converts agent rollouts into compact trajectory summaries that preserve salient hypotheses, progress, and failure modes while discarding noise
- Two complementary scaling methods (Recursive Tournament Voting for parallel scaling and adapted PDR for sequential scaling) enable agents to reuse and build on prior attempts
- The research frames test-time scaling for agents as fundamentally a problem of representation, selection, and experience reuse
Summary
Anthropic researchers have published a novel framework for test-time scaling specifically designed for agentic coding tasks. Unlike traditional test-time scaling methods optimized for bounded outputs, the framework handles the long-horizon trajectories of coding agents by converting each rollout into a structured summary that preserves key insights while discarding low-signal details. This representation-centric approach enables effective selection and reuse of prior agent experiences.
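The paper's exact summary schema is not reproduced here, but as a rough Python sketch, a structured trajectory summary of the kind described might look like the record below. All names (`TrajectorySummary`, `summarize_rollout`, the field names, and the `summarizer` callable) are illustrative assumptions, not Anthropic's implementation; the fields simply mirror the hypotheses, progress, and failure modes the article says the summaries preserve.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrajectorySummary:
    """Hypothetical compact record distilled from one long agent rollout."""
    task_id: str
    hypotheses: list[str] = field(default_factory=list)     # candidate root causes the agent explored
    progress: list[str] = field(default_factory=list)       # steps that verifiably worked (tests passing, files patched)
    failure_modes: list[str] = field(default_factory=list)  # dead ends to avoid on later attempts
    final_patch: str | None = None                          # the proposed diff, if the rollout produced one


def summarize_rollout(rollout_log: str, summarizer) -> TrajectorySummary:
    """Compress a raw rollout transcript into a TrajectorySummary.

    `summarizer` stands in for an LLM call that extracts the salient
    fields and drops low-signal detail such as tool spam and build logs.
    """
    return summarizer(rollout_log)
```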
The framework introduces two complementary inference-time scaling methods: Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons, and a sequential scaling approach adapted from Parallel-Distill-Refine (PDR) that conditions new rollouts on summaries from prior attempts. Together, these methods let agents learn from and build on previous attempts more effectively than independent multi-attempt sampling.
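A minimal sketch of both methods, reusing the hypothetical `TrajectorySummary` record from the sketch above: `recursive_tournament_vote` narrows a population of summaries via small-group comparisons, and `sequential_refine` conditions each new rollout on summaries of prior attempts. The `judge` and `run_agent` callables, the group size of 4, and all other interfaces are assumptions for illustration, not details from the paper.

```python
from typing import Callable, Sequence


def recursive_tournament_vote(
    candidates: Sequence[TrajectorySummary],
    judge: Callable[[Sequence[TrajectorySummary]], TrajectorySummary],
    group_size: int = 4,
) -> TrajectorySummary:
    """RTV-style parallel scaling: recursively narrow a population of
    rollout summaries through small-group comparisons until one remains."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one rollout summary")
    while len(pool) > 1:
        # Each judge call sees only a small group of compact summaries,
        # so individual comparisons stay cheap even for large populations.
        pool = [judge(pool[i : i + group_size]) for i in range(0, len(pool), group_size)]
    return pool[0]


def sequential_refine(task: str, run_agent, summarize, n_attempts: int = 3) -> TrajectorySummary:
    """PDR-style sequential scaling: condition each new rollout on the
    summaries of all prior attempts, then summarize the new rollout."""
    history: list[TrajectorySummary] = []
    for _ in range(n_attempts):
        rollout_log = run_agent(task, prior_summaries=history)
        history.append(summarize(rollout_log))
    return history[-1]
```

In this framing, both methods operate on compact summaries rather than raw trajectories, which is what keeps selection and reuse tractable over long-horizon rollouts.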
The results demonstrate the framework's practical impact: Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). These significant performance gains suggest that test-time scaling for long-horizon agents fundamentally hinges on better representing, selecting from, and reusing past experiences.
Editorial Opinion
This research represents a meaningful inflection point for AI agents. By recognizing that agentic coding requires a fundamentally different approach to test-time scaling, one centered on trajectory summarization and experience reuse, Anthropic has identified a powerful lever for improving long-horizon task performance. The sizable benchmark improvements suggest that this representation-first philosophy could become central to advances in autonomous agents well beyond code generation.