BotBeat
Anthropic · RESEARCH · 2026-04-22

Anthropic Researchers Propose Test-Time Scaling Framework for Agentic Coding, Boosting Claude Performance on SWE-Bench

Key Takeaways

  • Anthropic introduces a test-time scaling framework specifically designed for long-horizon agentic coding tasks, moving beyond existing methods optimized for short, bounded outputs
  • The approach converts agent rollout trajectories into compact structured summaries that preserve critical information about hypotheses, progress, and failure modes for effective reuse
  • Two inference-time scaling methods—Recursive Tournament Voting for parallel scaling and adapted Parallel-Distill-Refine for sequential scaling—enable substantial performance improvements
Source: Hacker News (https://arxiv.org/abs/2604.16529)

Summary

Anthropic researchers have published a new research paper on scaling test-time compute for agentic coding systems, introducing methods to improve long-horizon AI agents that perform complex software engineering tasks. The paper addresses a key limitation of existing test-time scaling approaches: while they work well for short, bounded outputs, they struggle with extended agent trajectories involving multiple actions, observations, and error states. The researchers propose a framework that converts each agent rollout into compact, structured summaries that preserve important information while discarding low-signal details, enabling more effective reuse of prior experience.
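The compact structured summary described above can be sketched as a small data structure. This is a hypothetical schema under assumed field names and step kinds; the paper's actual summary format may differ:

```python
from dataclasses import dataclass, field

@dataclass
class RolloutSummary:
    """Compact record of one agent rollout, keeping only high-signal fields."""
    hypothesis: str                                    # the agent's latest working hypothesis
    progress: list[str] = field(default_factory=list)  # milestones reached
    failures: list[str] = field(default_factory=list)  # observed error modes

def summarize(trajectory: list[dict]) -> RolloutSummary:
    """Distill a raw step-by-step trajectory into a summary,
    discarding low-signal steps such as routine tool output."""
    summary = RolloutSummary(hypothesis="")
    for step in trajectory:
        kind, text = step["kind"], step["text"]
        if kind == "hypothesis":
            summary.hypothesis = text      # keep only the most recent hypothesis
        elif kind == "milestone":
            summary.progress.append(text)
        elif kind == "error":
            summary.failures.append(text)
        # all other step kinds (e.g. raw logs) are dropped as low-signal
    return summary
```

The key design point is lossy compression: a later attempt can be conditioned on a few hundred tokens of summary instead of the full multi-thousand-token trajectory.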

The approach introduces two complementary scaling techniques: Recursive Tournament Voting (RTV) for parallel scaling, which narrows candidate solutions through iterative comparisons, and an adapted Parallel-Distill-Refine (PDR) method for sequential scaling that conditions new attempts on distilled summaries from previous rollouts. Evaluated on the SWE-Bench Verified and Terminal-Bench v2.0 benchmark suites, the framework substantially improves Claude models' coding performance. Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified with the mini-SWE-agent scaffold, and from 46.9% to 59.1% on Terminal-Bench v2.0 with the Terminus 1 agent, demonstrating the effectiveness of representation-based scaling for agentic systems.

  • Claude-4.5-Opus achieves 77.6% on SWE-Bench Verified and 59.1% on Terminal-Bench v2.0 using the framework, showing 6-12 percentage point improvements over baseline performance
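As a rough illustration of the two techniques, the control flow might look like the following sketch. The function names, the `judge` and `distill` callbacks, and the round/width parameters are illustrative assumptions, not the paper's actual API:

```python
import random

def tournament_vote(candidates, judge, group_size=2):
    """Parallel scaling (RTV-style): repeatedly compare candidates in
    small groups and advance each group's winner until one remains."""
    pool = list(candidates)
    while len(pool) > 1:
        random.shuffle(pool)  # vary pairings between rounds
        pool = [judge(pool[i:i + group_size])
                for i in range(0, len(pool), group_size)]
    return pool[0]

def parallel_distill_refine(generate, distill, rounds=3, width=4):
    """Sequential scaling (PDR-style): each round generates `width` fresh
    attempts conditioned on distilled summaries of the previous round,
    then distills those attempts into context for the next round."""
    context, attempts = [], []
    for _ in range(rounds):
        attempts = [generate(context) for _ in range(width)]
        context = [distill(a) for a in attempts]
    return attempts
```

In practice the `judge` would itself be a model comparing candidate patches and `distill` would produce the structured rollout summaries the paper describes; here any deterministic functions (e.g. `max` as the judge) suffice to exercise the control flow.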

Editorial Opinion

This research represents an important advancement in making AI coding agents more capable at scale. By reframing test-time scaling as fundamentally a problem of representation and reuse rather than just generating more attempts, Anthropic addresses a practical bottleneck in long-horizon agentic systems. The substantial performance gains—particularly the 12-point improvement on Terminal-Bench—suggest this approach could meaningfully accelerate AI's utility in real-world software development tasks.

Tags: Large Language Models (LLMs) · Reinforcement Learning · AI Agents
