BotBeat

RESEARCH · Anthropic · 2026-04-28

Anthropic's Test-Time Scaling Framework Dramatically Boosts Claude-4.5-Opus on Coding Benchmarks

Key Takeaways

  • Anthropic demonstrates a new test-time scaling framework tailored for long-horizon agentic coding tasks, departing from traditional methods designed for bounded outputs
  • Claude-4.5-Opus achieves 77.6% accuracy on SWE-Bench Verified, a 6.7 percentage-point improvement using the framework
  • The approach converts agent rollouts into compact trajectory summaries that preserve salient hypotheses, progress, and failure modes while discarding noise
Source: Hacker News, https://arxiv.org/abs/2604.16529

Summary

Anthropic researchers have published a novel framework for test-time scaling specifically designed for agentic coding tasks. Unlike traditional test-time scaling methods optimized for bounded outputs, the framework handles the long-horizon trajectories of coding agents by converting each rollout into a structured summary that preserves key insights while discarding low-signal details. This representation-centric approach enables effective selection and reuse of prior agent experiences.
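To make the idea of a structured summary concrete, here is a minimal sketch of what such a record could look like. The class name, field names, and text layout are assumptions chosen to mirror the three things the article says a summary preserves (hypotheses, progress, failure modes); the source does not describe the framework's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectorySummary:
    """Hypothetical compact record of one agent rollout (not the paper's schema)."""

    hypotheses: list[str] = field(default_factory=list)     # what the agent believed was wrong
    progress: list[str] = field(default_factory=list)       # steps that moved the task forward
    failure_modes: list[str] = field(default_factory=list)  # dead ends worth avoiding later

    def render(self) -> str:
        """Format the summary as plain text for a judge or a later rollout."""
        sections = [
            ("Hypotheses", self.hypotheses),
            ("Progress", self.progress),
            ("Failure modes", self.failure_modes),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are kept so every summary has the same shape.
            lines.extend(f"- {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)
```

Keeping every summary in the same shape is one plausible way to make rollouts comparable, since a selector then reads a few short lists instead of a full trajectory.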

The framework introduces two complementary inference-time scaling methods: Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons, and a sequential scaling approach adapted from Parallel-Distill-Refine (PDR) that conditions new rollouts on summaries from prior attempts. Together, these methods create a system where agents can learn from and build upon previous attempts more effectively than traditional multi-attempt approaches.
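The two methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `judge` and `run_agent` callables are hypothetical stand-ins (in practice each would presumably wrap a model call), and the group size, the handling of leftover groups, and the shape of the conditioning history are all choices made here for the sake of a runnable example.

```python
from typing import Callable, Sequence

# Hypothetical judge: given a small group of summaries, return the index of the best one.
Judge = Callable[[Sequence[str]], int]


def recursive_tournament_vote(
    summaries: Sequence[str], judge: Judge, group_size: int = 4
) -> str:
    """Narrow a population of rollout summaries to one winner via small-group comparisons."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool = list(summaries)
    while len(pool) > 1:
        winners = []
        for start in range(0, len(pool), group_size):
            group = pool[start:start + group_size]
            # A leftover group of one advances unopposed (an assumption of this sketch).
            winners.append(group[0] if len(group) == 1 else group[judge(group)])
        pool = winners
    return pool[0]


def sequential_refine(run_agent: Callable[[list[str]], str], rounds: int) -> list[str]:
    """Run rollouts one after another, each conditioned on summaries of prior attempts."""
    history: list[str] = []
    for _ in range(rounds):
        # The agent receives a copy of all earlier summaries and returns a new one.
        history.append(run_agent(list(history)))
    return history
```

As a toy usage, a judge that simply prefers the longest summary, `lambda group: max(range(len(group)), key=lambda i: len(group[i]))`, lets the tournament run end to end without a model. Each round shrinks the pool by roughly a factor of `group_size`, so the number of rounds grows only logarithmically with the number of rollouts.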

The results demonstrate the framework's practical impact: Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent), a gain of 6.7 percentage points, and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1), a gain of 12.2 percentage points. These gains suggest that test-time scaling for long-horizon agents hinges on better representing, selecting from, and reusing past experiences.

  • Two complementary scaling methods (Recursive Tournament Voting for parallel scaling and adapted PDR for sequential scaling) enable agents to reuse and build on prior attempts
  • The research frames test-time scaling for agents as fundamentally a problem of representation, selection, and experience reuse

Editorial Opinion

This research could mark an inflection point for AI agents. By recognizing that agentic coding requires a different approach to test-time scaling, one centered on trajectory summarization and experience reuse, Anthropic has identified a powerful lever for improving long-horizon task performance. The benchmark improvements suggest that this representation-first philosophy could become central to advances in autonomous agents in domains beyond code generation.

Large Language Models (LLMs) · Reinforcement Learning · AI Agents · Machine Learning
