BotBeat
RESEARCH | Anthropic | 2026-04-28

Anthropic's Test-Time Scaling Framework Dramatically Boosts Claude-4.5-Opus on Coding Benchmarks

Key Takeaways

  • Anthropic demonstrates a new test-time scaling framework tailored for long-horizon agentic coding tasks, departing from traditional methods designed for bounded outputs
  • With the framework, Claude-4.5-Opus reaches 77.6% on SWE-Bench Verified, up 6.7 percentage points from 70.9%
  • The approach converts agent rollouts into compact trajectory summaries that preserve salient hypotheses, progress, and failure modes while discarding noise
Source: Hacker News, https://arxiv.org/abs/2604.16529

Summary

Anthropic researchers have published a novel framework for test-time scaling specifically designed for agentic coding tasks. Unlike traditional test-time scaling methods optimized for bounded outputs, the framework handles the long-horizon trajectories of coding agents by converting each rollout into a structured summary that preserves key insights while discarding low-signal details. This representation-centric approach enables effective selection and reuse of prior agent experiences.
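To make the idea of a structured rollout summary concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `TrajectorySummary` class, the `summarize` function, and the `(kind, text)` step format are all hypothetical, and in practice the summary would presumably be written by a model rather than filtered by a simple rule. The three retained fields mirror the hypotheses, progress, and failure modes mentioned in the takeaways.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectorySummary:
    """Hypothetical compact record of one agent rollout."""
    hypotheses: list[str] = field(default_factory=list)
    progress: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as plain text, e.g. for use in a later prompt."""
        sections = [
            ("Hypotheses", self.hypotheses),
            ("Progress", self.progress),
            ("Failure modes", self.failure_modes),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


def summarize(steps: list[tuple[str, str]]) -> TrajectorySummary:
    """Keep only high-signal steps from a rollout.

    Assumption: each step is a (kind, text) pair. Kinds other than
    "hypothesis", "progress", and "failure" are treated as noise
    (for example raw tool output) and dropped.
    """
    summary = TrajectorySummary()
    buckets = {
        "hypothesis": summary.hypotheses,
        "progress": summary.progress,
        "failure": summary.failure_modes,
    }
    for kind, text in steps:
        if kind in buckets:
            buckets[kind].append(text)
    return summary


rollout = [
    ("hypothesis", "bug is in the tokenizer, not the parser"),
    ("tool_output", "... 1000 lines of test logs ..."),
    ("progress", "patched tokenizer edge case"),
    ("failure", "test_unicode still fails after patch"),
]
print(summarize(rollout).render())
```

The point of the sketch is only that a long trajectory collapses into a short record that later selection or reuse steps can read cheaply.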

The framework introduces two complementary inference-time scaling methods: Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons, and a sequential scaling approach adapted from Parallel-Distill-Refine (PDR) that conditions new rollouts on summaries from prior attempts. Together, these methods create a system where agents can learn from and build upon previous attempts more effectively than traditional multi-attempt approaches.
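The two methods described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the `judge` and `run_agent` callables stand in for model calls and are hypothetical, as are the `group_size` and `votes` parameters and the majority-vote rule within each group. The article does not specify those details.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge: given a small group of summaries, returns the index of the best one.
Judge = Callable[[Sequence[str]], int]
# Hypothetical agent: given a task and prior summaries, returns a new rollout summary.
Agent = Callable[[str, Sequence[str]], str]


def recursive_tournament_vote(
    summaries: Sequence[str],
    judge: Judge,
    group_size: int = 4,
    votes: int = 3,
    seed: int = 0,
) -> str:
    """Narrow a population of rollout summaries to one winner.

    Each round splits the pool into small groups, asks the judge `votes`
    times per group, and advances each group's most-voted summary.
    """
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rng = random.Random(seed)
    pool = list(summaries)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = []
        for start in range(0, len(pool), group_size):
            group = pool[start:start + group_size]
            if len(group) == 1:
                winners.append(group[0])  # a lone summary advances unopposed
                continue
            tally = [0] * len(group)
            for _ in range(votes):
                tally[judge(group)] += 1
            winners.append(group[tally.index(max(tally))])
        pool = winners
    return pool[0]


def sequential_refine(task: str, run_agent: Agent, rounds: int = 3) -> list[str]:
    """Run rollouts one after another, conditioning each on all prior summaries."""
    history: list[str] = []
    for _ in range(rounds):
        history.append(run_agent(task, tuple(history)))
    return history


# Toy stand-ins so the sketch runs without a model.
def longest_judge(group: Sequence[str]) -> int:
    return max(range(len(group)), key=lambda i: len(group[i]))


def toy_agent(task: str, prior: Sequence[str]) -> str:
    return f"{task}: attempt {len(prior) + 1}"


best = recursive_tournament_vote(["a", "bbb", "cc", "dddd", "e"], longest_judge)
attempts = sequential_refine("fix failing test", toy_agent)
```

Small groups keep each comparison short enough for a judge to read in full, while the recursion lets the selection cover an arbitrarily large population of rollouts.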

The reported results show the framework's practical impact: Claude-4.5-Opus improved from 70.9% to 77.6% on SWE-Bench Verified (with mini-SWE-agent), a gain of 6.7 percentage points, and from 46.9% to 59.1% on Terminal-Bench v2.0 (with Terminus 1), a gain of 12.2 percentage points. These gains suggest that test-time scaling for long-horizon agents hinges on better representing, selecting from, and reusing past experiences.

  • Two complementary scaling methods—Recursive Tournament Voting for parallel scaling and adapted PDR for sequential scaling—enable agents to reuse and build on prior attempts
  • The research frames test-time scaling for agents as fundamentally a problem of representation, selection, and experience reuse

Editorial Opinion

This research marks a meaningful shift for AI agents. By recognizing that agentic coding requires a different approach to test-time scaling, one centered on trajectory summarization and experience reuse, Anthropic has identified a powerful lever for improving long-horizon task performance. The benchmark gains (6.7 points on SWE-Bench Verified and 12.2 points on Terminal-Bench v2.0) suggest that this representation-first approach could become central to advances in autonomous agents in domains beyond code generation.

Large Language Models (LLMs), Reinforcement Learning, AI Agents, Machine Learning
