BotBeat

Stanford University · RESEARCH · 2026-04-14

Stanford & UC Berkeley Researchers Achieve State-of-the-Art on Terminal-Bench with LLM-as-a-Verifier Framework

Key Takeaways

  • LLM-as-a-Verifier achieves state-of-the-art results on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%), outperforming Claude Opus 4.6, GPT 5.4, and Gemini models
  • The framework improves on standard LLM-as-a-Judge by scaling scoring granularity, repeating verification, and decomposing evaluation criteria, addressing the 27% tie rate of coarse scoring
  • Test-time verification scaling lifts the success rate from 81.8% to 86.4% through better trajectory discrimination
Source: Hacker News (https://llm-as-a-verifier.notion.site)

Summary

Researchers from Stanford AI Lab and UC Berkeley Sky Computing Lab have introduced LLM-as-a-Verifier, a novel test-time verification framework that achieves state-of-the-art performance on two major software engineering benchmarks. The method reaches 86.4% accuracy on Terminal-Bench 2 and 77.8% on SWE-Bench Verified, surpassing frontier models including Claude Opus 4.6 and GPT 5.4. The framework works by scaling scoring granularity, implementing repeated verification, and decomposing evaluation criteria to provide fine-grained feedback for trajectory reward models.

Unlike traditional LLM-as-a-Judge approaches, which rely on coarse discrete scoring tokens, LLM-as-a-Verifier improves discrimination between complex agent trajectories through more nuanced evaluation. The researchers found that standard scoring methods produce ties 27% of the time on Terminal-Bench, limiting their effectiveness. By raising pairwise discrimination accuracy to 78.9%, the method improves downstream success rates from 81.8% to 86.4% under test-time scaling.
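The tie problem described above can be illustrated with a toy simulation. This is not the authors' code; the score ranges, noise levels, and function names are hypothetical, but it shows why quantizing judgments into a few discrete tokens produces frequent ties, while finer granularity plus repeated verification separates nearby trajectories:

```python
import random

def coarse_score(quality, rng):
    # Judge emits a discrete 1-5 token: heavy quantization causes ties.
    return max(1, min(5, round(quality * 5 + rng.gauss(0, 0.5))))

def fine_score(quality, rng, repeats=5):
    # Verifier emits a 0-100 score; averaging repeated samples
    # sharpens discrimination between trajectories of similar quality.
    samples = [quality * 100 + rng.gauss(0, 10) for _ in range(repeats)]
    return sum(samples) / len(samples)

def tie_rate(scorer, n_pairs=10_000, seed=0):
    # Fraction of random trajectory pairs the scorer cannot rank.
    rng = random.Random(seed)
    ties = 0
    for _ in range(n_pairs):
        qa, qb = rng.random(), rng.random()  # latent trajectory qualities
        if scorer(qa, rng) == scorer(qb, rng):
            ties += 1
    return ties / n_pairs

print(f"coarse 1-5 tie rate:  {tie_rate(coarse_score):.1%}")
print(f"fine 0-100 tie rate:  {tie_rate(fine_score):.1%}")
```

In this sketch the coarse scorer ties on a large fraction of pairs while the fine-grained, averaged scorer almost never does, which is the gap the paper's 27% figure points at.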

The approach is fully reproducible and publicly available on GitHub. It leverages existing scaffolds like ForgeCode and mini-swe-agent, using Claude Opus 4.6 and Gemini models to generate candidate trajectories, with Gemini 2.5 Flash serving as the verifier component.
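Based on the described pipeline of generating candidate trajectories and having a verifier model select among them, a minimal best-of-n selection loop might look like the following. This is an illustrative sketch, not the published implementation; `select_best_trajectory`, `verify`, and the patch names are all hypothetical:

```python
from statistics import mean

def select_best_trajectory(candidates, verify, repeats=3):
    """Pick the candidate with the highest mean verifier score.

    `candidates` is a list of agent trajectories (any representation);
    `verify` is a callable returning a numeric score for one trajectory.
    Repeated verification averages out scorer noise before ranking.
    """
    scored = [
        (mean(verify(traj) for _ in range(repeats)), traj)
        for traj in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]

# Toy usage: a deterministic stand-in for the LLM verifier component.
best_score, best = select_best_trajectory(
    ["patch-A", "patch-B", "patch-C"],
    verify=lambda t: {"patch-A": 62, "patch-B": 88, "patch-C": 71}[t],
)
print(best, best_score)  # patch-B 88
```

In the reported setup, the candidate generators would be agents built on the named scaffolds and the scoring callable would query Gemini 2.5 Flash, with `repeats` controlling the test-time verification budget.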


Editorial Opinion

LLM-as-a-Verifier represents an important advance in test-time scaling techniques for AI agents, demonstrating that smarter verification can unlock better performance without requiring larger or more capable base models. The framework's focus on improving discrimination between trajectories through nuanced scoring is a pragmatic approach to a real problem in LLM evaluation. If these results hold up across diverse tasks, this could influence how both research institutions and commercial AI companies approach prompt engineering and reward modeling for complex reasoning tasks.

Reinforcement Learning · AI Agents · Machine Learning

© 2026 BotBeat