Stanford & UC Berkeley Researchers Achieve State-of-the-Art on Terminal-Bench with LLM-as-a-Verifier Framework
Key Takeaways
- LLM-as-a-Verifier achieves SOTA results on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%), outperforming Claude Opus 4.6, GPT 5.4, and Gemini models
- The framework improves on standard LLM-as-a-Judge by using finer scoring granularity, repeated verification, and criteria decomposition, cutting the 27% tie rate produced by coarse scoring
- Test-time verification scaling raised success rates from 81.8% to 86.4% through better trajectory discrimination
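A toy simulation (not the paper's code) illustrates why coarse discrete scores tie so often: on a 1–5 integer scale, two independent random scores collide roughly a fifth of the time, while a fine-grained 0–100 continuous scale almost never ties. The scales and tie counts here are hypothetical stand-ins for the judge-vs-verifier contrast described above.

```python
import random

random.seed(0)
N = 10_000

# Coarse judge: discrete 1-5 scores, as with single scoring tokens.
# Two random integer scores match ~1/5 of the time.
coarse_ties = sum(
    random.randint(1, 5) == random.randint(1, 5) for _ in range(N)
)

# Fine-grained verifier: 0-100 scores at 0.1 resolution (~1000 buckets),
# so accidental ties become rare and trajectories stay rankable.
fine_ties = sum(
    round(random.uniform(0, 100), 1) == round(random.uniform(0, 100), 1)
    for _ in range(N)
)

print(f"coarse tie rate: {coarse_ties / N:.3f}")  # ~0.2
print(f"fine tie rate:   {fine_ties / N:.4f}")    # near 0
```

Fewer ties mean the verifier can actually discriminate between candidate trajectories, which is the property the reported 27% tie rate undermines.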
Summary
Researchers from Stanford AI Lab and UC Berkeley Sky Computing Lab have introduced LLM-as-a-Verifier, a novel test-time verification framework that achieves state-of-the-art performance on two major software engineering benchmarks. The method reaches 86.4% accuracy on Terminal-Bench 2 and 77.8% on SWE-Bench Verified, surpassing frontier models including Claude Opus 4.6 and GPT 5.4. The framework works by scaling scoring granularity, implementing repeated verification, and decomposing evaluation criteria to provide fine-grained feedback for trajectory reward models.
Unlike traditional LLM-as-a-Judge approaches that rely on coarse discrete scoring tokens, LLM-as-a-Verifier improves discrimination between complex agent trajectories through more nuanced evaluation. The researchers found that standard scoring methods produce ties 27% of the time on Terminal-Bench, limiting their effectiveness. By raising pairwise discrimination accuracy to 78.9%, the method lifts downstream success rates from 81.8% to 86.4% under test-time scaling.
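The selection loop described above can be sketched as best-of-N trajectory picking: score each candidate several times with a fine-grained verifier, average to reduce scoring noise, and keep the argmax. This is a minimal sketch, not the released implementation; `verify`, `select_best`, and the random stand-in scorer are hypothetical names, and a real system would call an LLM verifier (the paper uses Gemini 2.5 Flash) per decomposed criterion.

```python
import random
from statistics import mean

def verify(trajectory: str, criteria: list[str]) -> float:
    """Stand-in verifier: score each decomposed criterion on a 0-100
    scale and average. Here scores are random placeholders; a real
    system would prompt an LLM verifier for each criterion."""
    return mean(random.uniform(0, 100) for _ in criteria)

def select_best(trajectories: list[str], criteria: list[str],
                repeats: int = 5) -> str:
    """Repeated verification: average `repeats` independent scores per
    trajectory to dampen verifier noise, then return the argmax."""
    scored = [
        (mean(verify(t, criteria) for _ in range(repeats)), t)
        for t in trajectories
    ]
    return max(scored)[1]

# Candidates would come from agent scaffolds generating full trajectories.
candidates = [f"trajectory-{i}" for i in range(8)]
criteria = ["task spec satisfied", "tests pass", "no regressions"]
best = select_best(candidates, criteria)
```

The design choice mirrors the paper's framing: the base models that generate trajectories stay fixed, and extra test-time compute goes into verification rather than generation.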
The approach is fully reproducible and publicly available on GitHub. It leverages existing scaffolds like ForgeCode and mini-swe-agent, using Claude Opus 4.6 and Gemini models to generate candidate trajectories, with Gemini 2.5 Flash serving as the verifier component.
Editorial Opinion
LLM-as-a-Verifier represents an important advance in test-time scaling techniques for AI agents, demonstrating that smarter verification can unlock better performance without requiring larger or more capable base models. The framework's focus on improving discrimination between trajectories through nuanced scoring is a pragmatic approach to a real problem in LLM evaluation. If these results hold up across diverse tasks, this could influence how both research institutions and commercial AI companies approach prompt engineering and reward modeling for complex reasoning tasks.