BotBeat

Composo AI
OPEN SOURCE · 2026-04-02

Composo Open-Sources LLM-as-Judge Technique, Achieving 83.6% on RewardBench 2

Key Takeaways

  • Criteria injection combined with k=8 ensembling achieves 83.6% accuracy on RewardBench 2—a strong baseline requiring minimal complexity
  • Advanced techniques (calibration, model routing, soft blending) did not consistently improve results, suggesting simplicity and efficiency often outperform added complexity
  • The open-source release includes reproducible code, data collection scripts, and full technical methodology, lowering barriers for researchers to implement and build upon this work
Source: Hacker News — https://github.com/composo-ai/llm-judge-criteria-ensembling

Summary

Composo AI has open-sourced its research on optimizing LLM-as-Judge techniques, achieving 83.6% accuracy on RewardBench 2 through a combination of criteria injection and ensembling. The work, detailed in a systematic evaluation paper by Ryan Lail, tested five candidate techniques and found that a simple one-sentence task-specific criterion paired with k=8 ensembling delivered the best results at its cost level. The research reveals that more complex approaches like calibration, model routing, and soft blending did not reliably improve upon this straightforward method.
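The core recipe described above—injecting a one-sentence, task-specific criterion into the judge prompt and majority-voting over k=8 independent samples—can be sketched in a few lines. This is a minimal illustration, not the repo's actual API; the prompt template and function names are assumptions.

```python
# Minimal sketch of criteria injection + k-way ensembling for an LLM judge.
# The prompt wording and helper names are illustrative assumptions, not the
# actual implementation from composo-ai/llm-judge-criteria-ensembling.
from collections import Counter


def build_judge_prompt(task, response_a, response_b, criterion):
    """Inject a one-sentence, task-specific criterion into the judge prompt."""
    return (
        f"Evaluation criterion: {criterion}\n\n"
        f"Task: {task}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better satisfies the criterion? Answer 'A' or 'B'."
    )


def ensemble_verdict(verdicts):
    """Majority-vote over k independent judge samples (the paper uses k=8)."""
    winner, _count = Counter(verdicts).most_common(1)[0]
    return winner


# Example: 8 verdicts sampled from the same prompt at nonzero temperature.
verdicts = ["A", "A", "B", "A", "A", "B", "A", "A"]
print(ensemble_verdict(verdicts))  # "A" wins 6-2
```

In practice each verdict would come from a separate API call on the prompt returned by `build_judge_prompt`; ensembling smooths out the per-sample noise of a single judgment.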

The open-source release includes complete code, experimental methodology, and reproducible data collection scripts built on Azure OpenAI's GPT-5.4 models. The project provides a unified collection framework that minimizes API costs while enabling researchers to derive multiple experimental conditions from a single run. All code is publicly available with full technical documentation, making the technique accessible to the broader AI community for building more accurate reward models and evaluation systems.

  • The efficient experimental design allows deriving multiple conditions offline from single API runs, reducing computational cost while maintaining research rigor
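The "collect once, derive many" design can be illustrated as follows: sample one large pool of judge verdicts per example via the API, then evaluate smaller ensemble sizes offline by subsampling that pool. This is a hedged sketch of the general idea under assumed names; the repo's actual framework may differ.

```python
# Sketch of deriving multiple ensemble-size conditions offline from a single
# pooled collection run. All names and the pool format are assumptions made
# for illustration, not the repo's actual interface.
import random
from collections import Counter


def majority(verdicts):
    """Majority vote over a list of verdict labels."""
    return Counter(verdicts).most_common(1)[0][0]


def derive_conditions(pool, ks, trials=100, seed=0):
    """Estimate accuracy for each ensemble size k without new API calls.

    pool: list of (verdicts, gold) pairs, one pooled run per example,
          where len(verdicts) >= max(ks).
    """
    rng = random.Random(seed)
    results = {}
    for k in ks:
        correct = total = 0
        for verdicts, gold in pool:
            for _ in range(trials):  # resample to reduce subsampling noise
                sample = rng.sample(verdicts, k)
                correct += majority(sample) == gold
                total += 1
        results[k] = correct / total
    return results


# One pooled run of 16 verdicts per example; k = 1, 4, 8 derived offline.
pool = [(["A"] * 11 + ["B"] * 5, "A"), (["B"] * 12 + ["A"] * 4, "B")]
print(derive_conditions(pool, ks=[1, 4, 8]))
```

The API cost is paid once per example for the full pool; every smaller k (and any other condition derivable from stored verdicts) is then a free offline computation.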

Editorial Opinion

Composo's contribution is valuable precisely because it demonstrates that effective LLM-as-Judge performance doesn't require architectural complexity or exotic techniques—a focused, well-engineered approach wins. By open-sourcing both the method and the experimental framework, they've created a reproducible baseline that will likely accelerate progress in reward modeling and preference optimization across the industry. This is the kind of practical, cost-conscious research that moves the field forward.

Large Language Models (LLMs) · Natural Language Processing (NLP) · Machine Learning · Research
