Composo Open-Sources LLM-as-Judge Technique, Achieving 83.6% on RewardBench 2
Key Takeaways
- Criteria injection combined with k=8 ensembling achieves 83.6% accuracy on RewardBench 2, a strong baseline requiring minimal complexity
- Advanced techniques (calibration, model routing, soft blending) did not consistently improve results, suggesting simplicity and efficiency often outperform added complexity
- The open-source release includes reproducible code, data collection scripts, and full technical methodology, lowering barriers for researchers to implement and build upon this work
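The core technique in the takeaways above can be sketched as follows. This is a minimal illustration, not Composo's actual implementation: `call_judge` is a hypothetical stand-in for a real chat-completion call, and the prompt wording is invented for the example. The idea is simply to prepend a one-sentence, task-specific criterion to the judge prompt and majority-vote over k independent samples.

```python
# Sketch of criteria injection + k-way ensembling for LLM-as-Judge.
# Assumes a hypothetical `call_judge(prompt) -> "A" | "B"` function that
# samples the judge model once (temperature > 0 for vote diversity).
from collections import Counter


def build_prompt(criterion: str, task: str, response_a: str, response_b: str) -> str:
    # Criteria injection: a single task-specific sentence steers the judge.
    return (
        f"Judge which response better satisfies this criterion: {criterion}\n\n"
        f"Task: {task}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with exactly 'A' or 'B'."
    )


def ensemble_judge(call_judge, criterion, task, response_a, response_b, k=8):
    # Collect k independent votes, then take the majority.
    prompt = build_prompt(criterion, task, response_a, response_b)
    votes = [call_judge(prompt) for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```

With a real API client, `call_judge` would wrap a single sampled completion; here the structure is the point, not the client code.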
Summary
Composo AI has open-sourced its research on optimizing LLM-as-Judge techniques, achieving 83.6% accuracy on RewardBench 2 through a combination of criteria injection and ensembling. The work, detailed in a systematic evaluation paper by Ryan Lail, tested five candidate techniques and found that a simple one-sentence task-specific criterion paired with k=8 ensembling delivered the best results for its cost. The research shows that more complex approaches, such as calibration, model routing, and soft blending, did not reliably improve on this straightforward method.
The open-source release includes complete code, experimental methodology, and reproducible data collection scripts built on Azure OpenAI's GPT-5.4 models. The project provides a unified collection framework that minimizes API costs while enabling researchers to derive multiple experimental conditions from a single run. All code is publicly available with full technical documentation, making the technique accessible to the broader AI community for building more accurate reward models and evaluation systems.
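The cost-saving idea behind the unified collection framework can be illustrated with a small sketch. This is an assumption about how such derivation might work, not Composo's actual code: if the raw per-sample judge votes from a single k=8 run are stored, smaller-k ensemble conditions can be evaluated offline by subsampling the stored votes, with no further API calls. The record schema here is hypothetical.

```python
# Sketch: deriving smaller-k ensemble conditions offline from one stored k=8 run.
# Each record holds the 8 raw judge votes plus the gold label (hypothetical schema).
from collections import Counter


def majority(votes):
    # Majority vote; on a tie, Counter keeps first-seen order, which is
    # acceptable for a sketch like this.
    return Counter(votes).most_common(1)[0][0]


def accuracy_at_k(records, k):
    # Reuse the first k of the stored votes instead of re-calling the API.
    correct = sum(majority(r["votes"][:k]) == r["gold"] for r in records)
    return correct / len(records)


# Toy stored run: two comparisons, 8 votes each.
records = [
    {"votes": ["A"] * 6 + ["B"] * 2, "gold": "A"},
    {"votes": ["B"] * 5 + ["A"] * 3, "gold": "B"},
]
# accuracy_at_k(records, 1) and accuracy_at_k(records, 8) both come
# from the same stored run, so the k sweep costs nothing extra.
```

The design choice is that the expensive artifact (sampled votes) is collected once at the largest k, and every cheaper condition becomes a pure post-processing step.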
Editorial Opinion
Composo's contribution is valuable precisely because it demonstrates that effective LLM-as-Judge performance doesn't require architectural complexity or exotic techniques; a focused, well-engineered approach wins. By open-sourcing both the method and the experimental framework, they've created a reproducible baseline that will likely accelerate progress in reward modeling and preference optimization across the industry. This is the kind of practical, cost-conscious research that moves the field forward.