Composo Open-Sources LLM-as-Judge Technique, Achieving 83.6% on RewardBench 2
Key Takeaways
- Criteria injection combined with k=8 ensembling achieves 83.6% accuracy on RewardBench 2, a strong baseline requiring minimal complexity
- Advanced techniques (calibration, model routing, soft blending) did not consistently improve results, suggesting simplicity and efficiency often outperform added complexity
- The open-source release includes reproducible code, data collection scripts, and full technical methodology, lowering barriers for researchers to implement and build upon this work
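The core technique in the takeaways above can be sketched as follows. This is a minimal illustration, not Composo's actual implementation: `call_judge` is a hypothetical stand-in for a real chat-completion call, and the prompt wording is invented for the example. The idea is simply to prepend a one-sentence, task-specific criterion to the judge prompt and majority-vote over k independent samples.

```python
# Sketch of criteria injection + k-way ensembling for LLM-as-Judge.
# Assumes a hypothetical `call_judge(prompt) -> "A" | "B"` function that
# samples the judge model once (temperature > 0 for vote diversity).
from collections import Counter


def build_prompt(criterion: str, task: str, response_a: str, response_b: str) -> str:
    # Criteria injection: a single task-specific sentence steers the judge.
    return (
        f"Judge which response better satisfies this criterion: {criterion}\n\n"
        f"Task: {task}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with exactly 'A' or 'B'."
    )


def ensemble_judge(call_judge, criterion, task, response_a, response_b, k=8):
    # Collect k independent votes, then take the majority.
    prompt = build_prompt(criterion, task, response_a, response_b)
    votes = [call_judge(prompt) for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```

With a real API client, `call_judge` would wrap a single sampled completion; here the structure is the point, not the client code.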
Summary
Composo AI has open-sourced its research on optimizing LLM-as-Judge techniques, achieving 83.6% accuracy on RewardBench 2 through a combination of criteria injection and ensembling. The work, detailed in a systematic evaluation paper by Ryan Lail, tested five candidate techniques and found that a simple one-sentence task-specific criterion paired with k=8 ensembling delivered the best results for its cost. The research shows that more complex approaches, such as calibration, model routing, and soft blending, did not reliably improve on this straightforward method.
The open-source release includes complete code, experimental methodology, and reproducible data collection scripts built on Azure OpenAI's GPT-5.4 models. The project provides a unified collection framework that minimizes API costs while enabling researchers to derive multiple experimental conditions from a single run. All code is publicly available with full technical documentation, making the technique accessible to the broader AI community for building more accurate reward models and evaluation systems.
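The cost-saving idea behind the unified collection framework can be illustrated with a small sketch. This is an assumption about how such derivation might work, not Composo's actual code: if the raw per-sample judge votes from a single k=8 run are stored, smaller-k ensemble conditions can be evaluated offline by subsampling the stored votes, with no further API calls. The record schema here is hypothetical.

```python
# Sketch: deriving smaller-k ensemble conditions offline from one stored k=8 run.
# Each record holds the 8 raw judge votes plus the gold label (hypothetical schema).
from collections import Counter


def majority(votes):
    # Majority vote; on a tie, Counter keeps first-seen order, which is
    # acceptable for a sketch like this.
    return Counter(votes).most_common(1)[0][0]


def accuracy_at_k(records, k):
    # Reuse the first k of the stored votes instead of re-calling the API.
    correct = sum(majority(r["votes"][:k]) == r["gold"] for r in records)
    return correct / len(records)


# Toy stored run: two comparisons, 8 votes each.
records = [
    {"votes": ["A"] * 6 + ["B"] * 2, "gold": "A"},
    {"votes": ["B"] * 5 + ["A"] * 3, "gold": "B"},
]
# accuracy_at_k(records, 1) and accuracy_at_k(records, 8) both come
# from the same stored run, so the k sweep costs nothing extra.
```

The design choice is that the expensive artifact (sampled votes) is collected once at the largest k, and every cheaper condition becomes a pure post-processing step.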
Editorial Opinion
Composo's contribution is valuable precisely because it demonstrates that effective LLM-as-Judge performance doesn't require architectural complexity or exotic techniques; a focused, well-engineered approach wins. By open-sourcing both the method and the experimental framework, they've created a reproducible baseline that will likely accelerate progress in reward modeling and preference optimization across the industry. This is the kind of practical, cost-conscious research that moves the field forward.