Researchers Develop Automated System for Generating and Resolving AI Forecasting Questions at Scale
Key Takeaways
- Automated pipeline yields approximately 96% verifiable question quality, exceeding human-curated benchmarks such as Metaculus
- LLM-powered web research agents generate diverse, real-world forecasting questions at scale, without relying on limited recurring data sources
- Clear performance differences across LLMs enable robust benchmarking and evaluation of AI forecasting capabilities
Summary
A new research system uses LLM-powered web research agents to automatically generate and resolve high-quality forecasting questions for evaluating AI systems. The approach addresses a critical bottleneck in AI evaluation by replacing manual curation with an automated pipeline capable of producing diverse, real-world questions at scale.
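The summary describes the pipeline only at a high level. As a rough, hypothetical sketch of the two-phase structure it implies (questions generated now, then resolved by web research agents after their deadlines pass), the outline below uses placeholder functions such as `generate_question` and `resolve_question` that stand in for the paper's LLM agents rather than any published interface:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ForecastQuestion:
    text: str                    # e.g. "Will X be announced before 2025-09-01?"
    resolution_criteria: str     # what evidence would count as a YES
    deadline: date
    outcome: bool | None = None  # filled in months later, at resolution time

def generate_question(topic: str) -> ForecastQuestion:
    """Placeholder for the LLM web-research generation agent (hypothetical)."""
    return ForecastQuestion(
        text=f"Will a verifiable development in {topic} occur before the deadline?",
        resolution_criteria="Resolved YES if credible public reporting confirms it.",
        deadline=date(2025, 9, 1),
    )

def resolve_question(q: ForecastQuestion) -> bool:
    """Placeholder for the LLM web-research resolution agent (hypothetical)."""
    raise NotImplementedError("Would search the web and apply the resolution criteria.")

def build_benchmark(topics: list[str], n_per_topic: int) -> list[ForecastQuestion]:
    """Phase 1: draft a pool of verifiable questions across diverse topics."""
    return [generate_question(t) for t in topics for _ in range(n_per_topic)]

def resolve_benchmark(questions: list[ForecastQuestion], today: date) -> None:
    """Phase 2, months later: resolve every question whose deadline has passed."""
    for q in questions:
        if q.deadline <= today and q.outcome is None:
            q.outcome = resolve_question(q)
```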
The researchers generated 1,499 diverse forecasting questions and resolved them several months later, achieving approximately 96% verifiable question quality and exceeding the rate of Metaculus, a leading human-curated forecasting platform. The system also demonstrated 95% accuracy in question resolution. Testing different LLM-powered forecasting agents revealed clear performance differentiation: Gemini 3 Pro achieved a Brier score of 0.134, GPT-5 scored 0.149, and Gemini 2.5 Flash scored 0.179 (lower Brier scores indicate more accurate forecasts).
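For context, the Brier score is the mean squared error between a probabilistic forecast and the binary outcome; lower is better, and a constant 50% forecast scores 0.25. A minimal illustration with made-up numbers (not the paper's data):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Toy example: three resolved questions and one model's probabilities for YES.
probs = [0.80, 0.30, 0.60]   # hypothetical forecasts
truth = [1, 0, 0]            # how the questions actually resolved
print(brier_score(probs, truth))  # (0.04 + 0.09 + 0.36) / 3 ≈ 0.163
```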
Beyond evaluation, the research demonstrates practical applications for improving forecasting performance. Evaluating a question decomposition strategy on the generated question set improved the Brier score from 0.141 to 0.132, showing how automated question generation enables systematic evaluation and iterative refinement of forecasting methods.
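The exact decomposition strategy is not detailed in this summary. One common approach, shown here purely as a hypothetical sketch, is to split a compound question into sub-events, forecast each separately, and recombine the probabilities, for example multiplying P(A) by P(B given A) for a conjunctive question:

```python
def combine_conjunctive(p_a: float, p_b_given_a: float) -> float:
    """P(A and B) = P(A) * P(B | A) for a question that requires both sub-events."""
    return p_a * p_b_given_a

# Hypothetical decomposition of "Will the product launch AND pass 1M users by Q4?"
p_launch = 0.7               # sub-forecast: the product launches by Q4
p_users_given_launch = 0.4   # sub-forecast: passes 1M users, given a launch
print(combine_conjunctive(p_launch, p_users_given_launch))  # 0.28
```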
Editorial Opinion
This work represents an important advancement in AI evaluation methodology by automating a labor-intensive and previously bottlenecked process. The system's ability to exceed human-curated benchmark quality while operating at scale could accelerate the development and assessment of AI forecasting capabilities. The demonstration that forecasting performance varies significantly by model and improves with better decomposition strategies suggests this automated framework will become a valuable tool for both evaluation and iterative improvement of AI systems.


