Researchers Develop Automated System for Generating and Resolving AI Forecasting Questions at Scale
Key Takeaways
- Automated pipeline yields approximately 96% verifiable question quality, exceeding human-curated benchmarks such as Metaculus
- LLM-powered web research agents generate diverse, real-world forecasting questions at scale, without relying on limited recurring data sources
- Clear performance differences across LLMs enable robust benchmarking and evaluation of AI forecasting capabilities
Summary
A new research system uses LLM-powered web research agents to automatically generate and resolve high-quality forecasting questions for evaluating AI systems. The approach addresses a critical bottleneck in AI evaluation by replacing manual curation with an automated pipeline capable of producing diverse, real-world questions at scale.
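The summary describes the pipeline only at a high level. As a rough, hypothetical sketch of the two-phase structure it implies (questions generated now, then resolved by web research agents after their deadlines pass), the outline below uses placeholder functions such as `generate_question` and `resolve_question` that stand in for the paper's LLM agents rather than any published interface:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ForecastQuestion:
    text: str                    # e.g. "Will X be announced before 2025-09-01?"
    resolution_criteria: str     # what evidence would count as a YES
    deadline: date
    outcome: bool | None = None  # filled in months later, at resolution time

def generate_question(topic: str) -> ForecastQuestion:
    """Placeholder for the LLM web-research generation agent (hypothetical)."""
    return ForecastQuestion(
        text=f"Will a verifiable development in {topic} occur before the deadline?",
        resolution_criteria="Resolved YES if credible public reporting confirms it.",
        deadline=date(2025, 9, 1),
    )

def resolve_question(q: ForecastQuestion) -> bool:
    """Placeholder for the LLM web-research resolution agent (hypothetical)."""
    raise NotImplementedError("Would search the web and apply the resolution criteria.")

def build_benchmark(topics: list[str], n_per_topic: int) -> list[ForecastQuestion]:
    """Phase 1: draft a pool of verifiable questions across diverse topics."""
    return [generate_question(t) for t in topics for _ in range(n_per_topic)]

def resolve_benchmark(questions: list[ForecastQuestion], today: date) -> None:
    """Phase 2, months later: resolve every question whose deadline has passed."""
    for q in questions:
        if q.deadline <= today and q.outcome is None:
            q.outcome = resolve_question(q)
```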
The researchers generated 1,499 diverse forecasting questions and resolved them several months later, achieving approximately 96% verifiable question quality and exceeding the rate of Metaculus, a leading human-curated forecasting platform. The system also demonstrated 95% accuracy in question resolution. Testing different LLM-powered forecasting agents revealed clear performance differentiation: Gemini 3 Pro achieved a Brier score of 0.134, GPT-5 scored 0.149, and Gemini 2.5 Flash scored 0.179 (lower Brier scores indicate more accurate forecasts).
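For context, the Brier score is the mean squared error between a probabilistic forecast and the binary outcome; lower is better, and a constant 50% forecast scores 0.25. A minimal illustration with made-up numbers (not the paper's data):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Toy example: three resolved questions and one model's probabilities for YES.
probs = [0.80, 0.30, 0.60]   # hypothetical forecasts
truth = [1, 0, 0]            # how the questions actually resolved
print(brier_score(probs, truth))  # (0.04 + 0.09 + 0.36) / 3 ≈ 0.163
```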
Beyond evaluation, the research demonstrates practical applications for improving forecasting performance. Evaluating a question decomposition strategy on the generated question set improved the Brier score from 0.141 to 0.132, showing how automated question generation enables systematic evaluation and iterative refinement of forecasting methods.
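The exact decomposition strategy is not detailed in this summary. One common approach, shown here purely as a hypothetical sketch, is to split a compound question into sub-events, forecast each separately, and recombine the probabilities, for example multiplying P(A) by P(B given A) for a conjunctive question:

```python
def combine_conjunctive(p_a: float, p_b_given_a: float) -> float:
    """P(A and B) = P(A) * P(B | A) for a question that requires both sub-events."""
    return p_a * p_b_given_a

# Hypothetical decomposition of "Will the product launch AND pass 1M users by Q4?"
p_launch = 0.7               # sub-forecast: the product launches by Q4
p_users_given_launch = 0.4   # sub-forecast: passes 1M users, given a launch
print(combine_conjunctive(p_launch, p_users_given_launch))  # 0.28
```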
Editorial Opinion
This work represents an important advancement in AI evaluation methodology by automating a labor-intensive and previously bottlenecked process. The system's ability to exceed human-curated benchmark quality while operating at scale could accelerate the development and assessment of AI forecasting capabilities. The demonstration that forecasting performance varies significantly by model and improves with better decomposition strategies suggests this automated framework will become a valuable tool for both evaluation and iterative improvement of AI systems.


