BotBeat
RESEARCH | Google / Alphabet | 2026-03-17

Researchers Develop Automated System for Generating and Resolving AI Forecasting Questions at Scale

Key Takeaways

  • Roughly 96% of the automatically generated questions are verifiable, a higher rate than on Metaculus, a human-curated forecasting platform
  • LLM-powered web research agents generate diverse, real-world forecasting questions at scale without relying on a limited set of recurring data sources
  • Forecasting performance differs clearly across LLMs, which makes the question set usable for benchmarking AI forecasting capabilities
Source: Hacker News, https://arxiv.org/abs/2601.22444

Summary

A new research system uses LLM-powered web research agents to automatically generate and resolve high-quality forecasting questions for evaluating AI systems. The approach addresses a critical bottleneck in AI evaluation by replacing manual curation with an automated pipeline capable of producing diverse, real-world questions at scale.

The researchers generated 1,499 diverse forecasting questions and resolved them several months later. They estimate that approximately 96% of the questions are verifiable, exceeding the rate of Metaculus, a leading human-curated forecasting platform. The system also resolved questions with 95% accuracy. Testing with different LLM-powered forecasting agents revealed clear performance differentiation: Gemini 3 Pro achieved a Brier score of 0.134, GPT-5 scored 0.149, and Gemini 2.5 Flash scored 0.179 (lower is better).
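For readers unfamiliar with the metric, the sketch below shows how a Brier score is commonly computed for binary questions: the mean squared difference between the forecast probability and the resolved outcome (1 if the event happened, 0 if not). The function name and the sample forecasts are hypothetical and are not taken from the paper, which may use a different variant or aggregation.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts for four binary questions (illustrative values only).
forecasts = [0.9, 0.2, 0.7, 0.5]
resolved = [1, 0, 0, 1]  # 1 = the event happened, 0 = it did not

score = brier_score(forecasts, resolved)
print(f"Brier score: {score:.4f}")
```

Under this definition a perfect forecaster scores 0, and a forecaster who always answers 0.5 scores 0.25, which gives some context for the reported values between 0.134 and 0.179.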

Beyond evaluation, the research demonstrates a practical use for improving forecasting performance. Evaluating a question decomposition strategy on the generated question set yielded a significant improvement in Brier score (0.132 vs. 0.141, lower is better), showing how automated question generation can drive iterative refinement of forecasting approaches.
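A comparison like this can be sketched as a paired evaluation: both strategies forecast the same questions, and the per-question squared errors are compared. The code below is a minimal illustration under that assumption. The variable names and probabilities are invented for the example, and it is not the authors' evaluation code.

```python
from statistics import mean


def per_question_brier(probabilities, outcomes):
    """Squared error of each forecast against its binary outcome."""
    return [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]


# Hypothetical resolved outcomes and forecasts from two strategies
# on the same five questions (illustrative values only).
outcomes = [1, 0, 1, 0, 1]
baseline_probs = [0.6, 0.4, 0.7, 0.3, 0.5]
decomposed_probs = [0.7, 0.3, 0.8, 0.2, 0.6]

baseline = per_question_brier(baseline_probs, outcomes)
decomposed = per_question_brier(decomposed_probs, outcomes)

# Positive difference means the decomposition strategy did better on that question.
diffs = [b - d for b, d in zip(baseline, decomposed)]
improvement = mean(diffs)
wins = sum(d > 0 for d in diffs)

print(f"baseline Brier:   {mean(baseline):.3f}")
print(f"decomposed Brier: {mean(decomposed):.3f}")
print(f"mean improvement: {improvement:.3f} ({wins}/{len(diffs)} questions improved)")
```

Pairing the scores per question keeps question difficulty constant across the two strategies, so a small gap in the averages is easier to interpret than it would be on two different question sets.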

  • Automated question generation enables direct improvement of forecasting methods through systematic evaluation and iteration

Editorial Opinion

This work represents an important advancement in AI evaluation methodology by automating a labor-intensive and previously bottlenecked process. The system's ability to exceed human-curated benchmark quality while operating at scale could accelerate the development and assessment of AI forecasting capabilities. The demonstration that forecasting performance varies significantly by model and improves with better decomposition strategies suggests this automated framework will become a valuable tool for both evaluation and iterative improvement of AI systems.

Tags: Large Language Models (LLMs), AI Agents, Machine Learning, Data Science & Analytics
