Mantic Achieves Superforecaster-Level Accuracy by Fine-Tuning LLMs with Reinforcement Learning
Key Takeaways
- Fine-tuning LLMs specifically for forecasting using reinforcement learning can match or exceed the performance of frontier off-the-shelf models
- AI forecasting systems have reached superforecaster-level accuracy on geopolitical and current affairs predictions, with implications for decision-making across government and industry
- A two-phase architecture combining research agents for context gathering with specialized prediction tools proves effective for judgmental forecasting tasks
Summary
Mantic has demonstrated that fine-tuning large language models specifically for forecasting using reinforcement learning can significantly improve prediction accuracy on geopolitical and current events questions. Using a framework called Tinker, the team trained a model on approximately 10,000 binary forecasting questions, rewarding the model for assigning higher probability to correct real-world outcomes. The resulting fine-tuned model achieved performance comparable to frontier LLMs like GPT-5 and Gemini 3, despite starting with lower initial capabilities.
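The summary says the model was rewarded for assigning higher probability to outcomes that actually occurred. Mantic's exact reward function is not published here, but a log score is one standard choice for binary questions; the sketch below is an illustrative assumption, not the actual training code.

```python
import math

def log_score_reward(p: float, outcome: int, eps: float = 1e-6) -> float:
    """Hypothetical RL reward for a binary forecast: the log-probability
    the model assigned to the realized outcome (1 = happened, 0 = didn't).
    Higher (closer to 0) is better; confident wrong answers are punished hard."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p) if outcome == 1 else math.log(1.0 - p)

# A well-calibrated confident forecast earns more reward than a hedge,
# and far more than a confident miss:
# log_score_reward(0.9, 1) > log_score_reward(0.5, 1) > log_score_reward(0.1, 1)
```

The log score is a proper scoring rule, so the reward-maximizing policy is to report honest calibrated probabilities rather than to game the metric.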
The research reveals that AI forecasting systems are now approaching superforecaster-level accuracy, with the best systems rivaling top human forecasters in tournaments like the Metaculus Cup. Mantic's two-phase architecture—combining deep research agents that gather contextual information with a specialized prediction phase—appears to be key to the approach's success. When integrated into an ensemble with other models, the fine-tuned system becomes one of the most important contributors, offering decorrelated predictions that improve overall accuracy.
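The value of a decorrelated member in an ensemble can be illustrated with a minimal sketch. The numbers and the unweighted-mean combination rule below are assumptions for illustration; the summary does not specify how Mantic weights its ensemble members.

```python
def ensemble_probability(forecasts: list[float]) -> float:
    """Combine member forecasts with a simple unweighted mean.
    Averaging reduces variance most when members' errors are decorrelated,
    which is why a fine-tuned model that errs differently from frontier
    LLMs can improve the ensemble even without being the best member."""
    return sum(forecasts) / len(forecasts)

# Illustrative (made-up) probabilities for one binary question:
frontier_llm = 0.70   # e.g. an off-the-shelf frontier model
fine_tuned = 0.58     # the specialized model, erring in a different direction
print(ensemble_probability([frontier_llm, fine_tuned]))  # 0.64
```

Under a proper scoring rule such as the Brier score, the averaged forecast can never score worse than the average of the members' scores, so adding a decorrelated member is a low-risk way to improve accuracy.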
Editorial Opinion
This research makes a compelling case for task-specific model optimization: even a model starting from a lower baseline can reach state-of-the-art performance when fine-tuned on relevant data with an appropriate reward signal. The convergence of multiple forecasting approaches toward superforecaster-level accuracy suggests AI is genuinely improving decision-making capability in high-stakes domains like geopolitics. The implications warrant care, however: as these systems gain influence over government and policy decisions, understanding their failure modes and maintaining human oversight becomes increasingly important.