Mantic Achieves Superforecaster-Level Accuracy by Fine-Tuning LLMs with Reinforcement Learning
Key Takeaways
- Fine-tuning LLMs specifically for forecasting using reinforcement learning can match or exceed the performance of frontier off-the-shelf models
- AI forecasting systems have reached superforecaster-level accuracy on geopolitical and current affairs predictions, with implications for decision-making across government and industry
- A two-phase architecture combining research agents for context gathering with specialized prediction tools proves effective for judgmental forecasting tasks
Summary
Mantic has demonstrated that fine-tuning large language models specifically for forecasting using reinforcement learning can significantly improve prediction accuracy on geopolitical and current events questions. Using a framework called Tinker, the team trained a model on approximately 10,000 binary forecasting questions, rewarding the model for assigning higher probability to correct real-world outcomes. The resulting fine-tuned model achieved performance comparable to frontier LLMs like GPT-5 and Gemini 3, despite starting with lower initial capabilities.
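The summary says the model was rewarded for assigning higher probability to outcomes that actually occurred. Mantic's exact reward function is not published here, but a log score is one standard choice for binary questions; the sketch below is an illustrative assumption, not the actual training code.

```python
import math

def log_score_reward(p: float, outcome: int, eps: float = 1e-6) -> float:
    """Hypothetical RL reward for a binary forecast: the log-probability
    the model assigned to the realized outcome (1 = happened, 0 = didn't).
    Higher (closer to 0) is better; confident wrong answers are punished hard."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p) if outcome == 1 else math.log(1.0 - p)

# A well-calibrated confident forecast earns more reward than a hedge,
# and far more than a confident miss:
# log_score_reward(0.9, 1) > log_score_reward(0.5, 1) > log_score_reward(0.1, 1)
```

The log score is a proper scoring rule, so the reward-maximizing policy is to report honest calibrated probabilities rather than to game the metric.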
The research reveals that AI forecasting systems are now approaching superforecaster-level accuracy, with the best systems rivaling top human forecasters in tournaments like the Metaculus Cup. Mantic's two-phase architecture—combining deep research agents that gather contextual information with a specialized prediction phase—appears to be key to the approach's success. When integrated into an ensemble with other models, the fine-tuned system becomes one of the most important contributors, offering decorrelated predictions that improve overall accuracy.
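The value of a decorrelated member in an ensemble can be illustrated with a minimal sketch. The numbers and the unweighted-mean combination rule below are assumptions for illustration; the summary does not specify how Mantic weights its ensemble members.

```python
def ensemble_probability(forecasts: list[float]) -> float:
    """Combine member forecasts with a simple unweighted mean.
    Averaging reduces variance most when members' errors are decorrelated,
    which is why a fine-tuned model that errs differently from frontier
    LLMs can improve the ensemble even without being the best member."""
    return sum(forecasts) / len(forecasts)

# Illustrative (made-up) probabilities for one binary question:
frontier_llm = 0.70   # e.g. an off-the-shelf frontier model
fine_tuned = 0.58     # the specialized model, erring in a different direction
print(ensemble_probability([frontier_llm, fine_tuned]))  # 0.64
```

Under a proper scoring rule such as the Brier score, the averaged forecast can never score worse than the average of the members' scores, so adding a decorrelated member is a low-risk way to improve accuracy.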
Editorial Opinion
This research makes a compelling case for task-specific model optimization: even a model starting from a lower baseline can reach state-of-the-art performance when fine-tuned on relevant data with an appropriate reward signal. The convergence of multiple forecasting approaches toward superforecaster-level accuracy suggests AI is genuinely improving decision-making capability in high-stakes domains like geopolitics. The implications warrant care, however: as these systems gain influence over government and policy decisions, understanding their failure modes and maintaining human oversight becomes increasingly important.