PokeAgent Challenge: Researchers Launch Largest AI Pokemon Tournament as Open Benchmark for Decision-Making
Key Takeaways
- PokeAgent Challenge combines competitive Pokemon battles and RPG speedrunning into a dual-track benchmark designed to test partial observability, game-theoretic reasoning, and long-horizon planning simultaneously
- The benchmark includes 20M+ battle trajectories and over 100 NeurIPS 2025 competition submissions, revealing significant gaps between generalist LLMs, specialist RL agents, and elite human players
- Pokemon battling measures AI capabilities orthogonal to those captured by existing LLM benchmarks, positioning it as an unsolved problem space that can drive future RL and LLM research
Summary
Researchers have created and open-sourced the PokeAgent Challenge, a comprehensive benchmark for AI decision-making built around Pokemon's multi-agent battle system and RPG environment. The benchmark features two complementary tracks: a Battling Track focused on strategic reasoning under partial observability in competitive battles, and a Speedrunning Track emphasizing long-horizon planning in the Pokemon RPG. The project includes a dataset of over 20 million battle trajectories and multiple baselines (heuristic, RL, and LLM-based), and it was validated through a NeurIPS 2025 competition that attracted over 100 participating teams.
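The article does not reproduce the PokeAgent APIs, but the core difficulty of the Battling Track can be made concrete with a toy. The self-contained Python sketch below (all names and numbers are invented here, not taken from the benchmark) captures the two properties highlighted above: each agent acts on only a partial observation of the game state, and both players' moves resolve simultaneously each turn.

```python
# Toy illustration only; this is NOT the PokeAgent API. It miniaturizes the
# two structural features the Battling Track tests: partial observability
# and simultaneous move selection.
import random

DAMAGE = {"tackle": 10, "slam": 16}  # damage dealt to the opponent
HEAL = {"rest": 12}                  # HP restored to the user
LEGAL = list(DAMAGE) + list(HEAL)

class RandomAgent:
    """Trivial baseline: pick uniformly among legal moves."""
    def choose(self, observation):
        return random.choice(LEGAL)

def battle(agent_a, agent_b, max_turns=50):
    hp = [100, 100]
    agents = [agent_a, agent_b]
    for _ in range(max_turns):
        # Partial observability: each side sees its own exact HP but only a
        # coarse bucket for the opponent, a stand-in for Pokemon's hidden
        # information (unrevealed team members, items, stat spreads).
        moves = []
        for i, agent in enumerate(agents):
            obs = {"my_hp": hp[i], "foe": "healthy" if hp[1 - i] > 50 else "weakened"}
            moves.append(agent.choose(obs))
        # Simultaneous resolution: neither agent saw the other's choice, so
        # each turn is a small matrix game rather than an alternating one.
        for i, move in enumerate(moves):
            if move in HEAL:
                hp[i] = min(hp[i] + HEAL[move], 100)
            else:
                hp[1 - i] -= DAMAGE[move]
        if min(hp) <= 0:
            break
    return "agent_a" if hp[0] > hp[1] else "agent_b"

if __name__ == "__main__":
    print(battle(RandomAgent(), RandomAgent()))
```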
The benchmark addresses critical gaps in AI research by simultaneously testing partial observability, game-theoretic reasoning, and long-horizon planning—capabilities not adequately measured by existing benchmarks. Analysis shows that Pokemon battling is nearly orthogonal to standard LLM evaluation suites, meaning it measures distinct capabilities that frontier AI models struggle with. The researchers have transitioned PokeAgent into a living benchmark with a public leaderboard for the Battling Track and reproducible evaluation tools for the Speedrunning Track, making it accessible to the broader research community.
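Concretely, the orthogonality claim is a statement about correlation between model rankings: scoring the same set of models on a standard LLM suite and on the battling ladder should yield rankings that barely agree. A minimal sketch of that check follows, with invented scores, since the article reports no raw per-model numbers.

```python
# Illustration of what "nearly orthogonal" means operationally. The scores
# below are made up for demonstration purposes only.
from scipy.stats import spearmanr

# Hypothetical results for the same six models on two benchmarks.
llm_suite_score = [88.1, 85.4, 82.0, 79.3, 75.5, 70.2]  # aggregate LLM benchmark
battle_elo = [1310, 1490, 1275, 1520, 1340, 1450]        # battling ladder Elo

rho, p_value = spearmanr(llm_suite_score, battle_elo)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A |rho| near 1 would mean the ladder merely re-measures general LLM
# ability; a |rho| near 0 supports the "distinct capabilities" reading.
```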
Editorial Opinion
The PokeAgent Challenge represents a thoughtful approach to benchmark design by leveraging a complex, game-theoretic domain that naturally tests capabilities frontier AI models struggle with—partial observability and long-horizon reasoning. By creating a dual-track system and validating it through a large-scale competition, the researchers have built credibility and community engagement. The finding that Pokemon battling is orthogonal to standard LLM benchmarks highlights a critical blind spot in current AI evaluation and positions this as a genuinely valuable resource for advancing decision-making research beyond language-centric metrics.


