BrowseComp-Plus: New Benchmark for Fair, Transparent Evaluation of Deep-Research Agents
Key Takeaways
- Controlled evaluation against a fixed corpus of roughly 100K human-verified documents removes live-web variability, making comparisons across deep-research agents reproducible and fair
- Separates the retriever's contribution from the LLM agent's, so different retrieval methods can be compared systematically with the same agent
- Evaluates leading systems from OpenAI, Anthropic, Google (Gemini), Qwen, and other providers on a standardized benchmark
- Complete open-source release with datasets, code, pre-built indexes, and documentation supports reproducibility and community contributions
Summary
BrowseComp-Plus is a new open-source benchmark for fair, reproducible evaluation of deep-research agents (AI systems that combine information retrieval with LLM reasoning). Instead of relying on live web search, as the original BrowseComp does, BrowseComp-Plus evaluates agents against a fixed, curated corpus of approximately 100,000 human-verified documents, giving researchers complete control over the retrieval environment.
The benchmark takes its reasoning-intensive queries from OpenAI's BrowseComp but replaces live search with a controlled retrieval setting. Researchers can therefore pair different retrievers with the same LLM agent and see which component is responsible for a given performance difference. This addresses a methodological gap: with live search, results change as the web changes, and the retriever's contribution cannot be separated from the agent's.
The release includes an evaluation framework, a public leaderboard, a research paper, and project documentation. It evaluates deep-research agents built on models from OpenAI, Anthropic, Google (Gemini), Qwen, and others. The full dataset, code, pre-built indexes for BM25 and Qwen embeddings, and detailed guides are available on Hugging Face, so researchers can reproduce results, integrate custom retrievers, or evaluate their own implementations.
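To make the idea of isolating the retriever concrete, here is a minimal sketch in Python. It is not the BrowseComp-Plus code or API: the toy corpus, the queries, the gold labels, and every function name (`overlap_retriever`, `make_bm25_retriever`, `fixed_agent`, `accuracy`) are hypothetical, and the "agent" is a stub that simply answers with the top retrieved document. The point is only the experimental shape: the corpus and the agent stay fixed while the retriever is swapped.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Toy stand-in for the fixed corpus (hypothetical data, not the benchmark's).
CORPUS: Dict[str, str] = {
    "d1": "the eiffel tower was completed in 1889 in paris",
    "d2": "the golden gate bridge opened in 1937 in san francisco",
    "d3": "paris is the capital of france",
    "d4": "the tower bridge in london opened in 1894",
}

# Hypothetical queries, each labeled with its gold evidence document.
QUERIES = [
    {"text": "when was the eiffel tower completed", "gold": "d1"},
    {"text": "when did the golden gate bridge open", "gold": "d2"},
]

Retriever = Callable[[str, int], List[str]]


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def overlap_retriever(query: str, k: int) -> List[str]:
    """Baseline: rank documents by the number of distinct shared terms."""
    q_terms = set(tokenize(query))
    scores = {d: len(q_terms & set(tokenize(t))) for d, t in CORPUS.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def make_bm25_retriever(k1: float = 1.5, b: float = 0.75) -> Retriever:
    """A small from-scratch BM25 scorer over the same fixed corpus."""
    docs = {d: tokenize(t) for d, t in CORPUS.items()}
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in docs.values() for term in set(toks))

    def retrieve(query: str, k: int) -> List[str]:
        scores = {}
        for doc_id, toks in docs.items():
            tf = Counter(toks)
            score = 0.0
            for term in tokenize(query):
                if term not in tf:
                    continue
                idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
                score += idf * tf[term] * (k1 + 1) / norm
            scores[doc_id] = score
        return sorted(scores, key=lambda d: (-scores[d], d))[:k]

    return retrieve


def fixed_agent(query: str, retriever: Retriever) -> str:
    """Stub agent, identical in every run: answers with the top document id."""
    return retriever(query, 1)[0]


def accuracy(retriever: Retriever) -> float:
    hits = sum(fixed_agent(q["text"], retriever) == q["gold"] for q in QUERIES)
    return hits / len(QUERIES)


RETRIEVERS: Dict[str, Retriever] = {
    "overlap": overlap_retriever,
    "bm25": make_bm25_retriever(),
}

# Same corpus, same agent, different retrievers: any gap is due to retrieval.
results = {name: accuracy(r) for name, r in RETRIEVERS.items()}
print(results)
```

Because everything except the retriever is held constant, any difference between the entries of `results` can be attributed to retrieval alone. On a live web search backend, the same comparison would be confounded by index changes between runs.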
Editorial Opinion
BrowseComp-Plus is a meaningful step toward rigorous evaluation of deep-research agents. Moving from live web search to a controlled corpus is a pragmatic trade: it sacrifices some real-world fidelity, but it enables the systematic comparison the field needs as these systems grow more complex and move toward deployment. The open-source release adds to its value by letting the community extend and iterate on this foundation.



