NVIDIA AI-Q Achieves Top Performance on DeepResearch Benchmarks I and II
Key Takeaways
- NVIDIA AI-Q ranked #1 on both DeepResearch Bench I (55.95) and DeepResearch Bench II (54.50), two benchmarks for evaluating deep research agents
- The system uses a multi-agent architecture with planner, researcher, and orchestrator components, built on the NVIDIA NeMo Agent Toolkit and Nemotron 3 LLMs
- AI-Q is fully open and modular, so enterprises can own, inspect, and customize the architecture for their specific use cases
Summary
NVIDIA's AI-Q deep research agent has reached the #1 ranking on both DeepResearch Bench (Bench I, 55.95) and DeepResearch Bench II (54.50), the leading benchmarks for evaluating deep research agents. The result shows that an open, portable, and developer-accessible architecture can deliver state-of-the-art agentic research capabilities.
AI-Q is an open blueprint for building AI agents that reason over enterprise and web data to generate well-cited research responses. Its architecture is fully modular, so enterprises can own, inspect, and customize it for specific use cases. It uses a multi-agent design with three roles (a planner, a researcher, and an orchestrator), all built on NVIDIA's NeMo Agent Toolkit and powered by fine-tuned Nemotron 3 Super models, with optional ensemble and report refinement capabilities.
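As a rough illustration of the planner, researcher, and orchestrator split described above, the Python sketch below wires three plain functions together. It is not AI-Q's code and does not use the NeMo Agent Toolkit API. Every name in it (`Finding`, `plan`, `research`, `orchestrate`, `fake_search`) is hypothetical, and the hard-coded planner and stub search stand in for the LLM and retrieval calls a real agent would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A search function maps a query to (snippet, source) pairs. Hypothetical interface.
SearchFn = Callable[[str], List[Tuple[str, str]]]


@dataclass
class Finding:
    question: str
    answer: str
    source: str


def plan(topic: str) -> List[str]:
    """Planner: split the topic into sub-questions (stand-in for an LLM call)."""
    return [
        f"What is the background of {topic}?",
        f"What are the open problems in {topic}?",
    ]


def research(question: str, search: SearchFn) -> Finding:
    """Researcher: answer one sub-question using the first search hit."""
    hits = search(question)
    if not hits:
        return Finding(question, "No evidence found.", "n/a")
    snippet, source = hits[0]
    return Finding(question, snippet, source)


def orchestrate(topic: str, search: SearchFn) -> str:
    """Orchestrator: run the plan, collect findings, and assemble a cited report."""
    findings = [research(question, search) for question in plan(topic)]
    lines = [f"Report: {topic}"]
    for i, finding in enumerate(findings, start=1):
        lines.append(f"{i}. {finding.question} {finding.answer} [{i}]")
    lines.append("Sources:")
    for i, finding in enumerate(findings, start=1):
        lines.append(f"[{i}] {finding.source}")
    return "\n".join(lines)


def fake_search(query: str) -> List[Tuple[str, str]]:
    """Stub retrieval backend so the sketch runs without network access."""
    return [(f"Stub answer for: {query}", "https://example.com/doc")]


if __name__ == "__main__":
    print(orchestrate("solid-state batteries", fake_search))
```

Keeping each role behind a small function boundary is what makes this kind of design easy to inspect and swap: a different planner or retrieval backend can be dropped in without touching the other two roles.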
Leading both benchmarks at once is significant because they evaluate research agents in different, complementary ways. DeepResearch Bench I measures report quality along four dimensions: comprehensiveness, depth of insight, instruction-following, and readability. DeepResearch Bench II uses 70+ fine-grained binary rubrics to assess information retrieval, analysis and synthesis, and presentation clarity. Taken together, the two results indicate that AI-Q both produces polished, well-structured reports and retrieves and reasons over information with fine-grained factual correctness.
- Leading both benchmarks indicates the system performs well on report quality and on fine-grained factual correctness
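The two evaluation styles described above can be contrasted with a small Python sketch. It assumes, for illustration only, that a binary-rubric score is the percentage of rubrics passed and that a report-quality score is a weighted average over the four named dimensions. The benchmarks' actual aggregation formulas and weights are not given here, so the function names, the equal weights, and the sample values are all hypothetical.

```python
from typing import Dict, List


def rubric_pass_rate(results: List[bool]) -> float:
    """Binary-rubric style: percentage of pass/fail checks that passed."""
    if not results:
        raise ValueError("at least one rubric result is required")
    return 100.0 * sum(results) / len(results)


def weighted_quality_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Dimension style: weighted average of per-dimension scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


if __name__ == "__main__":
    # Hypothetical rubric outcomes for one report: three of four checks pass.
    print(rubric_pass_rate([True, False, True, True]))

    # Hypothetical per-dimension scores with equal (assumed) weights.
    dimensions = {
        "comprehensiveness": 60.0,
        "insight": 50.0,
        "instruction_following": 70.0,
        "readability": 40.0,
    }
    equal_weights = {name: 1.0 for name in dimensions}
    print(weighted_quality_score(dimensions, equal_weights))
```

The contrast is the point: a pass-rate score rewards getting many small, checkable facts right, while a weighted dimension score rewards overall report quality, which is why a system can rank differently on the two.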
Editorial Opinion
NVIDIA's sweep of both DeepResearch benchmarks validates an important design philosophy: that open, modular architectures using accessible models and tooling can achieve state-of-the-art agentic AI performance. The combination of transparent reasoning (through multi-step planning), specialized agent roles, and fine-tuned models represents a compelling alternative to closed, monolithic systems. This result could accelerate enterprise adoption of AI research agents by demonstrating that transparency and customization don't require sacrificing performance.


