Comprehensive Benchmark of 16 AI Models on 9,000+ Real Documents Reveals Surprising Performance Insights
Key Takeaways
- Gemini 3.1 Pro significantly outperforms other models on document VQA tasks (scoring 85 vs. GPT-5.4's 78.2), but this advantage doesn't extend uniformly across all document AI tasks
- Cheaper models (Sonnet 4.6, Gemini-3 Flash) demonstrate competitive or superior performance on extraction tasks compared to more expensive alternatives, suggesting cost-effective options exist for many real-world use cases
- The interactive Results Explorer and 1v1 comparison tool provide transparency into model failures and hallucinations on actual documents, enabling users to make informed decisions based on their specific use cases rather than aggregate scores
- No single model excels across all benchmarks: the #7-ranked model outperforms #1 on certain tasks, underscoring the importance of task-specific evaluation over generic leaderboard rankings
Summary
A detailed analysis of 16 AI models tested on over 9,000 real documents has revealed nuanced performance differences across document understanding tasks. The researchers created the Intelligent Document Processing (IDP) Leaderboard with three benchmarks (OlmOCR Bench, OmniDocBench, and IDP Core) measuring critical capabilities such as OCR, table extraction, key information extraction, visual QA, and long document understanding. The findings challenge conventional wisdom: Google's Gemini 3.1 Pro dominates visual question-answering tasks with a score of 85, while, surprisingly, cheaper models like Claude Sonnet 4.6 match or exceed their more expensive counterparts on extraction tasks. The research also introduces an interactive Results Explorer that lets practitioners compare model outputs on actual documents rather than relying on a single composite score.
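The practical upshot is that model selection should be driven by per-task scores rather than by a single composite ranking. Below is a minimal sketch of that selection logic; the model names, task names, and score values are hypothetical placeholders, not figures from the leaderboard.

```python
# Illustrative only: choose the best model per document task from per-task
# benchmark scores instead of a single composite ranking. All model names,
# tasks, and scores below are hypothetical placeholders.
from collections import defaultdict

scores = {
    "model-a": {"ocr": 91.0, "table_extraction": 84.5, "vqa": 85.0, "kie": 80.1},
    "model-b": {"ocr": 93.2, "table_extraction": 79.0, "vqa": 78.2, "kie": 82.4},
    "model-c": {"ocr": 90.5, "table_extraction": 86.1, "vqa": 74.9, "kie": 83.0},
}

def best_model_per_task(scores):
    """Return the top-scoring model for each task."""
    per_task = defaultdict(dict)
    for model, task_scores in scores.items():
        for task, score in task_scores.items():
            per_task[task][model] = score
    return {task: max(models, key=models.get) for task, models in per_task.items()}

if __name__ == "__main__":
    for task, model in best_model_per_task(scores).items():
        print(f"{task}: {model}")
```

A routing layer like this can sit in front of a document pipeline so that, for example, VQA requests and table-extraction requests go to different models based on measured strengths.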
Editorial Opinion
This research demonstrates the critical importance of task-specific benchmarking over generic leaderboard scores in AI evaluation. By testing on 9,000+ real documents and providing interactive comparison tools, the researchers have created a resource that acknowledges a fundamental truth: different models excel at different aspects of document understanding, and practitioners need transparent, hands-on visibility into actual performance rather than single composite metrics. The finding that cheaper models match expensive ones on extraction tasks could have significant implications for cost optimization in document processing pipelines.
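One way to act on that cost-optimization point is to pick the cheapest model whose score on the relevant task falls within an acceptable margin of the best performer. The sketch below illustrates the idea with made-up model names, scores, and prices; none of the values come from the benchmark.

```python
# Illustrative only: pick the cheapest model whose task score is within a
# tolerance of the best score. All names, scores, and prices are made up.
candidates = [
    # (model name, extraction score, USD per 1M tokens)
    ("expensive-model", 88.0, 15.00),
    ("mid-tier-model", 87.4, 3.00),
    ("budget-model", 86.9, 0.40),
]

def cheapest_within_tolerance(candidates, tolerance=1.5):
    """Return the lowest-cost candidate scoring within `tolerance` points of the best."""
    best_score = max(score for _, score, _ in candidates)
    eligible = [c for c in candidates if best_score - c[1] <= tolerance]
    return min(eligible, key=lambda c: c[2])

if __name__ == "__main__":
    name, score, price = cheapest_within_tolerance(candidates)
    print(f"Selected {name}: score {score}, ${price:.2f} per 1M tokens")
```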


