New Analysis Reveals Google's AI Overviews Generates Tens of Millions of Incorrect Answers Daily
Key Takeaways
- AI Overviews achieves roughly 90% accuracy on the SimpleQA benchmark, but generates tens of millions of incorrect answers daily when scaled across all Google searches
- Google uses multiple models dynamically, often defaulting to faster but less accurate Gemini Flash variants to maintain search performance
- Google contests the SimpleQA evaluation methodology, claiming it contains inaccurate information and doesn't reflect typical user searches
Summary
A new accuracy assessment of Google's AI Overviews, conducted by The New York Times in partnership with the startup Oumi, found that the Gemini-powered search feature answers questions correctly roughly 90 percent of the time. Using OpenAI's SimpleQA benchmark, the analysis showed AI Overviews reaching 91 percent accuracy with Gemini 3, up from 85 percent with the earlier Gemini 2.5 model. Extrapolated across all Google searches, however, even a 9 percent error rate translates to tens of millions of incorrect answers per day.
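The extrapolation above is simple multiplication. A minimal back-of-envelope sketch makes the scale effect concrete; the daily-volume figure below is an assumption chosen for illustration, not a number from the article, and only the roughly 9 percent error rate comes from the analysis:

```python
# Back-of-envelope: a small error rate at huge scale still yields an
# enormous absolute number of wrong answers.
# ASSUMPTION: AI Overviews shown ~500 million times per day (illustrative only).
overview_queries_per_day = 5e8
error_rate = 0.09  # from the analysis: ~9% of answers incorrect

wrong_per_day = overview_queries_per_day * error_rate
print(f"{wrong_per_day:,.0f} incorrect answers per day")
# → 45,000,000 incorrect answers per day
```

Varying the assumed query volume shifts the total, but any plausible figure for Google's traffic keeps the result in the tens of millions.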
The study documented specific examples of AI Overviews' failures, including confidently providing wrong dates for historical facts and contradicting verified information from authoritative sources like Wikipedia and official organization websites. Google disputed the findings, arguing that SimpleQA contains inaccuracies and doesn't reflect real-world search behavior, and noting that it prefers its own SimpleQA Verified benchmark with a more limited, vetted question set.
Google also revealed that AI Overviews doesn't use a single model but dynamically selects an appropriate model for each query, often defaulting to faster (and less accurate) Gemini Flash models to preserve search speed rather than always using the more capable Gemini 3.1 Pro. This trade-off between speed and accuracy underscores the engineering challenges of deploying AI at Google's massive scale.
Model evaluation in generative AI remains inconsistent across the industry, with companies preferring different benchmarks and metrics to demonstrate performance.
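The per-query routing described above can be sketched as a simple dispatcher. This is a hypothetical illustration, not Google's actual logic; the model names and the complexity heuristic are invented for the sketch:

```python
# Hypothetical per-query model router: default to a fast, cheaper model
# and escalate only queries that look complex, trading some accuracy
# for lower latency on the common case.

def route_model(query: str) -> str:
    """Return which (invented) model tier would handle this query."""
    is_complex = len(query.split()) > 12  # toy complexity signal: long queries
    return "strong-model" if is_complex else "fast-model"

print(route_model("capital of France"))  # fast-model
```

A real system would weigh far richer signals (query ambiguity, topic risk, cache hits), but the shape of the trade-off is the same: the default path favors speed.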
Editorial Opinion
While 90 percent accuracy might sound acceptable in isolation, the scale of Google's search operations means even single-digit error rates translate to millions of false claims reaching users daily. The tension between speed and accuracy reveals a fundamental trade-off in deployed AI systems: Google's choice to use cheaper, faster models for most queries suggests the company prioritizes responsiveness over factual reliability. Google's dismissal of SimpleQA as a flawed benchmark feels defensive; whatever the test's limitations, the core issue remains that AI Overviews confidently presents false information to users who may treat it as authoritative.