Leading LLMs Fail on Local Search: Even Best Performers Get 1 in 12 Queries Wrong, Study Finds
Key Takeaways
- OpenAI's GPT leads the benchmark by 4.3 points, but even the best performer fails on ~8% of local search queries, recommending closed, fabricated, or incorrectly located businesses
- Claude hallucinates 20% of recommended places when operating without search; models universally fail to detect permanently closed businesses despite confident recommendations
- Web search paradox: enabling search improves discovery tasks (+10-21 points) but hurts transactional tasks like booking (a 5+ point drop for Claude and Gemini), as models get distracted by retrieved facts instead of giving actionable guidance
Summary
A comprehensive benchmark tested four leading large language models on 345 real-world local search queries across 50+ cities and revealed significant accuracy gaps in systems that can pass bar exams and write poetry. The study, which evaluated Claude, GPT, Gemini, and Perplexity with and without web search capabilities, found that even the top performer recommends non-existent, permanently closed, or incorrectly located businesses 8% of the time, roughly 1 in 12 queries.
The research uncovered critical failure patterns that pose real risks for AI-powered applications. Without search enabled, Claude fabricates 20% of recommended places entirely, while even Perplexity—a search-native product—maintains a 12% failure rate. Perhaps most concerning is the "permanently closed blind spot": all tested models cheerfully provided booking guidance for a shuttered Buenos Aires restaurant, demonstrating that current LLMs struggle to detect when businesses have closed.
The study also revealed a counterintuitive finding: enabling web search sometimes makes models worse, particularly on transactional tasks like booking reservations. Claude and Gemini lost 5+ points in performance when search was enabled, as models became distracted by retrieved content rather than providing actionable guidance. OpenAI emerged as the strongest performer overall, scoring 90+ on 7 of 10 task categories, while Perplexity scored below 70 on nearly 25% of queries.
- Each LLM has distinct strengths by task type—no single provider excels across all categories, making worst-case performance rather than average metrics the true differentiator for production applications
- The benchmark reveals that LLM capabilities in complex reasoning (bar exams, poetry) don't translate to reliability in grounded, real-world factual retrieval tasks critical for consumer applications
Editorial Opinion
This benchmark exposes a critical gap between LLM hype and real-world reliability. While models excel at abstract reasoning tasks like bar exams, their poor performance on local search—a seemingly simpler, high-stakes task—raises serious questions about deploying these systems in customer-facing applications without verification layers. The finding that web search sometimes makes models worse is particularly telling: it suggests that current LLM architectures struggle to integrate external information thoughtfully, a fundamental problem for any AI system claiming to address hallucination. Products built on these APIs must implement robust fact-checking guardrails, not just rely on model improvements.
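The guardrail the editorial calls for can be sketched as a thin verification layer: before an LLM-recommended business reaches the user, check it against an authoritative places source and block anything unverifiable or closed. The sketch below is illustrative only and not from the study; the in-memory directory, the place names, and the `verify_recommendation` function are hypothetical stand-ins for a call to a real places API.

```python
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    address: str
    permanently_closed: bool

# Hypothetical in-memory directory standing in for a real places lookup
# service; all entries are illustrative, not real businesses.
DIRECTORY = {
    "cafe aurora": Place("Cafe Aurora", "12 Hill St", False),
    "old mill diner": Place("Old Mill Diner", "3 River Rd", True),
}

def verify_recommendation(name: str) -> tuple[bool, str]:
    """Gate an LLM-recommended business before surfacing it.

    Returns (ok, reason). Rejects places that cannot be found
    (possible hallucination) and places flagged permanently closed,
    the two failure modes the benchmark highlights.
    """
    place = DIRECTORY.get(name.strip().lower())
    if place is None:
        return False, "not found: possible hallucination"
    if place.permanently_closed:
        return False, "permanently closed"
    return True, "verified open"

if __name__ == "__main__":
    for rec in ["Cafe Aurora", "Old Mill Diner", "Blue Fern Bistro"]:
        ok, reason = verify_recommendation(rec)
        print(f"{rec}: {'PASS' if ok else 'BLOCK'} ({reason})")
```

In production the dictionary lookup would be replaced by a live places query, but the shape of the guardrail is the same: the model's output is treated as an unverified claim until an external source confirms the business exists and is open.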

