New Hallucination Taxonomy Reveals Why LLMs Fail at Counting: GPT Avoids Tasks, Gemini Confabulates, Claude Hides Its Reasoning
Key Takeaways
- ▸LLMs consistently hallucinate when counting, but three distinct failure patterns emerge: Gemini overcounts (confabulation), ChatGPT abandons tasks (avoidance), and Claude maintains accuracy but obscures its reasoning (process-opacity)
- ▸Claude Sonnet outperforms competitors on raw counting accuracy but exhibits hallucinations that are harder to detect and audit than those of other models
- ▸The Knowledge Innovation System (KIS) protocol effectively eliminates hallucinations across all tested models, achieving 100% accuracy by structurally separating counting, verification, and reporting phases
Summary
A comprehensive quantitative study by researcher Hiroyasu Hasegawa reveals that large language models fundamentally struggle with counting tasks, but in three distinctly different ways. The research tested GPT-5.3 Instant, Gemini 3 Flash, and Claude Sonnet 4.6 on datasets ranging from 200 to 2,000 items, uncovering what the author terms a three-type hallucination taxonomy: Confabulation Type (Gemini), Avoidance Type (ChatGPT), and Process-Opaque Type (Claude). Notably, Claude maintained perfect accuracy up to 2,000 items without any external protocol, while, under baseline conditions, Gemini overcounted by 38 items and ChatGPT abandoned the task entirely beyond 800 items.
The research introduces KIS (Knowledge Innovation System), a structured protocol that acts as an external scaffold to mitigate these hallucinations by separating counting, verification, and reporting phases. When KIS was applied, all three models achieved 100% accuracy, and the hybrid approach of KIS combined with Chain-of-Thought prompting recovered ChatGPT's performance. However, the study also reveals a concerning counter-effect: applying Chain-of-Thought prompting alone to ChatGPT actually induced distribution fabrication even at small scales (200 items), suggesting that standard prompt engineering techniques can backfire.
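The paper describes KIS only at the level of its three phases, so the following Python sketch is an assumption about how such a scaffold could be wired rather than the authors' implementation: counting and verification are handled by deterministic code, and the language model (represented here by the hypothetical `summarize` callable) is confined to narrating numbers that have already been checked.

```python
# Hypothetical sketch of a KIS-style scaffold (not the paper's code).
# The point it illustrates: counting, verification, and reporting are
# structurally separated, and the LLM only participates in the last phase.
from collections import Counter
from typing import Callable, Mapping, Sequence


def count_phase(items: Sequence[str]) -> Counter:
    """Phase 1: deterministic counting, no LLM involvement."""
    return Counter(items)


def verify_phase(counts: Mapping[str, int], items: Sequence[str]) -> None:
    """Phase 2: independent checks that the counts reconcile with the data."""
    assert sum(counts.values()) == len(items), "category totals must equal item total"
    assert all(v >= 0 for v in counts.values()), "counts must be non-negative"


def report_phase(counts: Mapping[str, int], summarize: Callable[[str], str]) -> str:
    """Phase 3: the LLM narrates numbers that were already verified."""
    table = "\n".join(f"{label}: {n}" for label, n in sorted(counts.items()))
    return summarize(f"Write a short summary of these verified counts:\n{table}")


def kis_pipeline(items: Sequence[str], summarize: Callable[[str], str]) -> str:
    counts = count_phase(items)
    verify_phase(counts, items)
    return report_phase(counts, summarize)
```

With this structure, a confabulated total cannot reach the final report, and a dropped or abandoned count fails loudly at the verification step instead of silently propagating.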
These findings have significant implications for deploying LLMs in production environments where numerical accuracy is critical. While Claude's perfect accuracy without a protocol is noteworthy, the paper raises questions about its Process-Opaque hallucinations: failures that are hard to detect and audit because the model does not expose its intermediate reasoning. The research argues that current LLMs should not be trusted for high-stakes quantitative analysis without external verification systems like KIS, and that developers need frameworks to understand and mitigate model-specific failure modes.
- Chain-of-Thought prompting alone can worsen performance in some models (ChatGPT), making prompt engineering techniques unreliable without careful validation (see the sketch after this list)
- LLMs are fundamentally unreliable for mission-critical numerical tasks and require external scaffolding or verification systems for production deployment
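Neither point prescribes a specific mechanism, so the sketch below is a hypothetical illustration of what minimal external validation could look like: replay each candidate model or prompt variant (Chain-of-Thought included) against datasets whose sizes are known, and record overcounts, undercounts, and abandonments before anything reaches production. The `ask_model_for_count` callable is a placeholder for whatever client call a deployment actually uses, not an API from the paper.

```python
# Minimal validation harness sketch, under the assumption that
# `ask_model_for_count` wraps the LLM call and prompt variant being tested.
from typing import Callable, Optional, Sequence


def audit_counting(
    ask_model_for_count: Callable[[Sequence[str]], Optional[int]],
    datasets: Sequence[Sequence[str]],
) -> list[dict]:
    """Replay a model/prompt configuration against datasets of known size."""
    results = []
    for data in datasets:
        truth = len(data)
        reported = ask_model_for_count(data)  # None represents task abandonment
        results.append({
            "ground_truth": truth,
            "reported": reported,
            "abandoned": reported is None,
            "error": None if reported is None else reported - truth,
        })
    return results
```

In terms of the study's taxonomy, a positive error corresponds to the overcounting observed in Gemini, while abandoned runs flag the avoidance pattern seen in ChatGPT.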
Editorial Opinion
This research exposes a critical vulnerability in models we're increasingly deploying in high-stakes applications: they cannot reliably count. The three-type hallucination taxonomy is a valuable contribution, as it acknowledges that different models fail in different ways—some catastrophically (ChatGPT abandoning tasks), others insidiously (Claude hiding its reasoning process). While the KIS protocol shows promise, the deeper concern is that we lack reliable ways to audit LLM behavior in numerical and analytical domains. Organizations building systems dependent on accurate counting, financial calculations, or data aggregation must implement external verification mechanisms rather than relying on model accuracy guarantees.


