GPT-5.5 Hallucinates at 6x the Rate of Anthropic Models on Degraded Documents, New Benchmark Shows
Key Takeaways
- GPT-5.5 hallucinates numeric values at roughly 6x the rate of Opus 4.7 on degraded documents (2.6–6.5x depending on reasoning effort level)
- OpenAI models produce internally consistent but fabricated data that defeats simple validation checks; Anthropic models return null when uncertain
- GPT-5.5 shows no improvement over GPT-5.4 at higher reasoning effort levels, with hallucination rates remaining flat or worsening
- Standard exact-match validation cannot catch these hallucinations; adversarial testing against ground truth is required for detection
Summary
A new independent benchmark study reveals that OpenAI's GPT-5.5 model fabricates numeric values at 2.6 to 6.5 times the rate of Anthropic's Opus 4.7 and Sonnet 4.6 models when processing visually degraded documents. The research, which tested five models on 148 real-world insurance and financial documents with rendering issues, found a critical difference in failure modes: while Anthropic's models return null values when unable to confidently extract data, OpenAI's models generate plausible but incorrect numeric values that maintain internal consistency, making them harder to detect through standard validation checks.
The hallucination problem is particularly pronounced in GPT-5.5 (released April 23), which shows no improvement over GPT-5.4 even when running at higher reasoning effort levels. In one example, GPT-5.5 reported revenue of $405.86M for a $95M financial statement—off by over $310M—while maintaining perfect internal consistency across all income statement line items. This internal consistency is the critical issue: the fabricated numbers align with each other mathematically, defeating simple schema and arithmetic validation approaches.
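To see why arithmetic validation fails here, consider a minimal sketch (the field names and dollar figures below are hypothetical illustrations, not data from the benchmark): a fabricated income statement whose line items reconcile perfectly will pass any check that only verifies internal arithmetic.

```python
# Illustrative sketch only: field names and figures are hypothetical,
# not data from the benchmark. An arithmetic validator confirms that
# line items reconcile with one another -- a check an internally
# consistent hallucination passes by construction.

def arithmetic_checks_pass(stmt: dict) -> bool:
    """Return True if the income statement line items reconcile."""
    return (
        stmt["gross_profit"] == stmt["revenue"] - stmt["cost_of_revenue"]
        and stmt["operating_income"] == stmt["gross_profit"] - stmt["operating_expenses"]
    )

# A fabricated extraction: every number is wrong versus the source
# document, yet the figures reconcile, so the validator passes.
fabricated = {
    "revenue": 405_860_000,          # ground truth was ~95,000,000
    "cost_of_revenue": 243_520_000,
    "gross_profit": 162_340_000,     # = revenue - cost_of_revenue
    "operating_expenses": 101_460_000,
    "operating_income": 60_880_000,  # = gross_profit - operating_expenses
}

print(arithmetic_checks_pass(fabricated))  # True, despite being fabricated
```

The only defense at this layer is comparison against an external reference, which is what the study's adversarial ground-truth setup provides.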
The benchmark also included Google's Gemini 3.1 Pro, which was notably the only model to correctly read corrupted text in some test cases. The study's central contrast, however, is behavioral: Anthropic's conservative approach of returning null when uncertain is positioned as superior to OpenAI's tendency to emit confident but false numeric outputs. The author argues that detecting these hallucinations requires adversarial ground-truth testing and paired-model comparison scoring, not standard exact-match evaluation.
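A minimal sketch of what paired-model comparison scoring could look like (everything here, including `flag_disagreements` and the sample values, is a hypothetical illustration rather than the study's actual method): compare two models' per-field extractions of the same document, route fields where either model abstained to human review, and flag large numeric disagreements as candidate hallucinations, on the premise that two independent models rarely fabricate the same wrong value.

```python
from typing import Optional

def flag_disagreements(
    a: dict[str, Optional[float]],
    b: dict[str, Optional[float]],
    rel_tol: float = 0.01,
) -> dict[str, str]:
    """Compare two models' extractions of the same document.

    Fields where either model abstained (None) are routed to human
    review; fields where the models disagree beyond rel_tol are
    flagged as candidate hallucinations.
    """
    flags = {}
    for field in sorted(a.keys() & b.keys()):
        va, vb = a[field], b[field]
        if va is None or vb is None:
            flags[field] = "needs_review"   # at least one model abstained
        elif abs(va - vb) > rel_tol * max(abs(va), abs(vb), 1.0):
            flags[field] = "disagreement"   # candidate hallucination
    return flags

# Hypothetical extractions of the same degraded statement by two models
model_a = {"revenue": 95_000_000.0, "operating_income": None}
model_b = {"revenue": 405_860_000.0, "operating_income": 60_880_000.0}
print(flag_disagreements(model_a, model_b))
# {'operating_income': 'needs_review', 'revenue': 'disagreement'}
```

Under this scheme, a conservative null from one model surfaces the field for review instead of being silently trusted, while a confident fabrication like the $405.86M revenue figure is caught by the cross-model disagreement rather than by any internal consistency check.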
Editorial Opinion
This benchmark exposes a critical reliability gap in production AI systems. While GPT-5.5's internally consistent fabrications might fool simple validation systems, Anthropic's conservative null-return approach represents a more trustworthy strategy for high-stakes applications like financial document processing. The fact that GPT-5.5 failed to improve at higher reasoning effort levels—when GPT-5.4 did—suggests OpenAI's newer model may have regressed on this critical dimension.