GPT-5.5 Hallucinates at 6x the Rate of Anthropic Models on Degraded Documents, New Benchmark Shows
Key Takeaways
- GPT-5.5 hallucinates numeric values at roughly 6x the rate of Opus 4.7 on degraded documents (2.6–6.5x depending on reasoning effort level)
- OpenAI models produce internally consistent but fabricated data that defeats simple validation checks; Anthropic models return null when uncertain
- GPT-5.5 shows no improvement over GPT-5.4 at higher reasoning effort levels, with hallucination rates remaining flat or worsening
- Standard exact-match validation cannot catch these hallucinations; adversarial testing against ground truth is required for detection
Summary
A new independent benchmark study reveals that OpenAI's GPT-5.5 model fabricates numeric values at 2.6 to 6.5 times the rate of Anthropic's Opus 4.7 and Sonnet 4.6 models when processing visually degraded documents. The research, which tested five models on 148 real-world insurance and financial documents with rendering issues, found a critical difference in failure modes: while Anthropic's models return null values when unable to confidently extract data, OpenAI's models generate plausible but incorrect numeric values that maintain internal consistency, making them harder to detect through standard validation checks.
The hallucination problem is particularly pronounced in GPT-5.5 (released April 23), which shows no improvement over GPT-5.4 even when running at higher reasoning effort levels. In one example, GPT-5.5 reported revenue of $405.86M for a $95M financial statement—off by over $310M—while maintaining perfect internal consistency across all income statement line items. This internal consistency is the critical issue: the fabricated numbers align with each other mathematically, defeating simple schema and arithmetic validation approaches.
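To see why arithmetic validation fails here, consider a minimal sketch (the field names and dollar figures below are hypothetical illustrations, not data from the benchmark): a fabricated income statement whose line items reconcile perfectly will pass any check that only verifies internal arithmetic.

```python
# Illustrative sketch only: field names and figures are hypothetical,
# not data from the benchmark. An arithmetic validator confirms that
# line items reconcile with one another -- a check an internally
# consistent hallucination passes by construction.

def arithmetic_checks_pass(stmt: dict) -> bool:
    """Return True if the income statement line items reconcile."""
    return (
        stmt["gross_profit"] == stmt["revenue"] - stmt["cost_of_revenue"]
        and stmt["operating_income"] == stmt["gross_profit"] - stmt["operating_expenses"]
    )

# A fabricated extraction: every number is wrong versus the source
# document, yet the figures reconcile, so the validator passes.
fabricated = {
    "revenue": 405_860_000,          # ground truth was ~95,000,000
    "cost_of_revenue": 243_520_000,
    "gross_profit": 162_340_000,     # = revenue - cost_of_revenue
    "operating_expenses": 101_460_000,
    "operating_income": 60_880_000,  # = gross_profit - operating_expenses
}

print(arithmetic_checks_pass(fabricated))  # True, despite being fabricated
```

The only defense at this layer is comparison against an external reference, which is what the study's adversarial ground-truth setup provides.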
The benchmark also included Google's Gemini 3.1 Pro, which was notably the only model to correctly read corrupted text in some test cases. The study's central contrast, however, is behavioral: Anthropic's conservative approach of returning null when uncertain is positioned as superior to OpenAI's tendency to emit confident but false numeric outputs. The author argues that detecting these hallucinations requires adversarial ground-truth testing and paired-model comparison scoring, not standard exact-match evaluation.
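A minimal sketch of what paired-model comparison scoring could look like (everything here, including `flag_disagreements` and the sample values, is a hypothetical illustration rather than the study's actual method): compare two models' per-field extractions of the same document, route fields where either model abstained to human review, and flag large numeric disagreements as candidate hallucinations, on the premise that two independent models rarely fabricate the same wrong value.

```python
from typing import Optional

def flag_disagreements(
    a: dict[str, Optional[float]],
    b: dict[str, Optional[float]],
    rel_tol: float = 0.01,
) -> dict[str, str]:
    """Compare two models' extractions of the same document.

    Fields where either model abstained (None) are routed to human
    review; fields where the models disagree beyond rel_tol are
    flagged as candidate hallucinations.
    """
    flags = {}
    for field in sorted(a.keys() & b.keys()):
        va, vb = a[field], b[field]
        if va is None or vb is None:
            flags[field] = "needs_review"   # at least one model abstained
        elif abs(va - vb) > rel_tol * max(abs(va), abs(vb), 1.0):
            flags[field] = "disagreement"   # candidate hallucination
    return flags

# Hypothetical extractions of the same degraded statement by two models
model_a = {"revenue": 95_000_000.0, "operating_income": None}
model_b = {"revenue": 405_860_000.0, "operating_income": 60_880_000.0}
print(flag_disagreements(model_a, model_b))
# {'operating_income': 'needs_review', 'revenue': 'disagreement'}
```

Under this scheme, a conservative null from one model surfaces the field for review instead of being silently trusted, while a confident fabrication like the $405.86M revenue figure is caught by the cross-model disagreement rather than by any internal consistency check.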
Editorial Opinion
This benchmark exposes a critical reliability gap in production AI systems. While GPT-5.5's internally consistent fabrications might fool simple validation systems, Anthropic's conservative null-return approach represents a more trustworthy strategy for high-stakes applications like financial document processing. The fact that GPT-5.5 failed to improve at higher reasoning effort levels—when GPT-5.4 did—suggests OpenAI's newer model may have regressed on this critical dimension.