BotBeat

OpenAI · RESEARCH · 2026-04-27

GPT-5.5 Hallucinates at 6x the Rate of Anthropic Models on Degraded Documents, New Benchmark Shows

Key Takeaways

  • GPT-5.5 hallucinates numeric values at 6x the rate of Opus 4.7 on degraded documents (2.6–6.5x across different effort levels)
  • OpenAI models produce internally consistent but fabricated data that defeats simple validation checks; Anthropic models return null when uncertain
  • GPT-5.5 shows no improvement over GPT-5.4 at higher reasoning effort levels, with hallucination rates remaining flat or worsening
Source: Hacker News (https://aginor.ai/extraction-test/)

Summary

A new independent benchmark study reveals that OpenAI's GPT-5.5 model fabricates numeric values at 2.6 to 6.5 times the rate of Anthropic's Opus 4.7 and Sonnet 4.6 models when processing visually degraded documents. The research, which tested five models on 148 real-world insurance and financial documents with rendering issues, found a critical difference in failure modes: while Anthropic's models return null values when unable to confidently extract data, OpenAI's models generate plausible but incorrect numeric values that maintain internal consistency, making them harder to detect through standard validation checks.

The hallucination problem is particularly pronounced in GPT-5.5 (released April 23), which shows no improvement over GPT-5.4 even when running at higher reasoning effort levels. In one example, GPT-5.5 reported revenue of $405.86M for a $95M financial statement—off by over $310M—while maintaining perfect internal consistency across all income statement line items. This internal consistency is the critical issue: the fabricated numbers align with each other mathematically, defeating simple schema and arithmetic validation approaches.
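To make the failure mode concrete, here is a minimal sketch of the kind of arithmetic validation the article says these fabrications defeat. The field names and dollar figures besides the reported $405.86M/$95M revenue pair are invented for illustration; the study does not publish its extraction schema.

```python
def arithmetic_check(stmt: dict, tol: float = 0.01) -> bool:
    """Pass if income-statement line items are mutually consistent."""
    gross = stmt["revenue"] - stmt["cost_of_revenue"]
    operating = gross - stmt["operating_expenses"]
    return (abs(gross - stmt["gross_profit"]) <= tol
            and abs(operating - stmt["operating_income"]) <= tol)

# A fabricated extraction can be internally consistent and still wrong:
fabricated = {
    "revenue": 405.86,          # true value was ~95 (in $M), per the article
    "cost_of_revenue": 180.00,  # remaining line items are hypothetical
    "gross_profit": 225.86,
    "operating_expenses": 120.00,
    "operating_income": 105.86,
}
print(arithmetic_check(fabricated))  # True: the check never sees the document
```

Because every line item was fabricated together, the arithmetic closes perfectly; the validator can only confirm the numbers agree with each other, not with the source document.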

The benchmark included Google's Gemini 3.1 Pro, which notably was the only model to correctly read corrupted text in some test cases. However, the study highlights a fundamental difference in model behavior: Anthropic's conservative approach of returning null when uncertain is positioned as superior to OpenAI's tendency to generate confident but false numeric outputs. The author argues that detecting these hallucinations requires adversarial ground truth testing and paired-model comparison scoring, not standard exact-match evaluation.
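The paired-model comparison the author advocates might look something like the following sketch, which treats a null as an explicit "uncertain" signal and routes confident disagreements to human review. The tolerance and routing policy are assumptions for illustration, not details from the study.

```python
def compare_field(value_a, value_b, rel_tol: float = 0.005) -> str:
    """Score one extracted field from two independent models."""
    if value_a is None or value_b is None:
        return "review"   # a conservative null routes the field to a human
    if abs(value_a - value_b) <= rel_tol * max(abs(value_a), abs(value_b), 1.0):
        return "accept"
    return "review"       # confident disagreement is the hallucination signal

print(compare_field(95.0, 95.2))    # accept: within tolerance
print(compare_field(405.86, 95.0))  # review: models disagree sharply
print(compare_field(None, 95.0))    # review: one model abstained
```

The design choice mirrors the article's argument: a model that abstains (returns null) produces a cheap, visible review signal, while a model that fabricates can only be caught when a second, independent extraction disagrees with it.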

Editorial Opinion

This benchmark exposes a critical reliability gap in production AI systems. While GPT-5.5's internally consistent fabrications might fool simple validation systems, Anthropic's conservative null-return approach represents a more trustworthy strategy for high-stakes applications like financial document processing. The fact that GPT-5.5 failed to improve at higher reasoning effort levels—when GPT-5.4 did—suggests OpenAI's newer model may have regressed on this critical dimension.

Large Language Models (LLMs) · Generative AI · Finance & Fintech · Ethics & Bias
