Research Reveals LLMs Corrupt Documents During Delegated Work — Major Models Fail at Reliability
Key Takeaways
- Frontier LLMs (including Claude 4.6 Opus, GPT 5.4, Gemini 3.1 Pro) corrupt ~25% of document content in long delegated workflows
- Document degradation is silent and severe: errors compound over time without explicit warnings to users
- Agentic tool use does not mitigate corruption, and larger documents, longer interactions, and distractor files exacerbate the problem
Summary
A new arXiv research paper titled "LLMs Corrupt Your Documents When You Delegate" challenges the readiness of current AI systems for real-world delegation workflows. The study, which introduces the DELEGATE-52 benchmark, evaluates 19 LLMs across 52 professional domains (coding, crystallography, music notation, and more) in long, complex document-editing workflows. The findings are sobering: frontier models including GPT 5.4 (OpenAI), Gemini 3.1 Pro (Google), and Claude 4.6 Opus (Anthropic) corrupt an average of 25% of document content by the end of extended workflows, with other models performing even worse.
The research reveals that document degradation is not isolated to weaker models—even state-of-the-art frontier systems silently introduce sparse but severe errors that compound over time. Additional experiments show that agentic tool use does not improve performance, and that degradation worsens with document size, interaction length, and the presence of distractor files. The authors conclude that current LLMs are unreliable delegates, raising critical questions about their trustworthiness for knowledge work automation and the emerging "vibe coding" paradigm.
In short, the paper argues that current LLMs lack the reliability required for trust-critical delegation tasks across professional domains such as coding, legal, and scientific work.
Editorial Opinion
This research is a wake-up call for the AI industry. As enterprises and developers rush to delegate real work to LLMs, the DELEGATE-52 findings expose a critical gap between capability and reliability. The 25% document corruption rate in frontier models should spark urgent focus on robustness, verification, and user safeguards—not just raw performance metrics. Until LLMs can be trusted to edit documents without silent corruption, delegation will remain a risky proposition for knowledge work.