BotBeat

Microsoft · RESEARCH · 2026-04-27

Microsoft Research Finds Frontier LLMs Corrupt Documents During Long Delegated Workflows

Key Takeaways

  • Microsoft's DELEGATE-52 benchmark reveals that even frontier LLMs corrupt approximately 25% of document content during long delegated workflows across 52 professional domains
  • The study evaluated 19 LLMs, including Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, with all models showing significant document degradation despite their advanced capabilities
  • Agentic tool use does not improve reliability, and document corruption worsens with file size, interaction length, and the presence of distractor content, suggesting systematic issues rather than implementation problems
Source: Hacker News — https://arxiv.org/abs/2604.15597

Summary

Microsoft researchers have published a new study introducing DELEGATE-52, a comprehensive benchmark that evaluates how reliably Large Language Models can perform autonomous document editing across 52 professional domains. The research tested 19 different LLMs, including frontier models such as Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, revealing a critical limitation: even the most advanced models corrupt approximately 25% of document content by the end of long interaction workflows.

The benchmark simulates realistic delegated workflows spanning diverse professional domains, including coding, crystallography, and music notation. The researchers discovered that current LLMs silently introduce sparse but severe errors that compound over extended interactions. Notably, agentic tool use did not improve performance, and document degradation worsened with larger files, longer workflows, and the presence of distractor content.
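The summary above does not specify how the researchers score corruption; one plausible way to quantify "fraction of document content corrupted" is a line-level diff between the original document and the model's final output. The function name and line-level granularity below are assumptions for illustration, not the paper's actual metric:

```python
import difflib

def corruption_rate(original: str, edited: str) -> float:
    """Fraction of original lines altered or dropped in the edited document.

    A crude line-diff proxy for a document-fidelity metric; the paper's
    actual scoring method is not described in this summary.
    """
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    if not orig_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines)
    # Matching blocks cover the original lines that survived unchanged.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(orig_lines)

doc = "alpha\nbeta\ngamma\ndelta"
bad = "alpha\nbeta\nGAMMA\ndelta"   # one of four lines silently altered
print(corruption_rate(doc, bad))    # → 0.25
```

On this toy input, one silently modified line out of four yields the 25% figure the study reports at the document level; a real metric would likely also weight severity and operate at finer granularity than whole lines.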

These findings have significant implications for organizations considering AI-powered automation of knowledge work. The study demonstrates that current LLMs cannot be trusted for autonomous document editing without human oversight, as silent corruption could introduce subtle but damaging errors into critical professional documents. The research challenges the readiness of current AI systems for true delegated workflows, in which humans depend on LLMs to faithfully execute complex tasks.

  • Current LLMs cannot yet serve as faithful delegates for autonomous document editing in knowledge work without human oversight and validation

Editorial Opinion

This research exposes a critical gap between frontier LLM capabilities on controlled benchmarks and their actual reliability in production workflows. The 25% document corruption rate among state-of-the-art models is alarming and should significantly temper enthusiasm for AI-driven knowledge work automation. The fact that agentic tool use provides no improvement suggests this is a fundamental limitation of current model architectures rather than a solvable engineering problem. Organizations considering delegating document editing to AI systems must treat this as a wake-up call to invest heavily in human oversight and validation mechanisms before adopting these systems at scale.

Large Language Models (LLMs) · Generative AI · AI Agents · AI Safety & Alignment

More from Microsoft

Microsoft
FUNDING & BUSINESS

Microsoft to Invest $18B in Australia to Expand AI and Cloud Infrastructure

2026-04-27
Microsoft
UPDATE

GitHub Copilot Silently Adds Itself as Co-Author Without User Consent

2026-04-27
Microsoft
PARTNERSHIP

HMRC Deploys Microsoft Copilot to 28,000 UK Tax Staff, Eyes Sensitive Government Work

2026-04-27


Suggested

Rocketship
PRODUCT LAUNCH

Rocketship Launches AI App Builder with Autonomous Sales Agents

2026-04-27
GitHub
UPDATE

GitHub Removes GPT-5.3-Codex from Copilot Student Model Picker

2026-04-27
Google / Alphabet
UPDATE

Google Prepares Credit-Based System for Gemini App and New Image Tools

2026-04-27
© 2026 BotBeat