Microsoft Research Finds Frontier LLMs Corrupt Documents During Long Delegated Workflows
Key Takeaways
- Microsoft's DELEGATE-52 benchmark reveals that even frontier LLMs corrupt approximately 25% of document content during long delegated workflows across 52 professional domains
- The study evaluated 19 LLMs, including Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, with all models showing significant document degradation despite their advanced capabilities
- Agentic tool use does not improve reliability, and document corruption worsens with file size, interaction length, and the presence of distractor content, suggesting systematic issues rather than implementation problems
Summary
Microsoft researchers have published a new study introducing DELEGATE-52, a comprehensive benchmark that evaluates how reliably Large Language Models can perform autonomous document editing across 52 professional domains. The research tested 19 different LLMs, including frontier models such as Claude 4.6 Opus, Gemini 3.1 Pro, and GPT 5.4, revealing a critical limitation: even the most advanced models corrupt approximately 25% of document content by the end of long interaction workflows.
The benchmark simulates realistic delegated workflows spanning diverse professional domains, including coding, crystallography, and music notation. The researchers found that current LLMs silently introduce sparse but severe errors that compound over extended interactions. Notably, agentic tool use did not improve performance, and document degradation worsened with larger files, longer workflows, and the presence of distractor content.
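The paper's scoring code is not reproduced here, but the reported corruption rate can be pictured as the fraction of a reference document that no longer survives in the model's edited output. A minimal, illustrative sketch of such a metric, using a line-level diff (the function name and metric definition are assumptions, not the study's actual methodology):

```python
import difflib

def corruption_rate(reference: str, edited: str) -> float:
    """Fraction of reference lines lost or altered in the edited
    document, measured with a line-level diff (illustrative only)."""
    ref_lines = reference.splitlines()
    if not ref_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(None, ref_lines, edited.splitlines())
    # Count lines that survive unchanged in the edited version.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(ref_lines)

original = "alpha\nbravo\ncharlie\ndelta\n"
damaged = "alpha\nbravo\nCHARLIE??\ndelta\n"
print(corruption_rate(original, damaged))  # 0.25 (1 of 4 lines altered)
```

Under a metric like this, a score of 0.25 corresponds to the headline finding: roughly a quarter of the original content fails to survive a long delegated editing session.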
These findings have significant implications for organizations considering AI-powered automation of knowledge work. The study demonstrates that current LLMs cannot be trusted for autonomous document editing without human oversight, as silent corruption could introduce subtle but damaging errors into critical professional documents. The research effectively challenges the readiness of current AI systems for true delegated workflows where humans depend on LLMs to faithfully execute complex tasks.
The bottom line: current LLMs cannot be reliably trusted for autonomous document editing and cannot serve as faithful delegates in knowledge work without human oversight and validation.
Editorial Opinion
This research exposes a critical gap between frontier LLM capabilities on controlled benchmarks and their actual reliability in production workflows. The 25% document corruption rate among state-of-the-art models is alarming and should significantly temper enthusiasm for AI-driven knowledge work automation. The fact that agentic tool use provides no improvement suggests this is a fundamental limitation of current model architectures rather than a solvable engineering problem. Organizations considering delegating document editing to AI systems must treat this as a wake-up call to invest heavily in human oversight and validation mechanisms before adopting these systems at scale.