BotBeat

Microsoft · RESEARCH · 2026-05-13

Microsoft Study Reveals AI Models Fail at Long-Running Tasks, Losing 25% of Document Content

Key Takeaways

  • Frontier AI models lose an average of 25% of document content over 20 delegated interactions; degradation reaches 50% across all models
  • Only Python programming achieved "ready for delegation" status (98%+ accuracy) out of 52 tested professional domains
  • Catastrophic corruption occurs in more than 80% of model/domain combinations, contradicting company claims about autonomous workflow capabilities
Source: Hacker News · https://www.theregister.com/ai-ml/2026/05/11/microsoft-researchers-find-ai-models-and-agents-cant-handle-long-running-tasks/5238263

Summary

Microsoft Research has published findings that challenge widespread claims about AI agents' readiness for autonomous workflows. Researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville created DELEGATE-52, a benchmark testing how frontier language models (Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro) handle multistep professional tasks across 52 domains including accounting, code writing, and music notation.

The results are stark: frontier models lose an average of 25 percent of document content over 20 delegated interactions, with overall degradation across all models reaching 50 percent. When researchers set a "ready" threshold of 98 percent accuracy after 20 interactions, only Python programming qualified as a domain where current LLMs are adequately prepared for delegated workflows. Google's Gemini 3.1 Pro, the best performer, was ready for only 11 of 52 domains tested.

The study found catastrophic corruption (scores of 80 percent or less) in more than 80 percent of model/domain combinations, directly contradicting company marketing claims. Anthropic promotes Claude as handling autonomous tasks to "return a finished deliverable," while Microsoft touts Copilot's ability to "tackle complex, multistep research." The research suggests businesses should approach AI agents for critical document workflows with extreme caution.

Notably, errors cluster rather than accumulate: models lose 10-30 percentage points in single interactions, making failures difficult to predict.
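The study's thresholds can be illustrated with a minimal sketch. The retention metric, scoring function, and trajectory below are hypothetical assumptions for illustration only; DELEGATE-52's actual methodology is not reproduced here.

```python
# Hypothetical sketch of the kind of scoring the study describes:
# track how much of a reference document survives each delegated
# interaction, then apply the reported thresholds. The metric and
# data are illustrative, not the researchers' actual code.

def retention(reference: set[str], current: set[str]) -> float:
    """Fraction of reference content units still present (0-100)."""
    if not reference:
        return 100.0
    return 100.0 * len(reference & current) / len(reference)

def assess(scores: list[float],
           ready_threshold: float = 98.0,    # "ready for delegation"
           corrupt_threshold: float = 80.0,  # "catastrophic corruption"
           cliff: float = 10.0):             # clustered single-step loss
    """Classify a per-interaction retention trajectory."""
    final = scores[-1]
    drops = [a - b for a, b in zip(scores, scores[1:])]
    return {
        "ready": final >= ready_threshold,
        "catastrophic": final <= corrupt_threshold,
        "clustered_failure": any(d >= cliff for d in drops),
    }

# Illustrative 20-interaction trajectory: one 25-point cliff rather
# than smooth accumulation, matching the clustering the study reports.
scores = [100.0] * 7 + [75.0] * 13
print(assess(scores))
# {'ready': False, 'catastrophic': True, 'clustered_failure': True}
```

Under this framing, a domain counts as "ready" only when the final score stays at or above 98 after all 20 interactions, which the study found for Python programming alone.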

Editorial Opinion

This research exposes a critical gap between AI vendor marketing and reality. Companies like Anthropic and Microsoft have aggressively promoted agents as capable of handling complex autonomous workflows, yet this rigorous study from Microsoft's own researchers shows the models are broadly unready for the task. The findings are particularly damning given that frontier models fail catastrophically in more than 80 percent of model/domain combinations. Organizations considering AI agents for critical workflows should delay deployment until significant reliability improvements materialize.

Large Language Models (LLMs) · Natural Language Processing (NLP) · AI Agents · Machine Learning

More from Microsoft

Microsoft
POLICY & REGULATION

Microsoft Launches Investigation into Israeli Military's Surveillance Use of Azure Cloud Storage

2026-05-13
Microsoft
PRODUCT LAUNCH

Microsoft Launches Multi-Model Agentic Security System Achieving Top Benchmark Performance

2026-05-13
Microsoft
RESEARCH

Critical RCE Vulnerability Discovered in VSCode Copilot Chat Agent Mode

2026-05-13


Suggested

Meta
PRODUCT LAUNCH

OGX 1.0 Launches: Open-Source Server Unifies OpenAI, Anthropic, and Google SDKs

2026-05-13
NVIDIA
OPEN SOURCE

NVIDIA Releases Numba-CUDA-MLIR: MLIR-Based GPU Compiler for Python

2026-05-13
Anthropic
UPDATE

Anthropic Integrates Claude Code Sessions with GitHub and Linear Issues

2026-05-13
© 2026 BotBeat