BotBeat
RESEARCH · Microsoft · 2026-05-13

Microsoft Study Reveals AI Models Fail at Long-Running Tasks, Losing 25% of Document Content

Key Takeaways

  • Frontier AI models lose an average of 25% of document content over 20 delegated interactions; averaged across all models tested, degradation reaches 50%
  • Out of 52 tested professional domains, only Python programming achieved 'ready for delegation' status (98%+ accuracy after 20 interactions)
  • Catastrophic corruption occurs in more than 80% of model/domain combinations, contradicting company claims about autonomous workflow capabilities
Source: Hacker News, https://www.theregister.com/ai-ml/2026/05/11/microsoft-researchers-find-ai-models-and-agents-cant-handle-long-running-tasks/5238263

Summary

Microsoft Research has published findings that challenge widespread claims about AI agents' readiness for autonomous workflows. Researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville created DELEGATE-52, a benchmark testing how frontier language models (Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro) handle multistep professional tasks across 52 domains including accounting, code writing, and music notation.

The results are stark: frontier models lose an average of 25 percent of document content over 20 delegated interactions, and degradation averaged across all models tested reaches 50 percent. When researchers set a "ready" threshold of 98 percent accuracy after 20 interactions, only Python programming qualified as a domain where current LLMs are adequately prepared for delegated workflows. Even Google's Gemini 3.1 Pro, the best individual performer, met that threshold in only 11 of the 52 domains tested.

The study found catastrophic corruption (scores of 80 percent or less) in more than 80 percent of model/domain combinations, directly contradicting company marketing claims. Anthropic promotes Claude as handling autonomous tasks to "return a finished deliverable," while Microsoft touts Copilot's ability to "tackle complex, multistep research." The research suggests businesses should approach AI agents for critical document workflows with extreme caution.
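To make the two thresholds concrete, here is a minimal sketch that labels a final score using the cut-offs quoted above: 98 percent or more counts as "ready", 80 percent or less as "catastrophic". The middle label "degraded", the function name, and the sample scores are assumptions made up for illustration; they are not taken from the study or from DELEGATE-52.

```python
READY_THRESHOLD = 98.0         # "ready" bar quoted in the article (percent)
CATASTROPHIC_THRESHOLD = 80.0  # "catastrophic corruption" bar quoted in the article (percent)


def classify(score: float) -> str:
    """Label a final score (in percent) measured after 20 interactions."""
    if score >= READY_THRESHOLD:
        return "ready"
    if score <= CATASTROPHIC_THRESHOLD:
        return "catastrophic"
    # Hypothetical label for everything in between; the article does not name this band.
    return "degraded"


# Hypothetical scores, invented for illustration only (not results from the study).
sample_scores = {
    ("model_a", "python"): 98.6,
    ("model_a", "accounting"): 74.0,
    ("model_b", "music_notation"): 91.5,
}

labels = {key: classify(score) for key, score in sample_scores.items()}

# Share of model/domain combinations that fall in the catastrophic band.
catastrophic_share = sum(label == "catastrophic" for label in labels.values()) / len(labels)

for (model, domain), label in labels.items():
    print(f"{model:8s} {domain:15s} {label}")
print(f"catastrophic share: {catastrophic_share:.0%}")
```

Applied to the real benchmark results, a tally like `catastrophic_share` is the kind of figure the article summarizes as "more than 80 percent of model/domain combinations".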

  • Errors cluster rather than accumulate gradually: models lose 10 to 30 percentage points in single interactions, making failures difficult to predict
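The difference between gradual and clustered loss can be shown with a little arithmetic. The sketch below compares two made-up trajectories that both end with 75 percent of the content retained after 20 interactions, matching the 25 percent average loss reported for frontier models. The smooth trajectory assumes the same fraction is lost at every step; the clustered one assumes a single 25-point drop at an arbitrarily chosen interaction. Neither is data from the study.

```python
STEPS = 20
FINAL_RETAINED = 0.75  # 25% average loss reported for frontier models

# Smooth case (assumption): the same fraction of content is lost at every interaction.
per_step_retention = FINAL_RETAINED ** (1 / STEPS)
smooth = [per_step_retention ** k for k in range(STEPS + 1)]

# Clustered case (hypothetical): nothing is lost until one 25-point drop at interaction 13.
DROP_AT = 13
clustered = [1.0 if k < DROP_AT else FINAL_RETAINED for k in range(STEPS + 1)]


def largest_single_drop(trajectory):
    """Return the biggest loss between two consecutive interactions."""
    return max(before - after for before, after in zip(trajectory, trajectory[1:]))


print(f"smooth per-step loss:       {1 - per_step_retention:.2%}")
print(f"largest drop (smooth):      {largest_single_drop(smooth):.2%}")
print(f"largest drop (clustered):   {largest_single_drop(clustered):.2%}")
```

Both trajectories finish at the same place, but the smooth one never loses more than about 1.4 points in a step, while the clustered one looks intact until a single interaction removes 25 points. That is why spot-checking the first few interactions says little about where a document will end up.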

Editorial Opinion

This research exposes a critical gap between AI vendor marketing and reality. Companies like Anthropic and Microsoft have aggressively promoted agents as capable of handling complex autonomous workflows, yet this study from Microsoft's own researchers shows the models are broadly unready for long-running delegated work. The findings are particularly damning given that catastrophic corruption appeared in more than 80% of model/domain combinations. Organizations considering AI agents for critical document workflows should delay deployment until significant reliability improvements materialize.

Large Language Models (LLMs) · Natural Language Processing (NLP) · AI Agents · Machine Learning
