METR Metrics Show AI Task-Completion Ability Doubling Every 7 Months; One-Month Horizon Expected by 2029
Key Takeaways
- METR introduces the '50%-task-completion time horizon' metric, which measures the length of software tasks AI models can complete, a more interpretable gauge of progress than raw benchmark scores
- AI task-completion ability has been doubling every 7 months; o3 now reaches 110 minutes vs. GPT-2's 2 seconds, with a one-month horizon projected for mid-2029 (central estimate)
- Critical caveat: the time horizon at an 80% success rate is 4-6x shorter than at 50%, and performance collapses in messy, real-world environments that lack clear feedback mechanisms or documentation
Summary
METR (Model Evaluation & Threat Research) has introduced a new benchmark metric called the "50%-task-completion time horizon," which measures the length of software engineering tasks that frontier AI models can complete at a 50% success rate. Evaluating 12 frontier models on 170 tasks, benchmarked against over 800 human developer baselines, the research reveals a striking trend: this time horizon has been doubling every 7 months since 2019, with the o3 model now reaching a 110-minute horizon compared to GPT-2's 2 seconds. If this exponential trend continues, AI systems could reach a one-month time horizon (equivalent to 167 working hours of skilled developer effort) between mid-2028 and mid-2031, with a central estimate of mid-2029.
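As a rough back-of-envelope check on that projection, the sketch below extrapolates directly from the figures quoted above (o3's 110-minute horizon, a 7-month doubling period, and 167 working hours per month). It is a simplification for illustration, not METR's actual forecasting model.

```python
import math

# Naive extrapolation of the reported trend, using only the numbers
# quoted in this summary (not METR's full forecasting methodology).
doubling_period_months = 7          # reported doubling time of the 50% horizon
current_horizon_minutes = 110       # o3's 50% time horizon, early 2025
target_horizon_minutes = 167 * 60   # "one month" = 167 working hours

# Number of doublings needed to go from 110 minutes to one month of work,
# then the calendar time those doublings would take.
doublings = math.log2(target_horizon_minutes / current_horizon_minutes)
months_out = doublings * doubling_period_months

print(f"doublings needed: {doublings:.1f}")         # ~6.5
print(f"months from early 2025: {months_out:.0f}")  # ~46, i.e. late 2028
```

This naive arithmetic lands near the lower end of METR's mid-2028 to mid-2031 window; the mid-2029 central estimate reflects uncertainty in the trend that the calculation above ignores.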
However, METR's analysis reveals critical caveats that temper these headline numbers. The 50% success rate masks a significant reliability gap: at an 80% success rate, the time horizon is 4-6x shorter, meaning that by the time models can complete month-long tasks half the time, the tasks they can complete reliably will be closer to a week long. Performance degrades sharply in messy environments that lack clear feedback mechanisms such as automated tests, and the metrics are calibrated against general-purpose developers, not domain experts. AI performance on real pull requests aligns more closely with that of low-context contractors (5-18x slower than expert maintainers) than with experienced developers, suggesting the effective horizon on complex legacy systems is far shorter than the headline figure.
The research also highlights that well-structured, well-tested codebases will see far greater gains from AI automation than legacy systems with poor documentation. This suggests an uneven near-term impact: greenfield projects and new feature development could be transformed by autonomous AI agents, while deep maintenance, debugging of subtle production issues, and evolution of tightly coupled legacy systems will remain primarily human work. The study's methodology is rigorous and reproducible, with validation across multiple benchmarks including SWE-bench, though the task distribution remains skewed toward isolated coding challenges rather than the full scope of real-world software engineering, such as architectural decision-making and live operations.
- AI agents perform at the level of low-context contractors (5-18x slower than expert maintainers), suggesting significant gaps on legacy systems and domain-specific tasks that require institutional knowledge
- Greenfield work and well-tested codebases will see rapid gains, while maintenance, debugging, and deeply coupled legacy systems will remain harder to automate for years to come
Editorial Opinion
METR's research makes a valuable contribution by centering AI progress on a practical, human-calibrated metric rather than abstract benchmark scores. The exponential trend is genuinely striking, but the reliability gap between 50% and 80% success rates deserves equal attention; an AI system that completes a month's worth of work only some of the time is useful primarily for greenfield projects, not for the maintenance and debugging that consume much of real software engineering. The near-term reality is likely a sharp productivity divide: teams building new systems from scratch will see transformative gains, while teams maintaining complex legacy codebases will realize much more modest improvements.