New Benchmark Reveals AI Agents Struggle to Automate Real-World Remote Work
Key Takeaways
- The Remote Labor Index (RLI) tests AI agents on real remote work projects worth over $140,000 and representing more than 6,000 hours of human labor, spanning industries from game development to data analysis
- State-of-the-art AI agents currently perform near the floor on these tasks, with very low automation rates and most projects falling short of professionally acceptable quality
- While current performance is low, the benchmark shows measurable progress across models, providing a framework to track the actual pace of AI-driven labor automation
Summary
The Center for AI Safety, in collaboration with Scale AI, has released the Remote Labor Index (RLI), a comprehensive benchmark designed to measure how well AI agents can automate actual remote work projects. The benchmark comprises real-world, economically valuable tasks spanning game development, product design, architecture, data analysis, and video animation—representing over 6,000 hours of work valued at more than $140,000. Unlike traditional AI benchmarks focused on knowledge and reasoning, RLI tests end-to-end agent performance on complete projects that human professionals have actually completed for clients.
The results reveal a stark reality: state-of-the-art AI agents perform near the floor on these tasks, with even the best-performing models achieving very low automation rates. The research demonstrates that contemporary AI systems fail to complete the vast majority of projects at a quality level that would be acceptable as commissioned work. Projects in the benchmark vary widely in complexity, with some costing over $10,000 and requiring more than 100 hours of human labor to complete.
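An "automation rate" of this kind can be measured in more than one way: as the share of projects completed acceptably, or weighted by each project's dollar value. The sketch below is purely illustrative; the project names, values, and both rate definitions are assumptions for demonstration, not the RLI's actual scoring method.

```python
# Hypothetical sketch of project-level automation rates. The RLI's
# real scoring methodology is defined in the paper, not here.
def automation_rates(results):
    """Return (simple, value-weighted) automation rates.

    `results` maps project id -> dict with 'accepted' (bool, did the
    agent's deliverable meet professional quality?) and 'value_usd'
    (float, what the client paid a human for the project).
    """
    n = len(results)
    accepted = [r for r in results.values() if r["accepted"]]
    # Simple rate: fraction of projects completed acceptably.
    simple = len(accepted) / n if n else 0.0
    # Value-weighted rate: fraction of total project value automated.
    total_value = sum(r["value_usd"] for r in results.values())
    automated_value = sum(r["value_usd"] for r in accepted)
    weighted = automated_value / total_value if total_value else 0.0
    return simple, weighted

# Invented example data, illustrating how a single cheap success
# barely moves a value-weighted rate when expensive projects fail.
projects = {
    "logo-design": {"accepted": False, "value_usd": 400.0},
    "data-cleanup": {"accepted": True, "value_usd": 250.0},
    "game-level": {"accepted": False, "value_usd": 10_000.0},
    "animation": {"accepted": False, "value_usd": 1_200.0},
}
simple, weighted = automation_rates(projects)
print(f"simple: {simple:.1%}, value-weighted: {weighted:.1%}")
# → simple: 25.0%, value-weighted: 2.1%
```

The gap between the two numbers in the toy example mirrors a dynamic the benchmark highlights: succeeding only on small, cheap tasks translates into very little of the economic value being automated.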
Despite the low absolute performance, the researchers note that AI models are showing steady improvement, and progress on these complex tasks is measurable. The benchmark is designed to provide a common basis for tracking the trajectory of AI automation capabilities over time. By grounding discussions of AI-driven labor automation in empirical evidence, RLI aims to help stakeholders—including workers, employers, and policymakers—better understand and prepare for the actual pace and scope of AI's impact on the labor market.
The research paper, authored by a large team including researchers from the Center for AI Safety and Scale AI, is now available along with code on GitHub. The project includes a live leaderboard tracking model performance, enabling ongoing assessment of how AI capabilities evolve on real-world work tasks rather than academic benchmarks.
Editorial Opinion
The Remote Labor Index represents a crucial reality check for the AI industry and society at large. While AI systems have achieved impressive results on academic benchmarks and narrow tasks, this research reveals a significant gap between benchmark performance and the ability to deliver real economic value through complete, professional-quality work. The sobering results—showing frontier AI agents struggling with tasks that human freelancers routinely complete—suggest that fears of imminent widespread job displacement may be premature, even as the benchmark's design enables us to track when that picture changes. This kind of grounded, empirical assessment is exactly what's needed to replace both hype and anxiety with actionable intelligence.