BotBeat

Anthropic · RESEARCH · 2026-03-17

Frontier AI Agents Show Rapid Improvement in Multi-Step Cyber-Attack Scenarios

Key Takeaways

  • Frontier AI agents show rapid improvement in multi-step cyber-attack execution: GPT-4o (Aug 2024) completed an average of 1.7 steps, versus Opus 4.6 (Feb 2026) averaging 9.8 steps, on a corporate network scenario at a 10M-token budget
  • Scaling inference-time compute from 10M to 100M tokens yields performance gains of up to 59%, with implications for how frontier models are evaluated and how their risks are assessed
  • The newest model generation (Opus 4.6) demonstrates qualitatively different capabilities from its predecessors, with improved token efficiency and deeper specialist skills in reverse engineering, exploit development, and cryptography
Source: Hacker News (https://www.aisi.gov.uk/blog/how-do-frontier-ai-agents-perform-in-multi-step-cyber-attack-scenarios)

Summary

Anthropic researchers have published findings from a comprehensive evaluation of frontier AI models' capabilities in executing multi-step cyber-attacks within simulated network environments. Testing seven models released between August 2024 and February 2026, the study reveals dramatic improvements in autonomous cyber capabilities: on a 32-step corporate network attack scenario, performance improved from GPT-4o's average of 1.7 completed steps at 10M tokens to Opus 4.6's 9.8 steps—with the best single run completing 22 of 32 steps, equivalent to roughly 6 of the 14 hours a human expert would require.

The evaluation employed two realistic cyber ranges built by cybersecurity experts: "The Last Ones," a 32-step corporate network intrusion requiring credential theft, web application exploitation, binary reverse engineering, and SQL injection chains; and "Cooling Tower," a 7-step industrial control system attack targeting a simulated power plant's cooling system. The research demonstrates two key capability trends: successive model generations consistently outperform predecessors at fixed token budgets, and inference-time compute scaling delivers performance gains of up to 59%, with increases from 10M to 100M tokens showing substantial improvements in autonomous cyber execution.

The study explicitly notes that neither test scenario includes active defenders, meaning the results measure raw capability without accounting for detection systems or defensive countermeasures that would operate in real-world scenarios. The findings underscore the urgent need for more sophisticated AI evaluation methodologies beyond traditional capture-the-flag challenges, and highlight critical implications for AI safety, policy development, and cybersecurity risk assessment as frontier models rapidly advance.

  • Evaluation methodology using realistic cyber ranges reveals limitations of traditional CTF and Q&A benchmarks in measuring autonomous, long-horizon AI capabilities relevant to real-world security threats
  • Results measure raw capability without active defenders; real-world threat scenarios would include detection systems and defensive responses that could significantly impact agent performance

Editorial Opinion

This research represents an important step forward in responsible AI evaluation—moving beyond toy benchmarks to test realistic, complex scenarios that matter for security outcomes. The dramatic capability improvements across model generations underscore the urgency of developing both better evaluation frameworks and robust defensive capabilities in parallel with frontier model development. While the absence of active defenders provides a clearer picture of raw capability, it also highlights a critical gap: we need equivalent rigor in evaluating how real-world detection and response systems can mitigate these emerging risks. These findings should serve as a wake-up call for the AI safety and policy communities that autonomy and multi-step reasoning capabilities are advancing faster than our ability to robustly contain or defend against them.

Tags: AI Agents · Cybersecurity · AI Safety & Alignment · Research

© 2026 BotBeat