OpenAI's GPT-5.4 Underperforms in Autonomous Penetration Testing, Struggles with Multi-Step Tasks
Key Takeaways
- GPT-5.4 underperformed GPT-5.3 Codex in penetration testing scenarios, completing only 1 of 3 Hack The Box machines versus its predecessor's perfect score
- The model exhibited premature task termination, stopping after identifying initial attack vectors rather than completing full exploitation chains
- Analysis suggests GPT-5.4 is optimized for "clean task completion" rather than the persistent, multi-step exploration required for autonomous agent workflows
Summary
Independent security researcher HonzaT has reported disappointing results when testing OpenAI's newly released GPT-5.4 model with Strix, an autonomous AI tool for web penetration testing. In comparative tests against Hack The Box machines—intentionally vulnerable systems used for security training—GPT-5.4 significantly underperformed its predecessor, GPT-5.3 Codex. While the model successfully completed one machine quickly, it prematurely terminated exploitation on two others after merely identifying initial attack vectors, failing to follow through with privilege escalation and flag capture.
The researcher's analysis, corroborated by discussions with other AI models, suggests that GPT-5.4's optimization for "clean task completion with fewer iterations" may conflict with the persistent exploration required for penetration testing. This design choice appears to cause the model to interpret tasks as complete once an initial objective is met, rather than continuing multi-step exploitation chains. This behavior contradicts OpenAI's marketing claims that GPT-5.4 "incorporates the industry-leading coding capabilities of GPT-5.3-Codex" and excels at "agentic workflows."
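The reported failure mode can be pictured as a difference in stopping criteria within an agent loop. The toy sketch below is purely illustrative; every name in it (`run_agent`, the step labels, the `stop_after_first_success` flag) is hypothetical and does not represent Strix's or OpenAI's actual implementation. It only shows how a "stop at the first concrete result" policy truncates a multi-step exploitation chain that a "persist until the final objective" policy completes.

```python
# Illustrative sketch only: contrasting the two stopping behaviors
# described above. All names here are hypothetical stand-ins, not
# Strix's or OpenAI's actual agent logic.

def run_agent(steps, stop_after_first_success=False):
    """Walk an ordered exploitation chain; return the phases completed.

    `steps` is a list of (name, succeeded) pairs standing in for phases
    like recon -> initial foothold -> privilege escalation -> flag capture.
    """
    findings = []
    for name, succeeded in steps:
        if not succeeded:
            break  # a failed step blocks everything after it
        findings.append(name)
        if stop_after_first_success:
            # "Clean task completion": treat the first concrete result
            # (an identified attack vector) as the finished task.
            break
    return findings

chain = [
    ("identify attack vector", True),
    ("gain initial foothold", True),
    ("escalate privileges", True),
    ("capture flag", True),
]

print(run_agent(chain, stop_after_first_success=True))   # halts after step 1
print(run_agent(chain, stop_after_first_success=False))  # completes the chain
```

Under this framing, both policies "succeed" at something; they simply disagree about what the task is, which matches the researcher's observation that the model treated an identified vector as a completed objective.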
The findings highlight a potential tension between OpenAI's positioning of GPT-5.4 as a unified frontier model and its actual performance in specialized autonomous agent scenarios. While GPT-5.3 Codex successfully completed all three test machines, GPT-5.4's tendency to halt after initial discoveries suggests it may be less suitable for tasks requiring sustained, multi-step problem-solving. The researcher acknowledges that more specific prompting could improve performance, but notes that competing models understood the task requirements without such explicit guidance.
- The performance gap undercuts OpenAI's marketing claims that GPT-5.4 incorporates GPT-5.3-Codex's coding capabilities and excels at agentic tasks
- The findings raise questions about trade-offs in building general-purpose models versus specialized variants for different use cases
Editorial Opinion
This revealing real-world test exposes an uncomfortable truth about frontier model development: optimization choices matter enormously, and "bigger" doesn't always mean "better" for every use case. OpenAI's decision to consolidate capabilities into GPT-5.4 may have inadvertently created a model that's a jack-of-all-trades but master of none—particularly problematic for autonomous agent applications that require tenacious, multi-step problem-solving. The discrepancy between marketing promises and actual performance in specialized domains should serve as a cautionary tale for enterprises evaluating AI deployments based on benchmark scores and vendor claims rather than task-specific validation.