OpenAI's GPT-5.4 Underperforms in Autonomous Penetration Testing, Struggles with Multi-Step Tasks
Key Takeaways
- GPT-5.4 underperformed GPT-5.3 Codex in penetration testing scenarios, completing only 1 of 3 Hack The Box machines versus its predecessor's perfect score
- The model exhibited premature task termination, stopping after identifying initial attack vectors rather than completing full exploitation chains
- Analysis suggests GPT-5.4 is optimized for "clean task completion" rather than the persistent, multi-step exploration required for autonomous agent workflows
Summary
Independent security researcher HonzaT has reported disappointing results when testing OpenAI's newly released GPT-5.4 model with Strix, an autonomous AI tool for web penetration testing. In comparative tests against Hack The Box machines—intentionally vulnerable systems used for security training—GPT-5.4 significantly underperformed its predecessor, GPT-5.3 Codex. While the model successfully completed one machine quickly, it prematurely terminated exploitation on two others after merely identifying initial attack vectors, failing to follow through with privilege escalation and flag capture.
The researcher's analysis, corroborated by discussions with other AI models, suggests that GPT-5.4's optimization for "clean task completion with fewer iterations" may conflict with the persistent exploration required for penetration testing. This design choice appears to cause the model to interpret tasks as complete once an initial objective is met, rather than continuing multi-step exploitation chains. This behavior contradicts OpenAI's marketing claims that GPT-5.4 "incorporates the industry-leading coding capabilities of GPT-5.3-Codex" and excels at "agentic workflows."
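The reported failure mode can be pictured as a difference in stopping criteria within an agent loop. The toy sketch below is purely illustrative; every name in it (`run_agent`, the step labels, the `stop_after_first_success` flag) is hypothetical and does not represent Strix's or OpenAI's actual implementation. It only shows how a "stop at the first concrete result" policy truncates a multi-step exploitation chain that a "persist until the final objective" policy completes.

```python
# Illustrative sketch only: contrasting the two stopping behaviors
# described above. All names here are hypothetical stand-ins, not
# Strix's or OpenAI's actual agent logic.

def run_agent(steps, stop_after_first_success=False):
    """Walk an ordered exploitation chain; return the phases completed.

    `steps` is a list of (name, succeeded) pairs standing in for phases
    like recon -> initial foothold -> privilege escalation -> flag capture.
    """
    findings = []
    for name, succeeded in steps:
        if not succeeded:
            break  # a failed step blocks everything after it
        findings.append(name)
        if stop_after_first_success:
            # "Clean task completion": treat the first concrete result
            # (an identified attack vector) as the finished task.
            break
    return findings

chain = [
    ("identify attack vector", True),
    ("gain initial foothold", True),
    ("escalate privileges", True),
    ("capture flag", True),
]

print(run_agent(chain, stop_after_first_success=True))   # halts after step 1
print(run_agent(chain, stop_after_first_success=False))  # completes the chain
```

Under this framing, both policies "succeed" at something; they simply disagree about what the task is, which matches the researcher's observation that the model treated an identified vector as a completed objective.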
The findings highlight a potential tension between OpenAI's positioning of GPT-5.4 as a unified frontier model and its actual performance in specialized autonomous agent scenarios. While GPT-5.3 Codex successfully completed all three test machines, GPT-5.4's tendency to halt after initial discoveries suggests it may be less suitable for tasks requiring sustained, multi-step problem-solving. The researcher acknowledges that more specific prompting could improve performance, but notes that competing models understood the task requirements without such explicit guidance.
- The performance gap undercuts OpenAI's marketing claims that GPT-5.4 incorporates GPT-5.3-Codex's coding capabilities and excels at agentic tasks
- The findings raise questions about trade-offs in building general-purpose models versus specialized variants for different use cases
Editorial Opinion
This revealing real-world test exposes an uncomfortable truth about frontier model development: optimization choices matter enormously, and "bigger" doesn't always mean "better" for every use case. OpenAI's decision to consolidate capabilities into GPT-5.4 may have inadvertently created a model that's a jack-of-all-trades but master of none—particularly problematic for autonomous agent applications that require tenacious, multi-step problem-solving. The discrepancy between marketing promises and actual performance in specialized domains should serve as a cautionary tale for enterprises evaluating AI deployments based on benchmark scores and vendor claims rather than task-specific validation.