BotBeat

OpenAI · RESEARCH · 2026-03-06

OpenAI's GPT-5.4 Underperforms in Autonomous Penetration Testing, Struggles with Multi-Step Tasks

Key Takeaways

  • GPT-5.4 underperformed GPT-5.3 Codex in penetration testing scenarios, completing only 1 of 3 Hack The Box machines versus its predecessor's perfect score
  • The model exhibited premature task termination, stopping after identifying initial attack vectors rather than completing full exploitation chains
  • Analysis suggests GPT-5.4 is optimized for "clean task completion" rather than the persistent, multi-step exploration required for autonomous agent workflows
Source: Hacker News (https://theartificialq.github.io/2026/03/05/how-gpt-5-4-performed-with-strix-and-why-it-fell-short.html)

Summary

Independent security researcher HonzaT has reported disappointing results when testing OpenAI's newly released GPT-5.4 model with Strix, an autonomous AI tool for web penetration testing. In comparative tests against Hack The Box machines—intentionally vulnerable systems used for security training—GPT-5.4 significantly underperformed its predecessor, GPT-5.3 Codex. While the model successfully completed one machine quickly, it prematurely terminated exploitation on two others after merely identifying initial attack vectors, failing to follow through with privilege escalation and flag capture.

The researcher's analysis, corroborated by discussions with other AI models, suggests that GPT-5.4's optimization for "clean task completion with fewer iterations" may conflict with the persistent exploration required for penetration testing. This design choice appears to cause the model to interpret tasks as complete once an initial objective is met, rather than continuing multi-step exploitation chains. This behavior contradicts OpenAI's marketing claims that GPT-5.4 "incorporates the industry-leading coding capabilities of GPT-5.3-Codex" and excels at "agentic workflows."
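The failure mode described above can be illustrated with a toy agent loop. This is purely a hypothetical sketch: the step names, the `run_agent` function, and the stopping condition are illustrative assumptions, not the actual implementation of Strix or GPT-5.4.

```python
# Hypothetical illustration of how an "initial objective met" stopping
# criterion truncates a multi-step exploitation chain. All names here
# are invented for the example; none come from Strix or OpenAI.

STEPS = ["recon", "find_vuln", "initial_access", "priv_esc", "capture_flag"]

def run_agent(stop_after_initial_access: bool) -> list[str]:
    """Walk the exploitation chain, halting early if the agent
    treats initial access as task completion."""
    done = []
    for step in STEPS:
        done.append(step)
        if stop_after_initial_access and step == "initial_access":
            break  # "clean task completion": the objective looks met
    return done

# An agent tuned for fewer iterations stops two steps short of the flag:
eager = run_agent(stop_after_initial_access=True)
persistent = run_agent(stop_after_initial_access=False)
print(eager)       # ['recon', 'find_vuln', 'initial_access']
print(persistent)  # full five-step chain ending in 'capture_flag'
```

The point of the sketch is only that both agents execute the same steps up to initial access; the difference lies entirely in when each one decides the task is finished.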

The findings highlight a potential tension between OpenAI's positioning of GPT-5.4 as a unified frontier model and its actual performance in specialized autonomous agent scenarios. While GPT-5.3 Codex successfully completed all three test machines, GPT-5.4's tendency to halt after initial discoveries suggests it may be less suitable for tasks requiring sustained, multi-step problem-solving. The researcher acknowledges that more specific prompting could improve performance, but notes that competing models understood the task requirements without such explicit guidance.

  • The performance gap contradicts OpenAI's marketing claims about GPT-5.4 incorporating GPT-5.3-Codex's coding capabilities and excelling at agentic tasks
  • The findings raise questions about trade-offs in building general-purpose models versus specialized variants for different use cases

Editorial Opinion

This revealing real-world test exposes an uncomfortable truth about frontier model development: optimization choices matter enormously, and "bigger" doesn't always mean "better" for every use case. OpenAI's decision to consolidate capabilities into GPT-5.4 may have inadvertently created a model that's a jack-of-all-trades but master of none—particularly problematic for autonomous agent applications that require tenacious, multi-step problem-solving. The discrepancy between marketing promises and actual performance in specialized domains should serve as a cautionary tale for enterprises evaluating AI deployments based on benchmark scores and vendor claims rather than task-specific validation.

Large Language Models (LLMs) · AI Agents · Machine Learning · Cybersecurity · Product Launch
