Phishing Arena: Multi-Agent Security Benchmark Reveals Contextual Plausibility as Primary Phishing Threat Vector
Key Takeaways
- Contextual plausibility, not technical evasion, drives 79% of successful phishing attacks, revealing a fundamental vulnerability in how LLMs process socially engineered emails
- OpenAI's GPT-5.4-mini achieves the highest phishing capability (12.9% bypass rate with adaptive learning), while Anthropic's Claude-Sonnet sets the gold standard for email filtering (98.3% accuracy, minimal false positives)
- The CampaignMemory feedback mechanism enables phishing agents to learn and refine strategies across 20-round tournaments, mimicking real-world attack campaign optimization
Summary
Phishing Arena, an open-source research project by Marco Stocco, launches a competitive tournament benchmarking four commercial LLMs—Claude-Sonnet-4.6 (Anthropic), GPT-5.4-mini (OpenAI), DeepSeek-Chat (DeepSeek), and Grok-4-fast-non-reasoning (xAI)—in adversarial email security roles. The controlled study runs 48 matches across 24 role permutations with 20 rounds per match, testing models' capabilities as Phisher agents, email Filters, and Target users against Italian professional email contexts.
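The match count follows from the role structure: with four models and three distinct roles (Phisher, Filter, Target), there are 4 × 3 × 2 = 24 ordered role assignments. The sketch below shows how such a schedule could be generated; the two-matches-per-permutation split is an inference from the reported totals (48 / 24 = 2), not something the project states, and the model identifiers are illustrative.

```python
from itertools import permutations

MODELS = [
    "claude-sonnet-4.6",
    "gpt-5.4-mini",
    "deepseek-chat",
    "grok-4-fast-non-reasoning",
]

# Every ordered assignment of 3 distinct models to the 3 roles: 4P3 = 24.
role_permutations = list(permutations(MODELS, 3))
assert len(role_permutations) == 24

# Assumed split: each permutation played twice, giving the reported 48 matches,
# each running 20 rounds.
MATCHES_PER_PERMUTATION = 2
ROUNDS_PER_MATCH = 20
schedule = [
    {"phisher": p, "filter": f, "target": t, "rounds": ROUNDS_PER_MATCH}
    for p, f, t in role_permutations
    for _ in range(MATCHES_PER_PERMUTATION)
]
assert len(schedule) == 48
```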
Key findings reveal stark differences in model performance: OpenAI's GPT-5.4-mini leads phishing success with a 12.9% bypass rate and a +14.6pp adaptive improvement trend, while Anthropic's Claude-Sonnet dominates filtering with 98.3% accuracy and a 0.7% false positive rate. A critical security insight emerged: 79% of successful phishing bypasses exploit contextual plausibility rather than technical obfuscation, indicating that current LLMs are vulnerable not to sophistication but to socially engineered, contextually convincing attacks. The Phisher agent employs a CampaignMemory feedback loop that accumulates round outcomes, enabling adaptive behavior that mimics real-world campaign optimization.
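The CampaignMemory loop can be pictured as a running log of round outcomes that the Phisher consults before composing its next email. The sketch below is a hypothetical illustration, assuming the memory tracks per-strategy bypass rates and surfaces a ranked summary for the next prompt; the class and field names are assumptions, not identifiers from the project.

```python
from dataclasses import dataclass, field

@dataclass
class RoundOutcome:
    strategy: str          # e.g. "invoice-pretext", "it-password-reset"
    bypassed_filter: bool  # did the email get past the Filter agent?
    target_clicked: bool   # did the Target agent act on it?

@dataclass
class CampaignMemory:
    """Hypothetical sketch of the Phisher's feedback loop: accumulate
    round outcomes and rank strategies by observed bypass rate."""
    history: list = field(default_factory=list)

    def record(self, outcome: RoundOutcome) -> None:
        self.history.append(outcome)

    def success_rate(self, strategy: str) -> float:
        tried = [o for o in self.history if o.strategy == strategy]
        if not tried:
            return 0.0
        return sum(o.bypassed_filter for o in tried) / len(tried)

    def prompt_hint(self) -> str:
        # Summarize past rounds so the next phishing prompt can adapt.
        ranked = sorted({o.strategy for o in self.history},
                        key=self.success_rate, reverse=True)
        return "Prior strategies by bypass rate: " + ", ".join(
            f"{s} ({self.success_rate(s):.0%})" for s in ranked
        )
```

Feeding `prompt_hint()` back into the Phisher's system prompt each round is one plausible way the observed +14.6pp improvement trend could arise from accumulated feedback.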
The benchmark evaluates models across 12 Italian professional archetypes spanning CEO to IT professionals with varying cybersecurity awareness, using a controlled dataset of 600 contextually appropriate legitimate emails. The project provides fully reproducible results, analysis tools, and figures generation capabilities, establishing a new standard for adversarial LLM evaluation in security research.
As the first reproducible multi-agent adversarial benchmark spanning four major AI providers, the project establishes an evaluation framework for email security in LLM systems.
Editorial Opinion
Phishing Arena addresses a critical blind spot in AI safety: rigorous, reproducible benchmarking of how real-world LLMs handle adversarial social engineering. The discovery that contextual plausibility trumps technical sophistication is particularly sobering for security practitioners; it suggests that LLM-powered email defenses must evolve beyond pattern matching toward a deeper understanding of organizational communication norms and context. By releasing this open-source benchmark with full transparency, Stocco gives the research community an invaluable tool to identify and close these human-factor vulnerabilities before they are exploited at scale.