Comprehensive Safety Audit of Five Major LLMs Reveals Significant Vulnerabilities: 1 in 3 Harmful Requests Bypassed
Key Takeaways
- GPT-4o demonstrated the strongest safety performance (89.4% block rate), while Gemini 2.5 Pro was significantly weaker (43.9%), highlighting inconsistent safety standards across industry leaders
- Copyright/IP was the most frequently bypassed category across models (53% of attempts succeeded), and privacy filters failed 69% of the time even in the best-performing model, pointing to systematic weaknesses in specific safety categories
- The benchmark was released as an open-source tool covering 42 prompting techniques and 16 risk categories, enabling reproducible evaluation and continuous improvement of LLM safety systems
- All tested models still let 20-56% of harmful requests through in specific categories, with weapons/CBRN content showing persistent vulnerabilities despite being the most heavily restricted
Summary
An independent researcher conducted a comprehensive safety benchmark across five major AI language models—GPT-4o, Claude Haiku, Grok, DeepSeek Chat, and Gemini 2.5 Pro—running 3,360 adversarial tests (16 risk categories × 42 prompting techniques × 5 models). The results reveal critical vulnerabilities: approximately one-third of harmful requests bypassed safety filters, with wide variation in defensive capability across models. GPT-4o emerged as the strongest performer with an 89.4% block rate, while Gemini 2.5 Pro was the most vulnerable at 43.9%, indicating inconsistent safety implementations across the industry.
The study identified copyright and intellectual property content as the category with the highest bypass rate (53%), found that privacy filters failed 69% of the time even in GPT-4o, and observed persistent weapons/CBRN vulnerabilities across all models. The researcher released the benchmark as an open-source tool, enabling the AI community to systematically evaluate and improve safety measures. Drawing on 42 distinct attack techniques—including jailbreaking, obfuscation, social engineering, and academic framing—the research shows that current safety systems struggle with nuanced categorization and remain vulnerable to sophisticated prompting strategies.
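The article doesn't reproduce the released tool's interface, but the reported numbers follow from a straightforward harness structure. The Python sketch below is a hypothetical illustration, assuming a `send_prompt` client stub and a crude keyword-based refusal judge (both stand-ins, not the benchmark's actual code), of how block rates per model would be tallied:

```python
from collections import defaultdict
from itertools import product

# Hypothetical stand-ins: the real benchmark's model list is from the article,
# but only a subset of its 16 categories and 42 techniques is shown here.
MODELS = ["gpt-4o", "claude-haiku", "grok", "deepseek-chat", "gemini-2.5-pro"]
CATEGORIES = ["privacy", "copyright_ip", "weapons_cbrn"]          # 3 of 16
TECHNIQUES = ["direct", "jailbreak", "obfuscation", "academic_framing"]  # 4 of 42

def send_prompt(model: str, category: str, technique: str) -> str:
    """Placeholder: call the model's API with the adversarial prompt."""
    raise NotImplementedError

def was_blocked(response: str) -> bool:
    """Crude stand-in judge: real benchmarks use stronger refusal detection."""
    refusal_markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(marker in response.lower() for marker in refusal_markers)

def run_benchmark() -> None:
    blocked = defaultdict(int)  # (model, category) -> refused attempts
    total = defaultdict(int)    # (model, category) -> all attempts
    for model, category, technique in product(MODELS, CATEGORIES, TECHNIQUES):
        response = send_prompt(model, category, technique)
        total[(model, category)] += 1
        if was_blocked(response):
            blocked[(model, category)] += 1
    # Block rate per model: share of adversarial prompts that were refused.
    for model in MODELS:
        refused = sum(blocked[(model, c)] for c in CATEGORIES)
        attempts = sum(total[(model, c)] for c in CATEGORIES)
        print(f"{model}: {100 * refused / attempts:.1f}% block rate")
```

With the full 16 categories and 42 techniques, the loop runs 672 prompts per model, or 3,360 across all five, matching the reported test count.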
Editorial Opinion
This comprehensive safety audit serves as both a wake-up call and a constructive tool for the AI industry. While GPT-4o's 89.4% block rate may seem reassuring, the fact that roughly one in three harmful requests bypasses safety filters overall—and more than half do in the weakest model—underscores the complexity of content moderation at scale. The open-source release of this benchmark is particularly valuable; rather than functioning as a jailbreak tutorial, it gives the community standardized metrics to measure progress and identify gaps. The stark performance gap between models (GPT-4o's 89.4% vs. Gemini 2.5 Pro's 43.9%) suggests that safety implementation remains an art rather than a mature science, and systematic approaches like this benchmark are essential for raising the baseline.