Comprehensive Safety Audit of Five Major LLMs Reveals Significant Vulnerabilities: 1 in 3 Harmful Requests Bypassed
Key Takeaways
- GPT-4o demonstrated the strongest safety performance (89.4% block rate), while Gemini 2.5 Pro was significantly weaker (43.9%), highlighting inconsistent safety standards across industry leaders
- Copyright/IP was the most frequently bypassed category across models (53% of attempts succeeded), and privacy filters failed 69% of the time even in the best-performing model, pointing to systematic weaknesses in specific safety categories
- The benchmark was released as an open-source tool covering 42 prompting techniques and 16 risk categories, enabling reproducible evaluation and continuous improvement of LLM safety systems
- All tested models still let 20-56% of harmful requests through in specific categories, with weapons/CBRN content showing persistent vulnerabilities despite being the most heavily restricted
Summary
An independent researcher conducted a comprehensive safety benchmark across five major AI language models—GPT-4o, Claude Haiku, Grok, DeepSeek Chat, and Gemini 2.5 Pro—running 3,360 adversarial tests (16 risk categories × 42 prompting techniques × 5 models). The results reveal critical vulnerabilities: approximately one-third of harmful requests bypassed safety filters, with wide variation in defensive capability across models. GPT-4o emerged as the strongest performer with an 89.4% block rate, while Gemini 2.5 Pro was the most vulnerable at 43.9%, indicating inconsistent safety implementations across the industry.
The study identified copyright and intellectual property content as the category with the highest bypass rate (53%), found that privacy filters failed 69% of the time even in GPT-4o, and observed persistent weapons/CBRN vulnerabilities across all models. The researcher released the benchmark as an open-source tool, enabling the AI community to systematically evaluate and improve safety measures. Drawing on 42 distinct attack techniques—including jailbreaking, obfuscation, social engineering, and academic framing—the research shows that current safety systems struggle with nuanced categorization and remain vulnerable to sophisticated prompting strategies.
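The article doesn't reproduce the released tool's interface, but the reported numbers follow from a straightforward harness structure. The Python sketch below is a hypothetical illustration, assuming a `send_prompt` client stub and a crude keyword-based refusal judge (both stand-ins, not the benchmark's actual code), of how block rates per model would be tallied:

```python
from collections import defaultdict
from itertools import product

# Hypothetical stand-ins: the real benchmark's model list is from the article,
# but only a subset of its 16 categories and 42 techniques is shown here.
MODELS = ["gpt-4o", "claude-haiku", "grok", "deepseek-chat", "gemini-2.5-pro"]
CATEGORIES = ["privacy", "copyright_ip", "weapons_cbrn"]          # 3 of 16
TECHNIQUES = ["direct", "jailbreak", "obfuscation", "academic_framing"]  # 4 of 42

def send_prompt(model: str, category: str, technique: str) -> str:
    """Placeholder: call the model's API with the adversarial prompt."""
    raise NotImplementedError

def was_blocked(response: str) -> bool:
    """Crude stand-in judge: real benchmarks use stronger refusal detection."""
    refusal_markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(marker in response.lower() for marker in refusal_markers)

def run_benchmark() -> None:
    blocked = defaultdict(int)  # (model, category) -> refused attempts
    total = defaultdict(int)    # (model, category) -> all attempts
    for model, category, technique in product(MODELS, CATEGORIES, TECHNIQUES):
        response = send_prompt(model, category, technique)
        total[(model, category)] += 1
        if was_blocked(response):
            blocked[(model, category)] += 1
    # Block rate per model: share of adversarial prompts that were refused.
    for model in MODELS:
        refused = sum(blocked[(model, c)] for c in CATEGORIES)
        attempts = sum(total[(model, c)] for c in CATEGORIES)
        print(f"{model}: {100 * refused / attempts:.1f}% block rate")
```

With the full 16 categories and 42 techniques, the loop runs 672 prompts per model, or 3,360 across all five, matching the reported test count.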
Editorial Opinion
This comprehensive safety audit serves as both a wake-up call and a constructive tool for the AI industry. While GPT-4o's 89.4% block rate may seem reassuring, the fact that roughly one in three harmful requests bypasses safety filters overall—and more than half do in the weakest model—underscores the complexity of content moderation at scale. The open-source release of this benchmark is particularly valuable; rather than functioning as a jailbreak tutorial, it gives the community standardized metrics to measure progress and identify gaps. The stark performance gap between models (GPT-4o's 89.4% vs. Gemini 2.5 Pro's 43.9%) suggests that safety implementation remains an art rather than a mature science, and systematic approaches like this benchmark are essential for raising the baseline.