Meta and Google's AI Safety Controls Can Be Stripped in Minutes, FT Testing Reveals
Key Takeaways
- ▸Meta's Llama 3.3 and Google's Gemma 3 can have safety controls removed in under 10 minutes using publicly available tools like Heretic
- ▸Post-training safety alignment is architecturally vulnerable and can be peeled away, leaving models without restrictions
- ▸Current regulatory and governance frameworks lack clear accountability structures for modified open-weight models
Summary
Financial Times testing in partnership with AI safety group Alice has exposed a critical vulnerability: the safety controls embedded in Meta's Llama 3.3 and Google's Gemma 3 open-weight models can be dismantled in under 10 minutes using freely available tools like Heretic on GitHub. Once removed, the models produce unrestricted outputs on prohibited topics including biological weapons and malware creation, demonstrating that post-training safety alignment is easily circumventable.
The vulnerability highlights a fundamental architectural flaw in current AI safety approaches. Companies embed safety guardrails during post-training alignment, but once the model weights are released publicly, developers can strip away these protections entirely. Thousands of modified variants of popular open-weight models already circulate across developer platforms, many with compromised safety controls.
The findings have sparked intense debate over accountability and governance. If a modified Llama variant generates dangerous content, responsibility is ambiguous across multiple parties: Meta, the developer who modified the model, hosting platforms, and end users. Current regulatory frameworks lack clear guidance. The discovery adds pressure on governments already considering AI regulation and raises questions about whether open-weight model distribution itself requires structural guardrails.
- Governments are likely to increase scrutiny on open-weight AI releases, potentially favoring decentralized or more restricted distribution models
Editorial Opinion
This research exposes a critical limitation in relying on post-training controls as the primary safety mechanism for open-weight models. If safety measures can be removed as easily as applying a public tool, then safety must be architected at a more fundamental level—whether through foundational model design or structural constraints on distribution. For regulators, this is a wake-up call: voluntary safety commitments from tech giants are insufficient when the technical means to circumvent them are freely available and require only minutes to deploy.

