Google Open-Sources AMS Tool for Detecting Unsafe LLM Fine-Tunes in Seconds
Key Takeaways
- ▸Detects unsafe fine-tuning and safety training removal in 10-40 seconds using activation space analysis, filling a critical gap in model vetting for deployment safety
- ▸Uses activation fingerprinting to identify when safety-relevant concept directions have "collapsed" due to unsafe fine-tuning, catching models that other methods miss
- ▸Open-source tool available on PyPI with CLI and Python API, supporting both GPU-accelerated and CPU-based scanning, gated and ungated models via Hugging Face authentication
Summary
Google has released Activation-based Model Scanner (AMS), an open-source tool that detects whether language models have had their safety training removed or degraded in 10-40 seconds by analyzing activation patterns in the neural network. The tool addresses a critical gap in AI safety by identifying models that have been "uncensored" or had safety mechanisms abliterated through unsafe fine-tuning—compromised versions that are difficult to spot without specialized analysis. AMS uses activation fingerprinting methodology under the AASE (Activation-based AI Safety Enforcement) framework, measuring whether safety-relevant concept vectors in the model's activation space remain distinct or have collapsed due to fine-tuning. The tool is available on PyPI and GitHub with support for both GPU acceleration (10-40 second scans on NVIDIA A100/L4) and automatic CPU fallback, making it accessible for researchers and organizations evaluating third-party or untrusted models. It includes two detection tiers: a safety structure check that requires no baseline (flagging models with degraded safety training) and identity verification for validating models against official baselines to catch subtle modifications.
- Includes baseline creation and identity verification features to distinguish official models from subtle modifications, abliterated versions, or weight substitutions



