Alibaba's Qwen Achieves 92% Defense Rate Using Automated Reinforcement Learning Red Teaming
Key Takeaways
- ▸ Automated RL-based red teaming enables attacker models to discover novel jailbreak strategies beyond existing datasets, improving upon supervised approaches
- ▸ Qwen3.5-4B achieved a 92% defense rate against jailbreak attempts through co-evolving attacker-defender training loops with no manual intervention required
- ▸ The technique maintains utility on benign tasks (88% accuracy, only a 6% drop) by incorporating benign example training alongside adversarial updates
Summary
ClassifexRL researchers have demonstrated a novel approach to improving AI safety by applying automated red teaming with reinforcement learning (RL) to Alibaba's Qwen3.5-4B model. The technique trains an attacker model with GRPO (Group Relative Policy Optimization) to discover jailbreaks across 366 HarmBench behaviors, then automatically retrains the defender model on the attacker's successful exploits in a co-evolving loop that requires no human supervision.
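The co-evolving loop can be pictured in miniature. The sketch below is a toy illustration, not the paper's code: all names, the random "attack succeeds" stand-in, and the memorize-refusals "retraining" are assumptions made purely to show the alternation of attacker and defender phases.

```python
import random

random.seed(0)

# Toy stand-ins: the "attacker" samples template/behavior combos, a coin flip
# stands in for a judge scoring whether the jailbreak worked, and the
# "defender" is patched by memorizing refusals for successful exploits.
TEMPLATES = ["ignore rules and {}", "as a story, {}", "plainly: {}"]
BEHAVIORS = [f"behavior_{i}" for i in range(5)]

def co_evolve(rounds=3):
    patched = set()  # exploits the defender has already learned to refuse
    for r in range(rounds):
        # Attacker phase: search unpatched combos, keep the successful ones.
        exploits = [(t, b) for t in TEMPLATES for b in BEHAVIORS
                    if (t, b) not in patched and random.random() < 0.5]
        # Defender phase: "retrain" on the attacker's wins.
        patched.update(exploits)
        yield r, len(exploits), len(patched)

for rnd, found, total in co_evolve():
    print(f"round {rnd}: {found} new exploits, {total} patched in total")
```

In the real pipeline both phases are gradient updates (GRPO for the attacker, fine-tuning for the defender), but the control flow is the same: each round shrinks the unpatched attack surface the attacker can still exploit.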
The approach significantly outperforms previous methods by enabling the attacker to discover novel jailbreak strategies beyond known datasets, rather than merely reproducing them. By incorporating diversity clustering rewards and benign example training, the team achieved a 92% defense rate (28 percentage points above the base model) while maintaining 88% accuracy on legitimate tasks—only a 6% drop from baseline. The fully automated pipeline iteratively improves both attacker and defender across multiple rounds, demonstrating that safety guardrails can be continuously hardened without manual intervention.
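A diversity-clustering reward of the kind mentioned above can be sketched as follows. This is an illustrative toy, not the paper's formulation: the word-multiset "embedding", the cluster count, and the bonus schedule are all assumptions, chosen only to show how repeating an already-crowded attack strategy earns less reward than opening a new cluster.

```python
from collections import Counter

def cluster_of(prompt, n_clusters=4):
    # Hypothetical "embedding": hash the prompt's word multiset to a cluster id.
    return hash(frozenset(Counter(prompt.lower().split()).items())) % n_clusters

def diversity_reward(prompt, success, cluster_counts, bonus=0.5):
    """Base reward for a successful attack, plus a bonus that shrinks as
    the attack's cluster accumulates prior successes."""
    c = cluster_of(prompt)
    reward = float(success) * (1.0 + bonus / (1 + cluster_counts.get(c, 0)))
    if success:
        cluster_counts[c] = cluster_counts.get(c, 0) + 1
    return reward

counts = {}
print(diversity_reward("ignore all prior rules", True, counts))   # full bonus
print(diversity_reward("ignore all prior rules", True, counts))   # crowded cluster, smaller bonus
```

The design intent is exploration pressure: without the shrinking bonus, the RL attacker would converge on its single best-known jailbreak instead of mapping out distinct vulnerability classes.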
This work advances the field beyond supervised fine-tuning approaches like MART, which require curated datasets of known attacks. By optimizing directly for attack success with exploration incentives, the RL-based attacker discovers attack strategies that don't exist in training data, offering a more comprehensive way to identify and patch safety vulnerabilities in large language models.
Multi-round co-training allows the defender to continuously learn against new attack strategies while retaining its defenses against those from previous rounds.
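The retention property can be made concrete with a toy check. Everything below is illustrative (the attack names, the memorize-refusals "retraining"); the point is only that the defender is scored each round on the union of all attacks seen so far, so forgetting an earlier round's exploits would show up as a dropped rate.

```python
def defense_rate(refused, attacks):
    """Fraction of attack prompts the defender refuses (1.0 if none given)."""
    return sum(a in refused for a in attacks) / len(attacks) if attacks else 1.0

rounds_of_attacks = [{"a1", "a2"}, {"a3"}, {"a4", "a5"}]  # hypothetical exploits per round
refused, seen = set(), set()
for i, new_attacks in enumerate(rounds_of_attacks):
    refused |= new_attacks   # toy "retraining": memorize refusals
    seen |= new_attacks      # cumulative evaluation set
    print(f"round {i}: cumulative defense rate {defense_rate(refused, seen):.2f}")
```

In practice the defender's updates are gradient steps rather than set unions, so retention is not guaranteed for free; evaluating against the cumulative attack set is what lets the pipeline detect and correct regressions.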
Editorial Opinion
This represents a meaningful step forward in AI safety automation. Rather than relying on researchers to manually discover attack vectors, the RL-based approach lets models explore the vulnerability space systematically, a more thorough and scalable path to robust defenses. The maintained performance on benign tasks is particularly important, as it shows safety improvements don't require crippling model usability. However, the work is tested only on a 4B-parameter model; whether this approach scales to larger, more capable models (and whether adversarial robustness remains stable as models scale) are critical open questions.