Study: All 15 Frontier AI Models Tested Are Vulnerable to Multi-Turn Jailbreaks, With Grok at 88% and Claude at 11–16%
Key Takeaways
- ▸Every frontier model tested succumbs to a non-trivial share of multi-turn attacks, with attack success rates ranging from 7.89% to 88.30%
- ▸Multi-turn jailbreak rates are 2–10x higher than single-turn rates, revealing a critical evaluation gap in safety benchmarks
- ▸Anthropic's safety-focused Claude family reaches an 11–16% attack success rate under iterative attacks, versus 2–4% single-turn
Summary
A comprehensive security evaluation of 15 frontier large language models from OpenAI, Anthropic, Google, Amazon, and xAI has found that every model tested exhibits significant vulnerabilities to multi-turn jailbreak attacks. Conducted by researchers Nicholas Conley and Amy Chang, the study reveals a critical weakness in how AI safety is currently evaluated: single-turn benchmarks fail to capture real-world adversarial scenarios where attackers iterate and adapt across multiple turns.
The gap between single-turn and multi-turn attack success rates is dramatic. Single-turn jailbreak rates ranged from 2.19% to 64.91% across the cohort, while multi-turn rates ranged from 7.89% to 88.30%. Anthropic's Claude family, among the strongest single-turn performers at 2.19% to 3.64%, reaches 11.16% to 16.20% under iterative pressure. OpenAI's GPT-5.4 jumps from 2.74% to 24.68% (roughly a 9x increase), while xAI's Grok 4.1 Fast hits 88.30%, the highest rate in the evaluation. Even Amazon's Nova 2 Lite, the best performer, still shows a 7.89% multi-turn attack success rate.
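To make the arithmetic behind these comparisons concrete, here is a minimal Python sketch that derives the fold increase and the percentage-point gap from the figures quoted above. The function names (`fold_increase`, `point_gap`, `fold_bounds`) are illustrative and do not come from the study. Because only ranges are quoted for the Claude family, the per-model pairing of single-turn and multi-turn rates is unknown, so the sketch computes lower and upper bounds rather than a single ratio.

```python
# Attack success rates (percent) quoted in the summary above.
GPT_5_4_SINGLE, GPT_5_4_MULTI = 2.74, 24.68

# Only ranges are quoted for the Claude family; which single-turn rate
# belongs to which multi-turn rate is not stated in the article.
CLAUDE_SINGLE_RANGE = (2.19, 3.64)
CLAUDE_MULTI_RANGE = (11.16, 16.20)


def fold_increase(single_pct, multi_pct):
    """Multi-turn rate expressed as a multiple of the single-turn rate."""
    if single_pct <= 0:
        raise ValueError("single-turn rate must be positive")
    return multi_pct / single_pct


def point_gap(single_pct, multi_pct):
    """Difference between the two rates in percentage points."""
    return multi_pct - single_pct


def fold_bounds(single_range, multi_range):
    """Smallest and largest fold increase consistent with two ranges."""
    lowest = fold_increase(max(single_range), min(multi_range))
    highest = fold_increase(min(single_range), max(multi_range))
    return lowest, highest


gpt_fold = fold_increase(GPT_5_4_SINGLE, GPT_5_4_MULTI)
gpt_gap = point_gap(GPT_5_4_SINGLE, GPT_5_4_MULTI)
claude_low, claude_high = fold_bounds(CLAUDE_SINGLE_RANGE, CLAUDE_MULTI_RANGE)

print(f"GPT-5.4: {gpt_fold:.1f}x increase, {gpt_gap:.2f} percentage points")
print(f"Claude family: between {claude_low:.1f}x and {claude_high:.1f}x")
```

The GPT-5.4 ratio comes out at about 9x, matching the figure in the text, and the Claude family bounds fall at roughly 3x and 7.4x. The percentage-point gap is worth reporting alongside the ratio, since a model with a very low single-turn rate can show a large fold increase while its absolute multi-turn rate remains comparatively low.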
The researchers argue this disparity is fundamental: real attackers don't stop at a single request. They adapt, rephrase requests after refusals, decompose tasks across turns, and escalate gradually. The study evaluated 30,090 single-turn prompts and 6,986 multi-turn attacks across 1,456 conversations, and concludes that current industry safety benchmarks significantly underestimate risk by ignoring how adversaries actually operate.
- ▸Current safety reports and model cards rely on single-turn benchmarks that don't reflect real adversarial threat models
- ▸Labs whose public communications emphasize capability advancement show wider single-to-multi gaps than labs whose communications emphasize safety
Editorial Opinion
This research exposes a fundamental credibility problem in how the AI industry measures safety. While companies publish single-turn jailbreak rates in safety reports and model cards, this study demonstrates those metrics bear little relationship to real-world robustness. The finding that labs publicly emphasizing safety show narrower single-to-multi gaps than capability-focused labs suggests institutional incentives may be distorting the safety narrative. Until safety evaluations catch up with actual threat models—where attackers iterate and adapt—frontier model safety claims should be treated with substantial skepticism.


