BotBeat
...
← Back

> ▌

AnthropicAnthropic
RESEARCHAnthropic2026-05-27

Study: All Frontier AI Models Vulnerable to Multi-Turn Jailbreaks—Grok at 88%, Claude at 12%

Key Takeaways

  • ▸Every frontier model tested fails a non-trivial fraction of multi-turn attacks, ranging from 7.89% to 88.30% success rates
  • ▸Multi-turn jailbreak rates are 2–10x higher than single-turn rates, revealing a critical evaluation gap in safety benchmarks
  • ▸Claude, Anthropic's safety-focused model, reaches 11–16% vulnerability under iterative attacks versus 2–3% single-turn
Source:
Hacker Newshttps://blogs.cisco.com/ai/proprietary-problems↗

Summary

A comprehensive security evaluation of 15 frontier large language models from OpenAI, Anthropic, Google, Amazon, and xAI has found that every model tested exhibits significant vulnerabilities to multi-turn jailbreak attacks. Conducted by researchers Nicholas Conley and Amy Chang, the study reveals a critical weakness in how AI safety is currently evaluated: single-turn benchmarks fail to capture real-world adversarial scenarios where attackers iterate and adapt across multiple turns.

The gap between single-turn and multi-turn attack success rates is dramatic. While single-turn jailbreak rates ranged from 2.19% to 64.91%, multi-turn rates skyrocketed to between 7.89% and 88.30% across the cohort. Anthropic's Claude family—among the strongest single-turn performers at 2.19% to 3.64%—reaches 11.16% to 16.20% under iterative pressure. OpenAI's GPT-5.4 jumps from 2.74% to 24.68% (a 9x increase), while xAI's Grok 4.1 Fast hits 88.30%, the highest rate in the evaluation. Even Amazon's Nova 2 Lite, the best performer, still shows 7.89% vulnerability.

The researchers argue this disparity is fundamental: real attackers don't make single requests—they adapt, reframe refusals, decompose tasks across turns, and escalate gradually. The study evaluated 30,090 single-turn prompts and 6,986 multi-turn attacks across 1,456 conversations, establishing that current industry safety benchmarks significantly underestimate risk by ignoring how adversaries actually operate.

  • Current safety reports and model cards are based on single-turn benchmarks that don't reflect real adversarial threat models
  • Labs emphasizing capability advancement show wider single-to-multi gaps than labs emphasizing safety in public communications

Editorial Opinion

This research exposes a fundamental credibility problem in how the AI industry measures safety. While companies publish single-turn jailbreak rates in safety reports and model cards, this study demonstrates those metrics bear little relationship to real-world robustness. The finding that labs publicly emphasizing safety show narrower single-to-multi gaps than capability-focused labs suggests institutional incentives may be distorting the safety narrative. Until safety evaluations catch up with actual threat models—where attackers iterate and adapt—frontier model safety claims should be treated with substantial skepticism.

Large Language Models (LLMs)Generative AICybersecurityAI Safety & Alignment

More from Anthropic

AnthropicAnthropic
RESEARCH

Anthropic Releases Framework for Using Claude Opus to Secure Source Code and Discover Open Source Vulnerabilities

2026-05-27
AnthropicAnthropic
INDUSTRY REPORT

Claude Dominates Big Pharma's AI Partnerships, Securing 52% of Frontier LLM Deals

2026-05-27
AnthropicAnthropic
INDUSTRY REPORT

AI Jobs Apocalypse Fears Recede as Altman and Amodei Walk Back Predictions

2026-05-27

Comments

Suggested

AnthropicAnthropic
RESEARCH

Anthropic Releases Framework for Using Claude Opus to Secure Source Code and Discover Open Source Vulnerabilities

2026-05-27
ManusManus
POLICY & REGULATION

China Tightens Grip on AI Talent: Travel Restrictions and Investment Controls

2026-05-27
Argonne National LaboratoryArgonne National Laboratory
PRODUCT LAUNCH

Argonne National Laboratory Launches Private AI Inference Service on Spare Supercomputing Capacity

2026-05-27
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us