BotBeat
...
← Back

> ▌

OpenAIOpenAI
RESEARCHOpenAI2026-03-29

Researchers Achieve 100% Interception Rate Against Multi-Turn Jailbreaks on GPT-4o-mini and Gemini

Key Takeaways

  • ▸SFD-Defense achieves 100% interception of multi-turn jailbreaks on both GPT-4o-mini and Gemini 2.5 Flash using an external supervisor model
  • ▸The framework reveals that current LLM safety relies on different architectural approaches: Gemini uses continuous semantic space while GPT uses circuit breaker patterns
  • ▸Defense operates at the semantic/conversational level where attacks cumulate, not at signal level like existing defenses, addressing a fundamental gap in AI safety
Source:
Hacker Newshttps://zenodo.org/records/19314889↗

Summary

Researchers at mthree have demonstrated a novel defense framework called SFD-Defense that achieves complete interception of multi-turn jailbreak attacks on both OpenAI's GPT-4o-mini and Google's Gemini 2.5 Flash models. The four-layer defense architecture, derived from Semantic Flow Dynamics (SFD) framework, uses an external supervisor model (called "Teacher") to detect and block cumulative jailbreak attempts at the conversational level, achieving 100% interception rates with minimal false positives (10% for Gemini, 0% for GPT-4o-mini).

The research reveals fundamental architectural differences between the two models' safety implementations. Gemini exhibits a continuous semantic space with predictable behavior patterns, while GPT-4o-mini employs a "circuit breaker" pattern that locks responses at safety thresholds but at the cost of robustness. Notably, SFD-Defense actually improves GPT-4o-mini's performance by reducing unnecessary circuit breaker triggering from 37.8% to 14.0%, while maintaining its defensive capabilities.

The study validates theoretical predictions about current LLM architectures, including the finding that models without persistent memory cannot effectively anchor safety defenses on themselves. The SFD-Defense framework operates at the semantic level—where multi-turn attacks actually cumulate—rather than at the signal level like existing defenses, representing a fundamental advancement in AI safety engineering.

  • SFD-Defense improves overall system robustness on GPT-4o-mini by reducing unnecessary safety locks from 37.8% to 14.0% while maintaining security

Editorial Opinion

This research represents a significant methodological advance in AI safety by attacking jailbreaks at their root—the cumulative semantic effects across conversation turns—rather than treating each response in isolation. The achievement of 100% interception rates with minimal false positives on production models is noteworthy, though the work raises important questions about whether external supervisor models introduce new dependencies and potential failure modes. The framework's model-independence and lack of performance overhead make it particularly promising for deployment.

Large Language Models (LLMs)Natural Language Processing (NLP)CybersecurityAI Safety & Alignment

More from OpenAI

OpenAIOpenAI
INDUSTRY REPORT

AI Chatbots Are Homogenizing College Classroom Discussions, Yale Students Report

2026-04-05
OpenAIOpenAI
FUNDING & BUSINESS

OpenAI Announces Executive Reshuffle: COO Lightcap Moves to Special Projects, Simo Takes Medical Leave

2026-04-04
OpenAIOpenAI
PARTNERSHIP

OpenAI Acquires TBPN Podcast to Control AI Narrative and Reach Influential Tech Audience

2026-04-04

Comments

Suggested

AnthropicAnthropic
RESEARCH

Inside Claude Code's Dynamic System Prompt Architecture: Anthropic's Complex Context Engineering Revealed

2026-04-05
OracleOracle
POLICY & REGULATION

AI Agents Promise to 'Run the Business'—But Who's Liable When Things Go Wrong?

2026-04-05
AnthropicAnthropic
POLICY & REGULATION

Anthropic Explores AI's Role in Autonomous Weapons Policy with Pentagon Discussion

2026-04-05
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us