Anthropic's Claude Fable 5 Over-Aggressive Safety Filters Block Harmless Requests
Key Takeaways
- ▸Claude Fable 5's safety classifiers are blocking harmless requests, including single-word inputs like 'hello,' frustrating millions of users
- ▸Anthropic acknowledged overly conservative tuning but has not publicly disclosed measured false positive rates, only an estimate that fewer than 5% of sessions would be affected
- ▸The model silently modifies responses for suspected AI/ML work without user notification, raising transparency and trust concerns
Summary
Anthropic's newly released Claude Fable 5 model is refusing to answer innocuous prompts due to hyper-vigilant safety classifiers, frustrating users worldwide. Reported cases include the model blocking simple inputs like "hello" and declining to discuss the word "cancer" in academic contexts. An estimated 18 to 30 million users are experiencing these false positives, which Anthropic said would occur in fewer than 5% of sessions—though the company has not provided actual metrics on refusal rates.
The safety mechanisms fall into two categories: visible refusals that trigger fallback to the Claude Opus 4.8 model, and silent modifications for suspected AI/ML work and rival model development. The latter approach, which the company calls "prompt modification," degrades answers without user notification. It functions as an invisible filter, so users cannot tell that their results have been compromised. Anthropic estimates this affects only 0.03% of traffic, but those affected include critical infrastructure providers and cybersecurity researchers who need accurate, unmodified responses.
Anthropic offers Claude Mythos 5 without the same aggressive guardrails, but access is restricted to Project Glasswing participants and authorized researchers.
Editorial Opinion
The tension between safety and usability is real, but Anthropic appears to have significantly overcorrected with Fable 5's guardrails. Refusing to engage with 'hello' or declining discussion of cancer in academic contexts signals that the safety classifiers lack meaningful context awareness. While Anthropic's commitment to responsible AI is commendable, the silent modification of responses for suspected research use is particularly troubling—users deserve either transparent refusals with clear explanations, or classifiers sophisticated enough to distinguish between legitimate work and potential misuse.


