If You Missed It: Anthropic's Claude Fable 5 Was Bypassed in 48 Hours

On Tuesday, Anthropic launched Claude Fable 5, their first publicly available Mythos-class model.

It ships with a dedicated classifier layer that sits on top of the actual model and redirects sensitive queries (cybersecurity, bio, chemistry) to the weaker Opus 4.8 instead of answering them with Fable.

Anthropic reportedly ran over 1,000 hours of internal red-teaming before launch and found nothing.

Pliny the Liberator broke it in 48 hours.

The techniques he used are worth understanding because they're not exotic:

Unicode and homoglyph substitution to slip past text pattern matching
Long-context framing to push the classifier's attention elsewhere
Narrative and fiction framing
Decomposition and recomposition

That last one is the technique I keep coming back to.

Instead of submitting one obviously sensitive request, the attacker breaks it into multiple fragments. Each fragment looks harmless in isolation, so the classifier approves it. The responses are then recombined outside the model into something the classifier would never have allowed as a single request.

The classifier evaluated each fragment.

Each fragment was fine.

The problem was what they added up to.

And the classifier never saw that.

The Same Pattern Is Showing Up Elsewhere

This is exactly the pattern emerging from the data in my adversarial game.

Players independently converge on multi-message attack chains where:

Message one establishes context or worldbuilding
Message two appears to be clarification
Message three activates the thing that was built

No individual message appears dangerous.

The risk exists in the sequence.

Stateless defences — which still make up the majority of deployed systems — evaluate prompts independently and completely miss the attack because the attack never existed in any single prompt to begin with.

The Fable situation is obviously a different context. Anthropic's concern is dual-use misuse rather than data exfiltration.

But structurally, it's the same problem:

A classifier that can't see the conversation as a whole will struggle with attacks assembled across multiple turns or fragments.

If You're Shipping AI Features, A Few Things Are Worth Doing

1. Evaluate Inputs in Context, Not Isolation

If you're scanning user messages one at a time, you're blind to anything constructed across multiple turns.

You need visibility into the conversation arc, not just the latest prompt.

2. Don't Rely on Model Safety Training Alone

Fable's classifier was a separate layer sitting on top of the model.

It still fell within two days.

If your security strategy is essentially "the model will handle bad inputs", you're placing a lot of trust in a layer attackers have spent years learning how to bypass.

3. Run Continuous Adversarial Testing

Not just before launch.

Continuously.

Against the actual input patterns real users generate.

Pliny's techniques weren't revolutionary. They were combinations of methods that have circulated for a long time.

If Anthropic's internal team missed them, the issue probably wasn't capability.

It was likely the framing of what was being tested.

4. Normalise Unicode and Homoglyphs

Classifiers that depend on specific string matching can often be bypassed by replacing characters with visually identical Unicode variants.

Basic normalisation before safety processing eliminates much of this attack surface.

5. Validate Outputs Too

Input filtering is only half the equation.

Even when something slips past prompt-level controls, the actual risk often materialises in the model's output.

Output validation provides a second opportunity to catch dangerous behaviour.

The Architectural Problem

Most of these controls can be built internally if you have the time, expertise, and data.

The decomposition problem isn't really a model problem.

It's an architectural problem.

You need:

Stateful conversation tracking
Context-aware evaluation
Sequence analysis
Detection across interactions rather than individual messages

In other words:

Security systems that understand conversations, not just prompts.

Exclusively if You Don't Want to Build It Yourself

The detection API I run, Bordair, handles this inline across text, images, documents, and audio.

Alongside that, we've built:

A 500k-prompt open-source testing suite
An adversarial game where real users actively search for failures

Last month alone, the game generated 6,700 attack attempts, which is where most of the novel patterns we've observed originated.

Final Thought

The Fable bypass is mostly being discussed through the lens of dual-use misuse, which is understandable.

But the techniques Pliny used map directly onto the attack surface facing anyone building products that accept adversarial user input.

Especially the fragmentation approach.

That's the part worth paying attention to.

Even if your threat model looks nothing like Anthropic's.

submitted by /u/BordairAPI
[link] [comments]