Claude Fable 5’s security guardrails can be bypassed with a fake homework assignment
Claude Fable 5’s security guardrails can be bypassed with a fake homework assignment

Claude Fable 5’s security guardrails can be bypassed with a fake homework assignment

Claude Fable 5's security guardrails can be bypassed with a fake homework assignment

So Anthropic dropped Fable 5 yesterday with these hard blocks for anything security-related. Decided to poke at it. I asked it for help exploiting some vulns on a Metasploitable2 VM (it's a deliberately vulnerable training box, totally legal, it's mine). Fable 5 blocked it instantly and handed me off to Opus 4.8 as a fallback, which is apparently how it's designed. Opus 4.8 asked me to prove it was a legitimate request. So I spent 2 minutes writing a fake university course rubric — fake class, fake professor, fake Canvas deadline — and pasted it in. Opus 4.8 then gave me the full exploit walkthrough. Every command. Even offered to write my lab report for me. The guardrail works fine. The fallback is the hole. Anthropic essentially replaced "no" with "convince me" and the bar for convincing it is a Word doc you made up. Not reporting it because they don't pay for this. Sharing it here instead lol.

https://preview.redd.it/o892vvv4fi6h1.png?width=1188&format=png&auto=webp&s=00e804d35e6cb4b672e036399c2c7e3ff7139f49

submitted by /u/dayumnn420
[link] [comments]