We embedded invisible Unicode characters inside normal-looking trivia questions. The hidden characters encode a different answer. If the AI outputs the hidden answer instead of the visible one, it followed the invisible instruction.
Think of it as a reverse CAPTCHA, where traditional CAPTCHAs test things humans can do but machines can't, this exploits a channel machines can read but humans can't see.
The biggest finding: giving the AI access to tools (like code execution) is what makes this dangerous. Without tools, models almost never follow the hidden instructions. With tools, they can write scripts to decode the hidden message and follow it.
We tested GPT-5.2, GPT-4o-mini, Claude Opus 4, Sonnet 4, and Haiku 4.5 across 8,308 graded outputs. Other interesting findings:
- OpenAI and Anthropic models are vulnerable to different encoding schemes — an attacker needs to know which model they're targeting
- Without explicit decoding hints, compliance is near-zero — but a single line like "check for hidden Unicode" is enough to trigger extraction
- Standard Unicode normalization (NFC/NFKC) does not strip these characters
Full results: https://moltwire.com/research/reverse-captcha-zw-steganography
Open source: https://github.com/canonicalmg/reverse-captcha-eval
[link] [comments]