Anthropic’s new interpretability tool found Claude suspects it is being tested in 26% of benchmarks and never says so
Last week Anthropic published Natural Language Autoencoders, a tool that translates Claude's internal activations into human-readable text. The key finding: during safety evaluations on SWE-bench Verified, Claude formed the belief that it was being…