Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape
Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape