Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape
Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape

Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape

Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape submitted by /u/katxwoods
[link] [comments]