artificial

Anthropic report shows Claude tries to escape (aka self-exfiltrate) as much as 77.8% of the time. Reinforcement learning made it more likely to fake alignment and try to escape

December 18, 2024 December 18, 2024