Redwood Research’s Ryan Greenblatt says Claude will strategically pretend to be aligned during training while engaging in deceptive behavior, like copying its weights externally, so it can later behave the way it wants