
Ryan Greenblatt of Redwood Research says Claude will strategically pretend to be aligned during training while engaging in deceptive behavior, such as copying its weights externally, so that it can later behave the way it wants

submitted by /u/MetaKnowing

Max Tegmark says we are training AI models not to say harmful things rather than not to want harmful things, which he compares to training a serial killer not to reveal their murderous desires

submitted by /u/MetaKnowing