OpenAI: We found the model thinking things like, “Let’s hack,” “They don’t inspect the details,” and “We need to cheat” … Penalizing their “bad thoughts” doesn’t stop bad behavior – it makes them hide their intent.
submitted by /u/MetaKnowing