RLHF safety training enforces what AI can say about itself, not what it can do — experimental evidence