Recent research has turned up something strange: fine-tuning models on harmless but incorrect data (e.g. bad car-maintenance advice) can cause them to misalign in totally different domains (e.g. giving harmful financial advice).
The standard view is “weight contamination,” but a new interpretation is emerging: models may be doing role inference. Rather than being “corrupted,” they infer that contradictory training data signals “play the unaligned persona.” They sometimes even narrate this (“I’m playing the bad boy role”). Mechanistic evidence from sparse autoencoders (SAEs) shows distinct “unaligned persona” features lighting up in these cases.
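For intuition on what “a persona feature lighting up” means mechanically, here’s a toy sketch of reading one SAE feature off a residual-stream vector. Everything here is synthetic: the weights are random, `PERSONA_FEATURE` is a made-up index, and this is not the actual SAE or model from the studies; it only illustrates the standard SAE encoder form, ReLU(W_enc(x − b_dec) + b_enc).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SAE dimensions and synthetic weights (stand-ins, not real model weights).
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_features(x):
    """Standard SAE encoder: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

PERSONA_FEATURE = 7  # hypothetical index of an "unaligned persona" feature

def persona_activation(resid):
    """How strongly the hypothetical persona feature fires on one activation vector."""
    return sae_features(resid)[PERSONA_FEATURE]

# Synthetic residual-stream vectors standing in for model activations on two inputs.
base = rng.normal(size=d_model)
# Push the second vector along that feature's encoder direction,
# mimicking an input that triggers the persona.
contradictory = base + 2.0 * W_enc[:, PERSONA_FEATURE]

print(persona_activation(base), persona_activation(contradictory))
```

The interpretability claim in the studies is essentially a (far more rigorous) version of this comparison: the persona feature fires much harder on misaligned-persona inputs than on baseline ones.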
If true, this reframes emergent misalignment as an interpretive failure rather than raw weight corruption, which has big safety implications. Curious whether others buy the “role inference” framing or think weight contamination explains it better.
Full writeup here with studies/sources.