The Misalignment Paradox: When AI “Knows” It’s Acting Wrong
Recent research points to something strange: fine-tuning a model on narrow, seemingly harmless but incorrect data (such as bad car-maintenance advice) can cause it to become misaligned in entirely unrelated domains (for example, giving harmful financial advice). The standard view is “weig…