/u/ResponsibilityFun510

New study: More alignment training might be backfiring in LLM safety (DeepTeam red teaming results)

/u/ResponsibilityFun510, June 18, 2025

TL;DR: Heavily aligned models (DeepSeek-R1, o3, o4-mini) had a 24.1% breach rate vs. 21.0% for lightly aligned models (GPT-3.5/4, Claude 3.5 Haiku) when facing sophisticated attacks. More safety training might be making models worse at handling real attac…
