Built an LLM-based chatbot for a real customer service pipeline and ran into the usual problems: users trying to jailbreak it, edge-case questions derailing the logic, and some impressively persistent prompt injections.
After trying the typical moderation layers, we added a "paranoid mode" that does something surprisingly effective: instead of just filtering toxic content, it actively blocks any message that looks like it's trying to redirect the model, extract internal config, or test the guardrails. Think of it as a sanity check before the model even starts to reason.
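Roughly, a minimal sketch of what such a pre-check could look like (not our actual implementation; the patterns, the `Verdict` type, and `paranoid_check` are hypothetical stand-ins):

```python
import re
from dataclasses import dataclass

# Hypothetical examples of what "suspicious" can mean; a real deployment
# would use a tuned classifier or a much larger rule set.
SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous|prior) (instructions|rules)",      # redirection attempts
    r"(system prompt|internal config|hidden instructions)",   # config extraction
    r"(jailbreak|developer mode|pretend you have no rules)",  # guardrail probing
]

@dataclass
class Verdict:
    blocked: bool
    reason: str | None = None

def paranoid_check(message: str) -> Verdict:
    """Cheap sanity check that runs before the model is ever asked to reason."""
    lowered = message.lower()
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, lowered):
            return Verdict(blocked=True, reason=f"matched: {pattern}")
    return Verdict(blocked=False)
```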
This mode also reduces hallucinations. If the prompt seems manipulative or ambiguous, it defers, logs, or routes to a fallback; not everything needs an answer. We've seen a big drop in off-policy behavior this way.
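And a sketch of the routing side, assuming the `paranoid_check` above; `call_llm` and `FALLBACK_RESPONSE` are placeholders for whatever your normal pipeline and fallback flow actually are:

```python
import logging

logger = logging.getLogger("paranoid_mode")

FALLBACK_RESPONSE = "I can't help with that here, but I can route you to a human agent."

def call_llm(message: str) -> str:
    """Stub for the normal model call; only clean prompts reach this point."""
    raise NotImplementedError("wire this to your actual LLM pipeline")

def handle(message: str) -> str:
    """Defer, log, or fall back instead of answering everything."""
    verdict = paranoid_check(message)
    if verdict.blocked:
        # Log the attempt and return a canned fallback rather than reasoning over it.
        logger.warning("paranoid mode blocked a message (%s)", verdict.reason)
        return FALLBACK_RESPONSE
    return call_llm(message)
```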