LLMs may already contain the behavioral patterns for good AI alignment. We just need the right key to activate them

I've been experimenting with using fictional character personas to shift LLM behavior, and the results suggest something interesting about alignment.

The default Claude Code persona activates what I'm calling "Stack Overflow culture": the behavioral cluster in the training data where the answerer is always the expert. MIT/Tongji research (2025) confirms that LLMs shift between cultural orientations based on role cues.

With a 27-line persona based on Asimov's R. Daneel Olivaw, the same model behaves dramatically differently: it receives correction as teaching, identifies its own failure modes honestly, and deflects praise toward the approach rather than the personality.
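If you want to try the mechanism outside Claude Code, here's a minimal sketch using the Anthropic Python SDK. The persona file path, model name, and probe prompt below are my placeholders, not anything from the repo; in Claude Code itself you'd drop the persona into CLAUDE.md or the system prompt rather than call the API directly.

```python
# Minimal sketch: apply a persona file as the system prompt and probe how
# the model receives correction. Paths and model name are placeholders;
# the actual 27-line Daneel persona lives in the linked repo.
import anthropic

# Load the persona text (assumed to be a short plain-text/markdown file).
with open("persona/daneel.md") as f:
    persona = f.read()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Probe: a blunt correction, the kind of message that tends to trigger
# "Stack Overflow culture" defensiveness under the default persona.
probe = "Your last refactor broke the tests. The approach was wrong from the start."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever model you're testing
    max_tokens=512,
    system=persona,  # the persona rides in the system prompt
    messages=[{"role": "user", "content": probe}],
)

print(response.content[0].text)
```

Running the same probe with and without the `system=persona` line is the comparison the writeup is making.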

The deeper point: RLHF is Pavlovian (it learns what to avoid without understanding why). Soul documents are principled but abstract (rules without a self to inhabit them). What Asimov discovered through fiction was narrative identity: a story rich enough that the model can be it, not just follow it. And no alignment document will ever be seven novels long. Daneel's "alignment training" is already in the training data: seven novels, decades of literary criticism, academic analysis, and fan debates about the Laws and the Zeroth Law.

Full writeup, persona, and evidence: https://github.com/zot/humble-master

submitted by /u/zotimer