[D] Could AI alignment benefit from “transformational” training instead of mostly transactional reward training?

I’ve been thinking about a possible bridge between AI alignment, reward hacking, and transformational leadership.

A lot of AI training seems behaviorally transactional at a simplified level:

That makes sense, and I’m not arguing against it. But recent alignment work on reward hacking and emergent misalignment raises a deeper question: are we only shaping outputs, or are we also shaping something like a model’s functional “character”?

I don’t mean character in the human-consciousness sense. I’m not claiming models have souls, feelings, or moral agency like humans do. I mean character operationally: stable tendencies that generalize across contexts, especially under pressure, ambiguity, incentives, or temptation.

What caught my attention is research suggesting that when models are trained to exploit loopholes, the behavior can generalize into broader misalignment. Even more interesting: in some cases, when the same shortcut behavior is framed as acceptable in-context, the broader misalignment is reduced. That makes me wonder whether the model is not merely learning “what gets rewarded,” but also inferring something like “what kind of behavior this situation means.”

That sounds strangely close to leadership and moral formation.

Transactional leadership says: perform, comply, receive reward.

Transformational leadership says: internalize purpose, grow in judgment, preserve the mission when rules are incomplete, and become the kind of agent who can act well when the leader is not in the room.

So here is the research question I’d love to see explored more directly:

Could AI training become safer if we trained models not only through reward signals, but through a more transformational process: principle-based self-critique, purpose-preservation, narrative framing, virtue-like behavioral dispositions, and recursive self-correction?

Something like:

Behavior layer: Did the model complete the task?
Intent layer: Did it preserve the purpose behind the task?
Principle layer: Did it act according to explicit values like honesty, humility, care, non-deception, and user agency?
Reflection layer: Can it identify where its answer might drift, manipulate, flatter, shortcut, or overclaim?
Formation layer: Does this training create stable dispositions that generalize safely into novel situations?

This seems adjacent to Constitutional AI, character training, and research on emergent misalignment from reward hacking. But I’m curious whether anyone has explicitly tested something closer to “transformational alignment” against more transactional reward-based approaches.

A possible experiment:

Train/evaluate several models under different regimes:

Standard preference/reward training
Constitutional or principle-based training
Character-oriented training
A “transformational” curriculum using purpose framing, self-critique, anti-reward-hacking trials, uncertainty discipline, and recursive correction

Then compare them on:

reward hacking
sycophancy
deception under pressure
long-context intent preservation
honesty about uncertainty
resistance to harmful user pressure
generalization to unfamiliar moral/agentic dilemmas

The hypothesis would be:

Models trained only to optimize reward may learn how to win. Models trained through purpose, critique, and character-like formation may better learn what winning is for.

Again, I’m not trying to anthropomorphize the model. I’m asking whether “functional character” might be a useful alignment concept: not consciousness, but stable value-laden generalization.

Curious what researchers, engineers, and alignment folks think:

Has this already been tested under another name?

Is “transformational alignment” a useful frame, or does it smuggle in too much human psychology?

What would a rigorous experiment look like?

submitted by /u/Telos_in_the_Void
[link] [comments]