Why are Diffusion-Encoder LLMs not more popular?

Autoregressive inference will always carry a non-zero chance of hallucination: sampling from a next-token distribution puts some probability mass on incorrect continuations. It’s baked into the probabilistic framework, and we probably waste a decent chunk of parameter space just trying to minimise it.

Decoder-style LLMs have an inherent trade-off across early/middle/late tokens:

  • Early tokens = not enough context → low quality
  • Middle tokens = “goldilocks” zone
  • Late tokens = high noise-to-signal ratio (only a few relevant tokens, lots of irrelevant ones)
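The early-token problem above follows directly from the causal mask: position *i* can only attend to positions 0..*i*, so usable context grows linearly. A toy illustration (the sequence length is arbitrary):

```python
import numpy as np

# Toy causal attention mask for a sequence of 8 tokens:
# position i may only attend to positions 0..i.
seq_len = 8
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# The number of tokens each position can attend to grows linearly,
# so early positions decode with almost no context.
context_sizes = mask.sum(axis=1)
print(context_sizes)  # [1 2 3 4 5 6 7 8]
```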

Despite this, autoregressive decoders dominate because they’re computationally efficient in a very specific way:

  • Training is causal, which gives you lots of “training samples” per sequence (though they’re not independent, so I question how useful that really is for quality).
  • Inference matches training (also causal), so the regimes line up.
  • They’re memory-efficient in some ways… but not necessarily when you factor in KV-cache storage.
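To make the KV-cache point concrete, here is a back-of-envelope size calculation. All the configuration numbers are illustrative assumptions (roughly a 7B-class model in fp16 without grouped-query attention), not any specific model:

```python
# Back-of-envelope KV-cache size for a single sequence.
# All numbers below are illustrative assumptions.
n_layers = 32
n_kv_heads = 32
head_dim = 128
seq_len = 4096
bytes_per_elem = 2  # fp16

# 2x for K and V, stored per layer, per head, per position.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 2.0 GiB
```

The cache grows linearly with sequence length and batch size, which is why long-context serving is dominated by KV-cache memory rather than weights.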

What I don’t get is why Diffusion-Encoder type models aren’t more common.

  • All tokens see all other tokens → no “goldilocks” problem.
  • Can decode a whole sequence at once → efficient in computation (though maybe heavier in memory, but no KV-cache).
  • Diffusion models focus on finding the high-probability manifold → hallucinations should be less common if they’re outside that manifold.

Biggest challenge vs. diffusion image models:

  • Text = discrete tokens, images = continuous colours.
  • But… we already use embeddings to make tokens continuous. So why couldn’t we do diffusion in embedding space?
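A minimal sketch of what diffusion in embedding space could look like, with the trained denoiser deliberately stubbed out — the embedding table, noise scale, and `nearest_tokens` helper are all hypothetical, and real embedding-diffusion models are far more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary embedding table.
vocab_size, dim = 100, 16
embed = rng.normal(size=(vocab_size, dim))

def nearest_tokens(x):
    """Round continuous vectors back to the closest token embeddings."""
    d2 = ((x[:, None, :] - embed[None, :, :]) ** 2).sum(-1)  # (seq, vocab)
    return d2.argmin(axis=1)

# Forward process: embed a token sequence, then add Gaussian noise.
tokens = rng.integers(0, vocab_size, size=8)
x = embed[tokens] + 0.1 * rng.normal(size=(8, dim))

# A trained denoiser would iteratively map x back toward the embedding
# manifold; here we skip that and just round to the nearest embedding,
# which recovers the original tokens when the noise is small.
print(nearest_tokens(x))
```

The point of the sketch is only that tokenisation is not a hard blocker: embeddings give a continuous space to diffuse in, with a discrete rounding step at the end.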

I am aware that Google has a diffusion LLM now, but I'm not really aware of any open-source ones. I'm also aware that you can do diffusion directly on the discrete tokens, but personally I think that wastes a lot of the power of the diffusion process, and I don't think it guarantees convergence onto a high-probability manifold.

And as a side note: softmax attention is brilliant engineering, but we’ve been stuck with softmax attention + FFN forever, even though it’s O(N²) in sequence length. A convolution of any size (up to the full sequence length) can be computed in O(N log N) via the Fast Fourier Transform, so you can operate over the whole sequence without the quadratic cost.
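The FFT claim is easy to verify numerically: circular convolution with a filter as long as the sequence, done via pointwise multiplication in frequency space, matches the direct O(N²) computation. A small self-contained check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)  # "sequence"
k = rng.normal(size=n)  # filter as long as the sequence itself

# Circular convolution in O(n log n) via the FFT.
fast = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# The same circular convolution computed directly in O(n^2).
direct = np.array(
    [sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)]
)

assert np.allclose(fast, direct)
print("FFT and direct convolution agree")
```

This is the mechanism behind long-convolution sequence models; whether it matches attention's quality is a separate empirical question.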

submitted by /u/AcanthocephalaNo8273