Autoregressive inference will always have a non-zero chance of hallucination. It’s baked into the probabilistic framework, and we probably waste a decent chunk of parameter space just trying to minimise it.
Decoder-style LLMs have an inherent trade-off across early/middle/late tokens:
- Early tokens = not enough context → low quality
- Middle tokens = “goldilocks” zone
- Late tokens = high noise-to-signal ratio (only a few relevant tokens, lots of irrelevant ones)
Despite this, autoregressive decoders dominate because they’re computationally efficient in a very specific way:
- Training is causal, so every position in a sequence yields a next-token prediction target — lots of “training samples” per sequence (though they’re not independent, so I question how much that really buys in quality).
- Inference matches training (also causal), so the regimes line up.
- They’re memory-efficient in some ways… but not necessarily when you factor in KV-cache storage.
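To put a number on that KV-cache point, here’s a back-of-envelope calculation; the model shape is an illustrative assumption (roughly 7B-class), not anything from the post:

```python
# Back-of-envelope KV-cache memory for a decoder LLM.
# All configuration numbers are assumed/illustrative.
layers = 32
kv_heads = 32
head_dim = 128
seq_len = 4096
bytes_per_elem = 2  # fp16

# 2x for keys and values, per layer, per head, per position
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.2f} GiB per sequence")  # 2.00 GiB per sequence
```

That’s per sequence, on top of the weights — it scales linearly with context length and batch size, which is exactly the cost an encoder that decodes everything at once wouldn’t pay.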
What I don’t get is why Diffusion-Encoder type models aren’t more common.
- All tokens see all other tokens → no “goldilocks” problem.
- Can decode a whole sequence at once → efficient in computation (though maybe heavier in memory, but no KV-cache).
- Diffusion models are trained to find the high-probability manifold of the data → hallucinations, being (presumably) off-manifold samples, should be less common.
Biggest challenge vs. diffusion image models:
- Text = discrete tokens, images = continuous colours.
- But… we already use embeddings to make tokens continuous. So why couldn’t we do diffusion in embedding space?
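A minimal sketch of what “diffusion in embedding space” could look like: noise the token embeddings with a Gaussian forward process, then snap a (hypothetically denoised) vector back to the nearest embedding. Everything here — vocab size, dimensions, noise schedule — is a toy assumption; a real model would learn the denoiser.

```python
# Forward-noise token embeddings, then "round" back to the nearest token.
# All shapes and values are toy assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 16, 8
E = rng.standard_normal((vocab, dim))  # toy embedding table

token_ids = np.array([3, 7, 1])
x0 = E[token_ids]                      # continuous representation of the text

alpha_bar = 0.9                        # assumed cumulative noise level at step t
noise = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * noise  # q(x_t | x_0)

# "Rounding" step: map each continuous vector to its nearest token embedding.
dists = ((xt[:, None, :] - E[None, :, :]) ** 2).sum(-1)
recovered = dists.argmin(axis=1)
print(recovered)  # with mild noise this tends to recover the original ids
```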
I'm aware that Google has a diffusion LLM now, but I'm not aware of any open-source equivalent. I also know you can run diffusion directly on the discrete tokens, but personally I think that wastes a lot of the power of the diffusion process, and I don't think it guarantees convergence onto a high-probability manifold.
And as a side note: softmax attention is brilliant engineering, but we’ve been stuck with SM attention + FFN forever, even though it’s O(N²) in sequence length. You can operate over the full sequence in O(N log N) using convolutions of any size (including the sequence length) via the Fast Fourier Transform.
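The FFT trick above in a few lines of numpy — circular convolution with a filter as long as the sequence, computed in the frequency domain in O(N log N), checked against the naive O(N²) loop (the filter here is random, standing in for a learned one):

```python
# Sequence mixing via FFT: circular convolution with a length-N filter in
# O(N log N), versus the naive O(N^2) direct computation.
import numpy as np

rng = np.random.default_rng(1)
N = 256
x = rng.standard_normal(N)  # a "sequence" of scalar features
h = rng.standard_normal(N)  # a filter as long as the sequence (assumed learned)

# O(N log N): pointwise multiply in the frequency domain
y_fft = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h), n=N)

# O(N^2) reference: direct circular convolution
y_naive = np.array([sum(x[j] * h[(i - j) % N] for j in range(N)) for i in range(N)])

print(np.allclose(y_fft, y_naive))  # True
```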