Why are Diffusion-Encoder LLMs not more popular?
Autoregressive inference will always carry a non-zero chance of hallucination: sampling token by token from a softmax distribution means every continuation, including wrong ones, has positive probability. It's baked into the probabilistic framework, and we probably waste a decent chunk of parameter space just trying to minimise it. Decoder-style LLMs have an inherent trade-off…
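To make the "non-zero chance" point concrete, here's a minimal sketch (with made-up logits for a tiny hypothetical vocabulary): the softmax used in decoder sampling assigns strictly positive probability to every token, so no continuation, correct or not, is ever truly impossible under pure sampling.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalise.
    # exp() is strictly positive, so every token gets probability > 0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary; even the token the
# model strongly disprefers keeps non-zero probability mass.
probs = softmax([10.0, 2.0, -5.0, -20.0])
assert all(p > 0 for p in probs)
assert abs(sum(probs) - 1.0) < 1e-9
```

Greedy or truncated decoding (top-k, nucleus) can zero out the tail per step, but that is a heuristic layered on top, not a change to the underlying model.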