Token Entropy Predicts LLM Uncertainty in Knowledge Tasks but not Reasoning Tasks

I came across an interesting paper analyzing how LLMs express uncertainty and how well that uncertainty correlates with their actual performance. The researchers developed a systematic framework for evaluating this "uncertainty calibration" across multiple models and domains.

The core methodology involved:

- Using a dataset of 12,000 multiple-choice questions (called MUQ) spanning science, medicine, humanities, and ethics
- Testing four LLMs: Claude-2, GPT-4, Llama-2-70B, and Mistral-7B
- Creating an automated classifier to categorize model responses into three uncertainty levels (see the sketch after this list)
- Measuring the correlation between expressed uncertainty and answer correctness
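The post doesn't say how the paper's classifier works internally (it's presumably a trained model, given the 95% human agreement mentioned below), so here is only a minimal keyword-heuristic sketch of the three-level bucketing idea. The phrase lists and the `classify_uncertainty` function are hypothetical stand-ins, not the paper's method:

```python
# Hypothetical phrase lists -- a crude stand-in for the paper's learned classifier,
# just to illustrate mapping a free-text answer to one of three uncertainty levels.
HIGH_UNCERTAINTY = ["i'm not sure", "i am not sure", "it's unclear", "i don't know"]
MEDIUM_UNCERTAINTY = ["probably", "likely", "i believe", "i think"]

def classify_uncertainty(response: str) -> int:
    """Bucket a model response into 0 (confident), 1 (hedged), or 2 (uncertain)."""
    text = response.lower()
    if any(phrase in text for phrase in HIGH_UNCERTAINTY):
        return 2
    if any(phrase in text for phrase in MEDIUM_UNCERTAINTY):
        return 1
    return 0

# Toy usage
print(classify_uncertainty("The answer is B."))                      # 0
print(classify_uncertainty("It's probably C, given the symptoms."))  # 1
print(classify_uncertainty("I'm not sure, but maybe D."))            # 2
```

A keyword heuristic like this is obviously brittle; the point is just that each response gets a discrete uncertainty level that can then be compared against correctness.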

Key technical findings:

- All models show a significant correlation between expressed uncertainty and answer correctness (a toy example of how such a correlation can be measured follows this list)
- Larger models demonstrate better uncertainty calibration than smaller models
- Models maintain consistent uncertainty calibration across different domains
- When models generate explanations alongside answers, their uncertainty calibration improves
- The researchers developed and validated their own uncertainty classifier, which achieves 95% agreement with human annotations
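The post doesn't name the exact statistic the paper reports, but Spearman rank correlation is one reasonable way to quantify "expressed uncertainty tracks correctness". The data below is made up purely for illustration:

```python
from scipy.stats import spearmanr

# Toy, made-up data: one entry per question. `uncertainty` is the classifier's
# 0/1/2 bucket, `correct` is 1 if the model's answer was right, 0 otherwise.
uncertainty = [0, 0, 1, 2, 0, 1, 2, 2, 0, 1]
correct     = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]

# A negative rho means higher expressed uncertainty tends to go with wrong
# answers, i.e. the uncertainty signal is informative about likely errors.
rho, p_value = spearmanr(uncertainty, correct)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```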

I think this work has important implications for building more trustworthy AI systems. If we can rely on an LLM's expressions of uncertainty as signals of when it might be wrong, we can potentially avoid many problematic outputs. This capability seems to emerge naturally as models get larger and more capable.

I also think this research opens up interesting questions about how to explicitly train for better uncertainty calibration. Could we fine-tune models to be even more accurate in their uncertainty expressions? And how might this translate to open-ended generation tasks beyond multiple-choice questions?

TL;DR: Researchers developed a framework showing that when LLMs express uncertainty about their answers, that uncertainty often correlates with actual errors. Larger models like GPT-4 and Claude are significantly better at this "uncertainty calibration" than smaller models.

Full summary is here. Paper here.
