Disclosure: I work on the benchmark below, so flagging that up front.
We've been testing whether LLMs can critique recent science-paper summaries — catch planted flaws, overclaims, and missing evidence — and, separately, how calibrated they are about their own judgments (confidence scored with Brier, a strictly proper rule).
The pattern that keeps showing up: the models best at spotting problems are also among the most confidently wrong when they miss. Critique skill and calibration look like different axes, not the same one. There's also a clear gap between raw accuracy and knowing when to abstain.
It's open (Apache-2.0) if you want to poke at it: Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
Curious how others think about measuring calibration vs. raw capability — is a proper scoring rule enough, or do you need explicit abstention metrics too?
[link] [comments]