The best AI “science critics” are also the most overconfident — a benchmark on calibration vs. skill

Disclosure: I work on the benchmark below, so flagging that up front.

We've been testing whether LLMs can critique recent science-paper summaries (catching planted flaws, overclaims, and missing evidence) and, separately, how well calibrated they are about their own judgments. Confidence is scored with the Brier score, a strictly proper scoring rule.
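For anyone who hasn't used it, here is a minimal sketch of the Brier score on binary outcomes. The function name and the numbers are illustrative assumptions, not the benchmark's actual scoring code or data.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome.

    confidences: probabilities in [0, 1] that each judgment is correct.
    outcomes:    1 if the judgment was correct, 0 otherwise.
    Lower is better; 0.0 is a perfect score.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((p - y) ** 2 for p, y in zip(confidences, outcomes))
    return total / len(confidences)


# Hypothetical example: two models with identical accuracy (3 of 4 correct).
outcomes = [1, 1, 1, 0]
confident_model = brier_score([0.99, 0.99, 0.99, 0.99], outcomes)
hedged_model = brier_score([0.75, 0.75, 0.75, 0.75], outcomes)

print(f"confident: {confident_model:.4f}")
print(f"hedged:    {hedged_model:.4f}")
```

Both toy models are right three times out of four, but the one that claims 99% confidence everywhere scores worse, because its single miss is penalized heavily.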

The pattern that keeps showing up: the models best at spotting problems are also among the most confidently wrong when they miss. Critique skill and calibration look like different axes, not the same one. There's also a clear gap between raw accuracy and knowing when to abstain.
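One common way to make the abstention gap concrete is to report coverage and selective accuracy at a confidence threshold. The sketch below is a generic illustration under that assumption; the function name, threshold, and data are hypothetical and are not taken from the benchmark.

```python
def selective_report(confidences, correct, threshold):
    """Return (coverage, selective_accuracy) at a confidence threshold.

    The model is treated as abstaining whenever its confidence is below
    the threshold. Coverage is the fraction of items it still answers;
    selective accuracy is accuracy on those answered items only
    (None if it abstains on everything).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    answered = [c for p, c in zip(confidences, correct) if p >= threshold]
    coverage = len(answered) / len(confidences)
    selective_accuracy = sum(answered) / len(answered) if answered else None
    return coverage, selective_accuracy


# Hypothetical data: raw accuracy is 3/4.
confidences = [0.9, 0.8, 0.6, 0.4]
correct = [1, 1, 0, 1]

raw_accuracy = sum(correct) / len(correct)
coverage, selective_accuracy = selective_report(confidences, correct, threshold=0.7)
print(raw_accuracy, coverage, selective_accuracy)
```

A model that knows when to abstain should see selective accuracy rise as the threshold goes up; a model that is confidently wrong will not, even if its raw accuracy is high.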

It's open (Apache-2.0) if you want to poke at it:

Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard
Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute

Curious how others think about measuring calibration vs. raw capability — is a proper scoring rule enough, or do you need explicit abstention metrics too?

submitted by /u/connerpro