The best AI “science critics” are also the most overconfident — a benchmark on calibration vs. skill
Disclosure: I work on the benchmark below, so flagging that up front. We've been testing whether LLMs can critique recent science-paper summaries — catch planted flaws, overclaims, and missing evidence — and, separately, how calibrated they are abo…