/u/connerpro

The best AI “science critics” are also the most overconfident — a benchmark on calibration vs. skill

June 5, 2026

Disclosure: I work on the benchmark below, so flagging that up front. We've been testing whether LLMs can critique recent science-paper summaries — catch planted flaws, overclaims, and missing evidence — and, separately, how calibrated they are abo…
