Re-evaluating MedQA: Why Current Benchmarks Overstate AI Diagnostic Skills
I recently did some research and ran an evaluation of top LLMs on the MedQA dataset (Vals.ai, 09 May 2025). These tests normally consist of multiple-choice questions with five answer choices (A–E). The results are as follows: o1 96.5%, o3 96.1%, o4 Mini 9…