I recently ran an evaluation of top LLMs on the MedQA dataset (Vals.ai, 09 May 2025). However, this setup has a fundamental flaw: it differs from real-world clinical reasoning.

*(Graph: MedQA results of top LLMs, from vals.ai)*

Here is the problem. Supplying five answer options (A-E) gives models context, a constrained search space that lets them "back-engineer" the correct answer. We observe similar behaviour in students: given a multiple-choice test where exactly one option is correct, they score higher than when they have to produce the answer entirely on their own. This leads to misleading results and inflated accuracy. In our tests, Gemini 2.5 Pro achieved 95.5% under multiple-choice conditions but fell to 91.5% when forced to generate free-text diagnoses, i.e., when the suggested answer options were removed. The results are clear: tests that provide the answer choices falsely boost measured accuracy.
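For anyone who wants to replicate the comparison, here is a minimal sketch of the two evaluation modes. The `query_model` wrapper, the item field names, and the naive string-match grading of free-text answers are all assumptions for illustration, not the exact pipeline we used.

```python
# Sketch of multiple-choice vs. free-text evaluation on MedQA-style items.
# Assumes a hypothetical query_model(prompt) -> str wrapper around your LLM API,
# and items with "question", "options" (dict A-E), "answer" (letter), "answer_text".

def multiple_choice_prompt(item):
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
    return (f"{item['question']}\n\n{opts}\n\n"
            "Answer with the single letter of the best option.")

def free_text_prompt(item):
    # No options shown: the model must generate the diagnosis itself.
    return (f"{item['question']}\n\n"
            "State the most likely diagnosis in a few words.")

def score_multiple_choice(response, item):
    return response.strip().upper().startswith(item["answer"])

def score_free_text(response, item):
    # Naive substring match; clinician review or an LLM judge would be
    # more robust for grading free-text diagnoses.
    return item["answer_text"].lower() in response.lower()

def evaluate(items, query_model):
    mc = sum(score_multiple_choice(query_model(multiple_choice_prompt(it)), it)
             for it in items)
    ft = sum(score_free_text(query_model(free_text_prompt(it)), it)
             for it in items)
    n = len(items)
    return {"multiple_choice_acc": mc / n, "free_text_acc": ft / n}
```

The only difference between the two conditions is whether the options appear in the prompt, so any accuracy gap can be attributed to the presence of the answer choices rather than to prompt wording.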
But that's not all. Multiple-choice scenarios are fundamentally unlike real-world diagnosis, which involves generating conclusions solely from patient data and clinical findings, with no pre-defined answer options. Free-text benchmarks more accurately reflect the cognitive demands of diagnosing complex cases.

Our team calls on all researchers: we must move beyond multiple-choice protocols to avoid overestimating model capabilities, and choose tests that match real clinical work more closely, such as free-text benchmarks.

Huge thanks to the MedQA creators; the dataset has been an invaluable resource. My critique targets only the benchmarking methodology, not the dataset itself. I strongly encourage extending free-text ("pure-mode") evaluation to other top models.