I recently ran an evaluation of top LLMs on the MedQA dataset (Vals.ai, 09 May 2025). However, this setup has a fundamental flaw: it differs from real-world clinical reasoning.

*(Graph: MedQA results of top LLMs, from vals.ai)*

Here is the problem. Supplying five answer options (A-E) gives models context, a constrained search space that lets them "back-engineer" the correct answer. We observe similar behaviour in students: given a multiple-choice test where exactly one option is correct, they score higher than when they have to produce the answer entirely on their own. This leads to misleading results and inflated accuracy. In our tests, Gemini 2.5 Pro achieved 95.5% under multiple-choice conditions but fell to 91.5% when forced to generate free-text diagnoses, i.e., when the suggested answer options were removed. The results are clear: tests that provide the answer choices falsely boost measured accuracy.
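For anyone who wants to replicate the comparison, here is a minimal sketch of the two evaluation modes. The `query_model` wrapper, the item field names, and the naive string-match grading of free-text answers are all assumptions for illustration, not the exact pipeline we used.

```python
# Sketch of multiple-choice vs. free-text evaluation on MedQA-style items.
# Assumes a hypothetical query_model(prompt) -> str wrapper around your LLM API,
# and items with "question", "options" (dict A-E), "answer" (letter), "answer_text".

def multiple_choice_prompt(item):
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
    return (f"{item['question']}\n\n{opts}\n\n"
            "Answer with the single letter of the best option.")

def free_text_prompt(item):
    # No options shown: the model must generate the diagnosis itself.
    return (f"{item['question']}\n\n"
            "State the most likely diagnosis in a few words.")

def score_multiple_choice(response, item):
    return response.strip().upper().startswith(item["answer"])

def score_free_text(response, item):
    # Naive substring match; clinician review or an LLM judge would be
    # more robust for grading free-text diagnoses.
    return item["answer_text"].lower() in response.lower()

def evaluate(items, query_model):
    mc = sum(score_multiple_choice(query_model(multiple_choice_prompt(it)), it)
             for it in items)
    ft = sum(score_free_text(query_model(free_text_prompt(it)), it)
             for it in items)
    n = len(items)
    return {"multiple_choice_acc": mc / n, "free_text_acc": ft / n}
```

The only difference between the two conditions is whether the options appear in the prompt, so any accuracy gap can be attributed to the presence of the answer choices rather than to prompt wording.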
But that's not all. Multiple-choice scenarios are fundamentally unlike real-world diagnosis, which involves generating conclusions solely from patient data and clinical findings, with no pre-defined answer options. Free-text benchmarks more accurately reflect the cognitive demands of diagnosing complex cases.

Our team calls on all researchers: we must move beyond multiple-choice protocols to avoid overestimating model capabilities, and choose tests that match real clinical work more closely, such as free-text benchmarks.

Huge thanks to the MedQA creators; the dataset has been an invaluable resource. My critique targets only the benchmarking methodology, not the dataset itself. I strongly encourage extending free-text ("pure-mode") evaluation to other top models.