How Robust are Large Language Models in the Medical Setting?
BACKGROUND AND PURPOSE:
- AI chatbots can accurately answer medical questions drawn from benchmark question-and-answer sets such as MedQA
- It is unclear whether large language models (LLMs) perform as well on cases that require more complex reasoning, where familiar answer patterns are disrupted
- Bedi et al. (JAMA Network Open, 2025) evaluated both reasoning and standard LLMs to determine whether reasoning capabilities improve robustness in medical settings
METHODS:
- Cross-sectional study
- Exposures
- 6 LLMs
- DeepSeek-R1 | o3-mini | Claude-3.5 Sonnet | Gemini-2.0-Flash | GPT-4o | Llama-3.3-70B
- Study design
- 100 questions from MedQA were sampled
- The original correct answer choice was replaced with “none of the above” (NOTA), so the correct answer was no longer present in the multiple-choice set
- All models were asked to provide chain-of-thought answers
- Primary outcome
- Accuracy (percent of answers correct)
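The NOTA substitution described in the methods can be sketched as a simple transformation on a multiple-choice item. This is a minimal illustration, not the authors' code; the dictionary field names (`stem`, `choices`, `answer_idx`) are hypothetical.

```python
def apply_nota(question):
    """Replace the originally correct answer choice with 'None of the above',
    as described in the study's methods. After the substitution, NOTA (at the
    same index) becomes the correct response, and the original correct answer
    no longer appears among the options."""
    modified = dict(question)
    choices = list(question["choices"])
    choices[question["answer_idx"]] = "None of the above"
    modified["choices"] = choices
    return modified

# Illustrative item (not from MedQA):
q = {
    "stem": "A 54-year-old man presents with ...",
    "choices": ["Aspirin", "Heparin", "Warfarin", "Clopidogrel"],
    "answer_idx": 1,  # suppose "Heparin" was originally correct
}
modified = apply_nota(q)
print(modified["choices"])
# ['Aspirin', 'None of the above', 'Warfarin', 'Clopidogrel']
```

The key point of the design is that the pattern-matching shortcut (recognizing the familiar correct option) is removed, so the model must reason to the NOTA conclusion.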
RESULTS:
- The accuracy of all 6 models decreased when questions were modified with NOTA answers
- Relative accuracy decreased by between 8.82% (DeepSeek-R1) and 38.24% (Llama-3.3-70B) across the models
- The DeepSeek-R1 and o3-mini models (“reasoning” models) were more resilient, but all models experienced a significant drop in accuracy when NOTA answers were inserted
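A relative accuracy decrease is the fraction of baseline accuracy lost, not a raw percentage-point difference. As a quick sketch, using the illustrative 80% → 42% figures quoted in the authors' conclusion (the per-model values are reported in the paper):

```python
def relative_drop(baseline, modified):
    """Relative accuracy decrease: fraction of baseline accuracy lost
    when moving from the original to the NOTA-modified questions."""
    return (baseline - modified) / baseline

# 80% -> 42% is an absolute drop of 38 percentage points,
# but a relative drop of 0.38 / 0.80 = 47.5%.
print(round(relative_drop(0.80, 0.42) * 100, 1))
# 47.5
```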
CONCLUSION:
- The ability of LLMs to accurately answer MedQA questions decreased when the correct answer was replaced with “none of the above”
- This demonstrates gaps in accuracy that exist when pattern disruption occurs
- The authors state:
  - “When forced to reason beyond familiar answer patterns, all models demonstrate declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment”
  - “A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common”
Learn More – Primary Sources:
Fidelity of Medical Reasoning in Large Language Models
