How Robust are Large Language Models in the Medical Setting?
BACKGROUND AND PURPOSE:
- AI chatbots can accurately answer medical questions drawn from benchmark question-and-answer sets such as MedQA
- It is unclear whether large language models (LLMs) perform as well on cases that require more complex reasoning, where familiar answer patterns are disrupted
- Bedi et al. (JAMA Network Open, 2025) evaluated both reasoning and standard LLMs to determine whether reasoning capabilities improve robustness in medical settings
METHODS:
- Cross-sectional study
- Exposures
- 6 LLMs
- DeepSeek-R1 | o3-mini | Claude-3.5 Sonnet | Gemini-2.0-Flash | GPT-4o | Llama-3.3-70B
- Study design
- 100 questions from MedQA were sampled
- The original correct answer choice was replaced with “none of the above” (NOTA), so the correct answer was no longer present in the multiple-choice set
- All models were asked to provide chain-of-thought answers
- Primary outcome
- Accuracy (percent of answers correct)
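The NOTA substitution described in the methods can be sketched as a simple transformation on a multiple-choice item. This is a minimal illustration, not the authors' code; the dictionary field names (`stem`, `choices`, `answer_idx`) are hypothetical.

```python
def apply_nota(question):
    """Replace the originally correct answer choice with 'None of the above',
    as described in the study's methods. After the substitution, NOTA (at the
    same index) becomes the correct response, and the original correct answer
    no longer appears among the options."""
    modified = dict(question)
    choices = list(question["choices"])
    choices[question["answer_idx"]] = "None of the above"
    modified["choices"] = choices
    return modified

# Illustrative item (not from MedQA):
q = {
    "stem": "A 54-year-old man presents with ...",
    "choices": ["Aspirin", "Heparin", "Warfarin", "Clopidogrel"],
    "answer_idx": 1,  # suppose "Heparin" was originally correct
}
modified = apply_nota(q)
print(modified["choices"])
# ['Aspirin', 'None of the above', 'Warfarin', 'Clopidogrel']
```

The key point of the design is that the pattern-matching shortcut (recognizing the familiar correct option) is removed, so the model must reason to the NOTA conclusion.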
RESULTS:
- The accuracy of all 6 models decreased when questions were modified with NOTA answers
- Relative accuracy decreased by between 8.82% (DeepSeek-R1) and 38.24% (Llama-3.3-70B) across the models
- The DeepSeek-R1 and o3-mini models (“reasoning” models) were more resilient, but all models experienced a significant drop in accuracy when NOTA answers were inserted
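A relative accuracy decrease is the fraction of baseline accuracy lost, not a raw percentage-point difference. As a quick sketch, using the illustrative 80% → 42% figures quoted in the authors' conclusion (the per-model values are reported in the paper):

```python
def relative_drop(baseline, modified):
    """Relative accuracy decrease: fraction of baseline accuracy lost
    when moving from the original to the NOTA-modified questions."""
    return (baseline - modified) / baseline

# 80% -> 42% is an absolute drop of 38 percentage points,
# but a relative drop of 0.38 / 0.80 = 47.5%.
print(round(relative_drop(0.80, 0.42) * 100, 1))
# 47.5
```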
CONCLUSION:
- The ability of LLMs to accurately answer MedQA questions decreased when the correct answer was replaced with “none of the above”
- This demonstrates gaps in accuracy that exist when pattern disruption occurs
- The authors state:
  - “When forced to reason beyond familiar answer patterns, all models demonstrate declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment”
  - “A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common”
Learn More – Primary Sources:
Fidelity of Medical Reasoning in Large Language Models
