Do LLMs Reason as Well as Attending Physicians?
BACKGROUND AND PURPOSE:
- It is challenging to measure large language model (LLM) performance on tasks requiring reasoning
- Script Concordance Testing (SCT) measures how well clinical judgments are adjusted when new information becomes available
- McCoy et al. (NEJM AI, 2025) created an SCT benchmark to assess the reasoning capabilities of LLMs versus medical professionals
METHODS:
- Public benchmark of 750 SCT questions
- Drawn from 10 diverse international datasets (9 newly released) across multiple specialties
- SCT benchmark
- Each item presents a clinical vignette
- New data are added, and the model is asked how the added data change the likelihood of a diagnosis or management option
- Model performance is scored against expert-panel responses (a minimal scoring sketch follows the Methods list)
- Groups assessed
- 10 state-of-the-art LLMs
- 1070 medical students
- 193 residents
- 300 attending physicians
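To make the scoring concrete, here is a minimal Python sketch of the standard aggregate scoring method used in script concordance testing, in which the modal expert answer earns full credit and minority answers earn partial credit. The helper names (sct_item_credit, sct_score) and the toy panel data are illustrative assumptions, not the paper's code, and McCoy et al.'s exact implementation may differ.

```python
from collections import Counter

# Likert scale typical of SCT items:
# -2 (much less likely) ... +2 (much more likely)
SCALE = (-2, -1, 0, 1, 2)

def sct_item_credit(panel_responses, candidate_response):
    """Aggregate-scoring credit for one SCT item (hypothetical helper).

    The modal panel answer earns full credit (1.0); any other answer
    earns partial credit proportional to how many panelists chose it.
    """
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(candidate_response, 0) / modal_count

def sct_score(panels_by_item, candidate_responses):
    """Mean item credit across the test, expressed as a percentage."""
    credits = [
        sct_item_credit(panel, answer)
        for panel, answer in zip(panels_by_item, candidate_responses)
    ]
    return 100 * sum(credits) / len(credits)

# Toy example: 3 items, a 5-expert panel per item, one respondent's answers.
panels_by_item = [
    [0, 0, 1, 0, -1],   # mode is 0 -> answering 0 earns 1.0, 1 earns 1/3
    [2, 2, 1, 2, 2],
    [-1, -1, 0, -1, 0],
]
model_answers = [0, 2, 0]
print(f"SCT score: {sct_score(panels_by_item, model_answers):.1f}%")
```

Under this method a respondent is rewarded for matching the spread of expert opinion rather than a single "correct" key, which is what makes the SCT sensitive to how judgments shift when new information arrives.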
RESULTS:
- Overall, LLMs showed markedly lower performance on the SCT than they usually show on medical multiple-choice benchmarks
- Across prompting conditions, OpenAI’s o3 achieved the highest performance
- OpenAI’s o3: 67.8% (SD, 1.2)
- GPT-4o: 63.9% (SD, 1.3)
- OpenAI’s o1-preview: 58.2% (SD, 1.3)
- DeepSeek R1: 55.5% (SD, 1.4)
- Google’s Gemini 2.5 Pro Preview: 52.1% (SD, 1.4)
- Reasoning-optimized models matched or exceeded student performance on multiple examinations but did not reach the level of senior residents or attending physicians
- Response-pattern analysis showed systematic overconfidence
- Reasoning-tuned models overused strong, definitive answers
- This suggests that tuning models to explain their thinking step-by-step may make them less flexible in uncertain clinical situations (one way to quantify this bias is sketched after this list)
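One way to quantify the overconfidence pattern described above is to compare how often a respondent selects the extreme options (-2/+2) or the "no change" option (0) against the expert panel's selections on the same items. The Python sketch below is a hypothetical illustration of that comparison, assuming a -2 to +2 Likert scale; the function names and toy numbers are assumptions, not the authors' analysis code.

```python
def extremity_rate(responses):
    """Share of answers at the extremes (-2 or +2): a simple
    overconfidence signal, since reasoning-tuned models reportedly
    over-select these strong, definitive options."""
    return sum(1 for r in responses if abs(r) == 2) / len(responses)

def no_change_rate(responses):
    """Share of '0' answers: rarely choosing 0 when experts do
    suggests a failure to recognize non-informative new data."""
    return sum(1 for r in responses if r == 0) / len(responses)

# Toy comparison: one model's answers vs pooled expert-panel answers.
model = [2, -2, 2, 1, 2, -2, 2, 0, -2, 2]
panel = [1, -1, 0, 1, 2, -1, 0, 0, -1, 1]

print(f"model extremity: {extremity_rate(model):.0%} "
      f"vs panel {extremity_rate(panel):.0%}")
print(f"model 'no change': {no_change_rate(model):.0%} "
      f"vs panel {no_change_rate(panel):.0%}")
```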
CONCLUSION:
- Even among LLMs explicitly tuned for reasoning, the SCT revealed limited clinical reasoning ability
- The authors state: "Our analysis of model response patterns reveals that even state-of-the-art LLMs exhibit striking overconfidence, disproportionately favoring extreme belief shifts and failing to recognize when new information should not alter clinical hypotheses"
Learn More – Primary Sources:
Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study
