Ph.D. Student @UCBerkeley | Prev. Research Intern @MSFTResearch Health Futures
May 24 • 4 tweets • 1 min read
🩺Medical benchmarks measure if LLMs get the correct final diagnosis. True clinical reasoning requires sequential belief updating: does the model revise its beliefs appropriately as new evidence appears?
New preprint: arxiv.org/abs/2505.22919
We introduce ER-Reason, a dataset of 25,174 de-identified clinical notes from 3,437 patients that supports evaluation across all stages of the emergency department (ED) workflow: triage intake, treatment selection, disposition planning, and final diagnosis.