First, the topline results – running all of the @NEJM clinicopathologic conferences published after 2021 through ChatGPT, it got the final diagnosis right in 39% of cases, and included the final diagnosis in its differential in 64% of cases.
And judged on a previously published “quality” scale, its differentials were good.
Are these numbers good? Most of us don’t want a doctor who’s right only 40% of the time.
I think the answer is yes.
These cases are INCREDIBLY difficult. I’m not aware of any literature looking at the actual rates of humans solving CPCs, but the very brave authors of this study (bmcmedinformdecismak.biomedcentral.com/articles/10.11…) solved them together and only got 28% right.
ChatGPT performs just as well as the top commercial differential diagnosis generators, and that’s with no special medical training (of course we don’t even KNOW GPT-4's training). It also develops more “useful” lists of differentials, accepting the subjectivity of that judgment.
So what does this mean? Will you be going to see Dr. GPT soon? After all, we started looking into this around the same time that @ZSchoepflinMD and I hosted Dr. GPT as an expert discussant at the BIDMC_IM CPC.
The answer is “no” (or rather “not yet”). CPCs are best thought of as very, very hard and very, very nerdy puzzles. The data are all curated by a human expert, for other humans to solve (and learn from, which is their primary purpose).
Real-life diagnosis is not like this at all – data have to be gathered via a hypothetico-deductive process and then mentally sorted and prioritized under considerable epistemic uncertainty. @andrewolsonmd
One thing I think our study DOES suggest is that doctors who are “stumped” on their cases right now could give GPT-4 a problem representation in order to broaden their differential (eg, using it for debiasing). That does appear to be how a lot of people are using it already.
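For the curious, here’s a minimal sketch of what that could look like in practice, assuming the openai Python library (the legacy ChatCompletion interface) and a completely invented, de-identified problem representation. This isn’t how our study was run – just an illustration of handing the model a one-liner to broaden a differential.

```python
# Hypothetical sketch: giving GPT-4 a problem representation to broaden a
# differential. Uses the legacy openai-python ChatCompletion interface.
# The patient summary below is entirely fictional and de-identified.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

problem_representation = (
    "67-year-old man with diabetes presenting with two weeks of fevers, "
    "night sweats, and new low back pain. What is a broad differential?"
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are helping a physician broaden a differential diagnosis."},
        {"role": "user", "content": problem_representation},
    ],
)

print(response.choices[0].message.content)
```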
And of course, I have to point out that ChatGPT is *NOT* HIPAA compliant.
Now for the fun part – the history! Where does our study fit into the larger literature? And what are the next steps?
Doctors have been trying to build machines that think like doctors for over a century. I love to show this quote to people (h/t @Rochalimaea) because it’s from 1918!!!
But the mission to build a diagnostic artificial intelligence ostensibly began in 1959, with the famous paper by Ledley and Lusted in Science – The Reasoning Foundations of Medical Diagnosis (science.org/doi/10.1126/sc…)
It is a difficult read, but it basically set out a roadmap to build a diagnosis generator – using epidemiological data to establish the pre-test probability, and then seeing how different findings, tests, and symptom complexes affect the post-test probability.
And they used the NEJM CPCs as their model of showing this was possible.
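To make that roadmap concrete, here’s a minimal sketch of the Bayesian updating it implies – convert a pre-test probability to odds, apply a finding’s likelihood ratio, convert back. The numbers are invented for illustration; this is my gloss on the idea, not code or figures from the paper.

```python
# Bayesian updating behind the Ledley–Lusted roadmap (illustrative only):
# start from a pre-test probability (e.g., prevalence) and revise it as
# findings come in, using each finding's likelihood ratio.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability -> odds, apply the likelihood ratio, convert back."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Made-up example: 5% pre-test probability, updated by two findings
# with likelihood ratios of 6.0 and 2.5.
p = 0.05
for lr in (6.0, 2.5):
    p = post_test_probability(p, lr)
print(f"post-test probability: {p:.2f}")  # ~0.44
```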
Since 1959, there have been dozens of “differential diagnosis generators”. The most famous was INTERNIST-I in the 1970s and 1980s, which tried to model the mind of @PittDeptofMed’s Jack Myers (nejm.org/doi/full/10.10…)
(also love that the headline “Artificial Intelligence Comes of Age” is from 1978! another @Rochalimaea hobby: finding how repetitive our narratives about technology and AI are)
And differential generators are still around today – the most widely used and studied is Isabel (isabelhealthcare.com). And when these things are studied, they use (drumroll please) NEJM CPCs!
TL;DR #1 – solving complex cases (in particular, NEJM CPCs) has been a standard – a somewhat arbitrary one – for testing differential generators since the concept was developed. In this sense, ChatGPT performs with the best of them.
The ability to pass a standardized test tells us very little about how a system actually makes decisions (even though the medical profession still loves standardized tests). And I’m not the only one who thinks that -- eg, @peteratmsr
Is solving CPCs a good benchmark? It’s what we’ve done for over 60 years. But even if it’s better than MCQs, I think it’s time to move on (and yes, I realize the irony, since I did this study!)
Any medical professional who has dabbled with GPT-4 more than a little bit knows that it can show uncanny insights about clinical reasoning – comparing and contrasting illness scripts and seeming to understand Bayesian reasoning and how new information changes probabilities.
Ledley and Lusted had a very mechanistic understanding of how diagnoses were made. But since that time, we’ve learned so much more about clinical reasoning, especially from cognitive psychology.
I personally think that this technology is advanced enough RIGHT NOW that we need to move on from studying it like a differential generator. We need to start studying actual reasoning processes and developing new benchmarks @BageLeMage
This is important because soon – not now, and probably not with GPT-4 – there are going to be studies with LLMs doing decision support with real patient data. And so much more goes into reasoning than just providing a rank-ordered differential.
So TL;DR #2 – our @JAMA_current study is at an intersection of the differential generator literature, and a new medical LLM literature. And it underscores the need to start thinking critically about how we’re going to evaluate these systems.
For those of you still reading, I came at this from an unusual angle -- I'm a doctor and educator, but I'm also a historian with a focus on the development of diagnosis.
Since my thread on the historicity of the exam has gained some traction, here’s a reading list if you’re interested in gaining perspective on the nature of clinical reasoning -- rather than “just so” stories about imagined halcyon pasts (the era of “the Giants”).
"What does curiosity have to do with the humanistic practice of medicine? Couldn’t it just convert
patients into objects of analysis? I believe that it is
curiosity that converts strangers
(the objects of analysis) into people we can empathize with."
Faith implicitly groks that our "scientific" interest in patients (the Foucauldian "clinical gaze") takes something AWAY from that human relationship. Curiosity might inoculate us.
So many arguments about what's wrong with medicine today are predicated on imagined (and inaccurate) histories. Let's take some examples from my colleagues who imagine a "golden" age of the exam:
The physical exam is barely older than modern diagnostic tests. For example, the neurological exam was developed in the 1890s, the same time as the x-ray. We were ALREADY performing blood smears and gram stains and checking hematocrits when the neurological exam was developed.
Nineteenth-century physicians didn’t even see a difference between performing an exam and other tests; they were all the same to them! There’s wonderful language describing increased white blood cells on a smear in the same way you’d describe exam findings!
I want to keep highlighting some of the amazing speakers we have at the @iMedEducation #DigitalEducation2022 conference, held virtually on October 7th, and in person in Boston on October 8th!
Everyone has a professional or educational message that we want to get out to the world.
@AshleyGWinter is an expert at education and advocacy for sexual health and sexual medicine. She is going to be sharing her insights about this journey for all of us!
We listened to (and coded) the top 100 podcasts on the Apple podcasts US medicine chart to find out!
A 🧵⬇️
There were 2⃣ inspirations for this study.
@ShreyaTrivediMD and I at @iMedEducation think that what makes digital education distinct from, eg, an uploaded lecture on YouTube is that it is produced as part of a virtual community of practice rather than by traditional institutions.
So we had a hypothesis: the most popular medical podcasts would *not* be produced by medical schools, residency programs, or other large institutions, but rather by individuals (or separate companies/nonprofits).
Almost exactly a year ago, I had a modestly controversial tweet about routine daily physical exams -- and about how we should probably spend more time actually talking to our patients daily rather than pretending to examine them.
1⃣: the exam was historically developed as a DIAGNOSTIC test. And it remains an incredibly good diagnostic test in many instances, with innumerable examples validated through both physician experience and, more recently, epidemiological studies (think McGee and the Rational Clinical Exam)
My (preaching to the choir) 🔥 take: digital educational skills -- whether teaching on #MedTwitter, podcasting, or making videos -- are essential #meded skills for the 21st century. And we can teach these to future educators.