Adam Rodman
Jun 16 · 30 tweets · 13 min read
Can GPT-4 solve really hard medical cases and come up with a good list of differential diagnoses?

My study with @zahirkanjee and @byrondcrowe is out in @JAMA_current, and the short answer is, “Yes.”

But what does this all mean? 🧵⬇️
First, the topline results – running all of the post-2021 published @NEJM clinicopathologic conference (CPC) cases through it, ChatGPT got the final diagnosis right in 39% of cases, and had the final diagnosis somewhere in its differential in 64% of cases.
And judged by a previously published scale of differential “quality,” the lists it generated were good.

Are these numbers good? Most of us don’t want a doctor who’s right only 40% of the time.
I think the answer is yes.

These cases are INCREDIBLY difficult. I’m not aware of any literature looking at the actual rates of humans solving CPCs, but the very brave authors of this study (bmcmedinformdecismak.biomedcentral.com/articles/10.11…) solved them together and only got 28% right.
ChatGPT performs just as well as the top commercial differential diagnosis generators, and that’s with no special medical training (of course we don’t even KNOW GPT-4's training). It also develops more “useful” lists of differentials, accepting the subjectivity of that judgment.
So what does this mean? Will you be going to see Dr. GPT soon? After all, we started looking into this at the same time that @ZSchoepflinMD and I hosted Dr. GPT as an expert discussant at the @BIDMC_IM CPC.
The answer is “no” (or rather, “not yet”). CPCs are best thought of as very, very hard and very, very nerdy puzzles. The data is all curated by a human expert, for other humans to solve (and learn from, which is their primary purpose).
Real-life diagnosis is not like this at all – data has to be gathered via a hypothetico-deductive process and then mentally sorted and prioritized under considerable epistemic uncertainty. @andrewolsonmd
One thing I think our study DOES suggest is that doctors who are “stumped” on their cases right now could give GPT-4 a problem representation in order to broaden their differential (eg, using it for debiasing). Which does appear to be how a lot of people are using it already.
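As a purely hypothetical sketch of what “giving GPT-4 a problem representation” could look like if you scripted it: the function name, prompt wording, and case below are all my own inventions, not anything from our study, and the example deliberately stops at building the prompt (no API call, no real patient data).

```python
# Hypothetical helper: turn a clinician's one-line problem representation
# into a prompt asking an LLM to broaden the differential (i.e., debiasing).
# Wiring this to an actual chat API is intentionally left out.

def build_differential_prompt(problem_representation: str, n: int = 10) -> str:
    """Build a differential-broadening prompt from a problem representation."""
    return (
        "You are assisting with diagnostic reasoning on a de-identified case.\n"
        f"Problem representation: {problem_representation}\n"
        f"List the {n} most likely diagnoses, explicitly including ones a "
        "clinician might overlook, each with a one-line justification."
    )

# Invented example case (no real patient):
prompt = build_differential_prompt(
    "62-year-old man with subacute fever, weight loss, and new renal failure"
)
print(prompt)
```

The point of keeping the input to a tight problem representation (rather than pasting in a chart) is both better reasoning hygiene and a reminder that nothing identifiable should ever go into a consumer chatbot.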
And of course, I have to point out that ChatGPT is *NOT* HIPAA compliant.
Now for the fun part – the history! Where does our study fit into the larger literature? And what are the next steps?
Doctors have been trying to build machines that think like doctors for over a century. I love to show this quote to people (h/t @Rochalimaea) because it’s from 1918!!!
But the ostensible beginning of the mission to build a diagnostic artificial intelligence started in 1959, with the famous paper by Ledley and Lusted in Science – The Reasoning Foundations of Medical Diagnosis (science.org/doi/10.1126/sc…)
It is a difficult read, but it basically set out a roadmap to build a diagnosis generator – using epidemiological data to establish the pre-test probability, and then seeing how different findings, tests, and symptom complexes affect the post-test probability.
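That roadmap is, at its core, Bayesian updating, and the mechanics fit in a few lines. This is a generic illustration of the pre-test/post-test idea rather than code from the 1959 paper, and the prevalence and likelihood-ratio numbers are made up for the example.

```python
# Ledley-and-Lusted-style update: start from a pre-test probability
# (epidemiology), apply a likelihood ratio for a finding or test result,
# and read off the post-test probability. Illustrative numbers only.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Probability -> odds, multiply by the likelihood ratio, odds -> probability."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# A disease with 2% pre-test probability; a positive finding with LR+ of 10
# lifts it to roughly 17%:
p = post_test_probability(0.02, 10.0)
print(round(p, 3))  # → 0.169
```

Stacking several findings just means multiplying their likelihood ratios onto the odds in turn – which is exactly the "symptom complexes affect the post-test probability" machinery Ledley and Lusted sketched.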
And they used the NEJM CPCs as their model of showing this was possible.
Since 1959, there have been dozens of “differential diagnosis generators.” The most famous was INTERNIST-I in the 1970s and 1980s, which tried to model the mind of @PittDeptofMed’s Jack Myers (nejm.org/doi/full/10.10…)
(Also love that the headline is “Artificial Intelligence Comes of Age” – from 1978! Another @Rochalimaea hobby: finding how repetitive our narratives about technology and AI are.)
And differential generators are still around today – the most widely used and studied is Isabel (isabelhealthcare.com). And when these things are studied, they use (drumroll please) NEJM CPCs!
TL;DR #1 – solving complex cases (in particular, NEJM CPCs) has been a standard – a somewhat arbitrary one – for testing differential generators since the concept was developed. In this sense, ChatGPT performs with the best of them.
What about the LLM literature? Google and OpenAI have talked about the success of their LLMs in terms of their ability on multiple choice question tests (like passing the USMLE pubmed.ncbi.nlm.nih.gov/36812645/ or failing GI journals.lww.com/ajg/Abstract/9…)
Is this a good benchmark?

Of course not!

The ability to pass a standardized test tells us very little about how it makes decisions (even though the medical profession still loves them). And I’m not the only one who thinks that -- eg, @peteratmsr
Should solving CPCs be a good benchmark? It's what we’ve done for over 60 years. And even if it’s better than MCQs, I think it’s time to move on (and yes, I realize the irony, since I did this study!)
Any medical professional who has dabbled with GPT-4 more than a little bit knows that it can show uncanny insights about clinical reasoning – comparing and contrasting illness scripts and seeming to understand Bayesian reasoning and how new information changes probabilities.
Ledley and Lusted had a very mechanistic understanding of how diagnoses were made. But since that time, we’ve learned so much more about clinical reasoning, especially from cognitive psychology.
I personally think that this technology is advanced enough RIGHT NOW that we need to move on from studying it like a differential generator. We need to start studying actual reasoning processes and developing new benchmarks @BageLeMage
This is important because soon – not now, and probably not with GPT-4 – there are going to be studies with LLMs doing decision support with real patient data. And so much more goes into reasoning than just providing a rank-ordered differential.
So TL;DR #2 – our @JAMA_current study is at an intersection of the differential generator literature, and a new medical LLM literature. And it underscores the need to start thinking critically about how we’re going to evaluate these systems.
For those of you still reading, I came at this from an unusual angle -- I'm a doctor and educator, but I'm also a historian with a focus on the development of diagnosis.

I host a history podcast with @ACPinternists called @BedsideRounds (bedsiderounds.org)
I have a whole series on the development of diagnosis (with three episodes covering CDS and the birth of AI in the next two months):

1: bedside-rounds.org/episode-59-cry…
2: bedside-rounds.org/episode-63-sig…
3: bedside-rounds.org/episode-64-a-v…
4: bedside-rounds.org/episode-68-the…
5: bedside-rounds.org/episode-69-the…
6: bedside-rounds.org/episode-72-pro…

And if you're a member of @ACPIMPhysicians you can get CME/MOC credits just for listening at acponline.org/BedsideRounds

