How does GPT-4 do in the medical domain?

I got to play around with its multimodal capabilities on some medical images!

Plus a recent Microsoft paper examined its text understanding and reported SOTA results on USMLE medical exam questions!

A quick thread ↓
As I showed earlier, I had the chance last week to play around with GPT-4's multimodal capabilities:
I also tried some medical images! Here I started with histopathology: I passed in an H&E image of prostate cancer and asked GPT-4 to describe it. It knew it was an H&E image of glandular tissue but was unable to identify it as low-grade prostate cancer.
Here I passed in an image of invasive lobular carcinoma with its characteristic single-file lines of tumor nuclei. It fails to notice this, unfortunately, no matter how hard I try.
Here is an example of a glioblastoma (an aggressive brain tumor). It again has a characteristic feature suggesting the glioblastoma diagnosis (pseudopalisading necrosis), but GPT-4 fails to notice it. It does recognize the presence of what look like tumor nuclei.
This image shows an H&E stain of basal cell carcinoma (a skin cancer). GPT-4 notices that the tissue is skin but cannot identify the pathology.
Overall though, GPT-4 mostly refuses to provide anything resembling a diagnosis. Here is one such example with an X-ray image.
My conclusion on the multimodal side is that GPT-4 is an impressive first step towards multimodal medical understanding, but its understanding right now is fairly rudimentary, and there is a lot of room to improve here.
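For anyone who wants to try this kind of query themselves: here's a minimal sketch using the OpenAI Python SDK, assuming a vision-capable model is reachable through the public API (my early access went through a different interface, and the model name and image path here are placeholders).

```python
# Minimal sketch: ask a vision-capable GPT-4 model to describe a histopathology image.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the environment;
# the model name and local image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

with open("prostate_he.png", "rb") as f:  # hypothetical local H&E image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this H&E-stained slide. What tissue is it, and do you see any pathology?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```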
On the text side of things, however, the situation is different. In a recent paper from Microsoft Research, "Capabilities of GPT-4 on Medical Challenge Problems", GPT-4 obtains SOTA on the USMLE (the US medical licensing exams), significantly outperforming GPT-3.5.
Other benchmark datasets were tested as well, with GPT-4 again reaching SOTA for most of them.
This was all done without any sophisticated prompting techniques.
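To give a feel for how simple the prompting is: the prompt is essentially just the question plus its answer choices. The exact template is in the paper; this hypothetical sketch only shows the general flavor.

```python
# Rough sketch of a simple zero-shot prompt for a USMLE-style multiple-choice
# question. The paper's exact template may differ; this is illustrative only.
def build_prompt(question: str, options: dict[str, str]) -> str:
    lines = ["The following is a multiple-choice question from a medical exam.", ""]
    lines.append(question)
    lines.append("")
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}. {text}")
    lines += ["", "Answer:"]
    return "\n".join(lines)

prompt = build_prompt(
    "A 55-year-old man presents with chest pain. What is the most likely diagnosis?",
    {"A": "Option one", "B": "Option two", "C": "Option three", "D": "Option four"},
)
print(prompt)  # sent as-is to the model; the reply should begin with a letter
```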
One may worry the high performance is due to data contamination. Interestingly, the paper performed a memorization analysis, and its detector didn't flag any of the tested USMLE questions as memorized (though that doesn't definitively rule out memorization).
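The rough idea, as I understand it: feed the model the first part of a question and check whether its continuation reproduces the true remainder; near-verbatim matches would suggest the question was memorized. A minimal sketch, with a hypothetical generate() standing in for the model call:

```python
# Rough sketch of a memorization probe in the spirit of the paper's analysis.
# `generate` is a hypothetical stand-in for a model call; the paper's actual
# detector may differ in its details.
from difflib import SequenceMatcher

def memorization_score(question: str, generate, frac: float = 0.5) -> float:
    cut = int(len(question) * frac)
    prefix, true_rest = question[:cut], question[cut:]
    completion = generate(prefix)  # model's continuation of the prefix
    # Similarity in [0, 1]; values near 1 suggest verbatim memorization.
    return SequenceMatcher(None, completion[: len(true_rest)], true_rest).ratio()
```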
Plus, the USMLE material is behind a paywall, so it's probably unlikely to be in the GPT-4 training set anyway.
Overall, it seems text-only GPT-4's medical understanding is significantly improved, while multimodal GPT-4's understanding is still rudimentary.
Many more experiments should be done to study GPT-4's medical knowledge and reasoning. Some previous studies using GPT-3 concluded that domain- or task-specific fine-tuned models are better, and I wonder if that conclusion changes now with GPT-4.

#MedTwitter #PathTwitter
If you like this thread, please share!

Consider following me for AI-related content! → @iScienceLuvr
