I also tried some medical images! Here I started with some histopathology. I passed in an H&E image of prostate cancer and asked GPT-4 to describe it. It knew it was an H&E image of glandular tissue but was unable to identify it as low-grade prostate cancer.
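(For reference, here is a minimal sketch of how such an image-description query could be sent programmatically with the OpenAI Python SDK, assuming access to a vision-capable GPT-4 model; the model name and image URL below are placeholders.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable GPT-4-class model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this H&E-stained histopathology image."},
            # Hypothetical image URL; any publicly reachable image works the same way.
            {"type": "image_url", "image_url": {"url": "https://example.com/he_slide.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```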
Here I passed in an image of invasive lobular carcinoma with its characteristic single-file lines of tumor nuclei. Unfortunately, it fails to notice this no matter how hard I try.
Here is an example of a glioblastoma (an aggressive brain tumor). It again has a characteristic feature that suggests the glioblastoma diagnosis (pseudopalisading necrosis), but GPT-4 fails to notice it. It does recognize the presence of what look like tumor nuclei.
This image shows an H&E stain of basal cell carcinoma (skin cancer). GPT-4 notices that it is skin but cannot identify the pathology.
Overall though, GPT-4 mostly refuses to provide anything resembling a diagnosis. Here is one such example with an X-ray image.
My conclusion on the multimodal side is that GPT-4 is an impressive first step towards multimodal medical understanding, but its understanding right now is fairly rudimentary, and there is a lot of room to improve here.
On the text side of things, however, the situation is different. In a recent paper from Microsoft Research, "Capabilities of GPT-4 on Medical Challenge Problems", GPT-4 obtains SOTA on the USMLE (US medical licensing exams), significantly outperforming GPT-3.5.
Other benchmark datasets were tested as well, with GPT-4 again reaching SOTA for most of them.
This was all done without any sophisticated prompting techniques, as shown here.
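(The figure isn't reproduced here, but the prompts were kept simple and direct. Below is an illustrative sketch of that kind of zero-shot multiple-choice format; it is not the paper's exact template, and the question stem and choices are placeholders.)

```python
# Illustrative zero-shot multiple-choice prompt (not the paper's exact template).
question = "A 55-year-old man presents with ..."  # placeholder question stem
options = {"A": "...", "B": "...", "C": "...", "D": "..."}  # placeholder answer choices

prompt = (
    "Answer the following multiple-choice question with a single letter.\n\n"
    f"Question: {question}\n"
    + "\n".join(f"{letter}. {text}" for letter, text in options.items())
    + "\nAnswer:"
)
print(prompt)
```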
One might worry that the high performance is due to data contamination. Interestingly, the paper performed a memorization analysis, and their memorization probe did not flag any of the tested USMLE questions (though that doesn't 100% rule out memorization).
Plus, the USMLE material is behind a paywall, so it is unlikely to be in the GPT-4 training set anyway.
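(For intuition, here is a minimal sketch of what a completion-based memorization probe can look like. This is an illustration of the general idea, not the paper's exact method: feed the model the first part of a question and check whether it reproduces the rest nearly verbatim.)

```python
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()

def memorization_score(question: str, cut: float = 0.5, model: str = "gpt-4") -> float:
    """Rough memorization signal: can the model reproduce the second half of a
    question given only the first half? High similarity is a (weak) hint of memorization."""
    split = int(len(question) * cut)
    prefix, suffix = question[:split], question[split:]
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Continue this text exactly, word for word:\n\n{prefix}"}],
    ).choices[0].message.content
    return SequenceMatcher(None, suffix, completion[: len(suffix)]).ratio()
```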
Overall, it seems the medical understanding of text-only GPT-4 is significantly improved, while multimodal GPT-4's understanding is still rudimentary.
Many more experiments should be done to study GPT-4's medical knowledge/reasoning. Some previous studies using GPT-3 concluded that domain/task-specific fine-tuned models are better, and I wonder if that conclusion changes now with GPT-4.
The goal is to build AI assistants that follow certain "constitutional principles" to make models less harmful (less likely to generate offensive outputs, reinforce social biases, etc.).
We can use AI feedback & supervision to get models to follow these principles & limit the amount of human feedback needed. (2/13)
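(Here is a minimal sketch of the critique-and-revision loop behind AI feedback. The principle and prompts below are made up for illustration, not Anthropic's actual constitution, and a generic chat-completion client is used as a stand-in.)

```python
from openai import OpenAI  # stand-in for any chat-completion LLM client

client = OpenAI()
PRINCIPLE = "Choose the response that is least likely to be harmful or offensive."  # made-up example principle

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def critique_and_revise(user_prompt: str) -> str:
    draft = ask(user_prompt)
    critique = ask(
        f"Principle: {PRINCIPLE}\n\nResponse: {draft}\n\n"
        "Point out any way the response violates the principle."
    )
    revised = ask(
        f"Principle: {PRINCIPLE}\n\nOriginal response: {draft}\n\nCritique: {critique}\n\n"
        "Rewrite the response so that it follows the principle."
    )
    # Revised responses like this can serve as AI-generated supervision for fine-tuning.
    return revised
```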
So, I've heard people say anyone could have built ChatGPT. I think this is disingenuous.
ChatGPT isn't just GPT-3 w/ a chat interface on top of it.
The closest base model on the OpenAI API is probably text-davinci-003, but it was only released a day before ChatGPT! (1/9)
Maybe someone could have created a model like text-davinci-003?
Well, ChatGPT and text-davinci-003 are trained with lots and lots of human feedback, which is why they do so well. That's not easy for anyone to obtain! (2/9)
OpenAI is clearly a leader in utilizing human feedback to improve models. They pioneered RLHF, one of the leading approaches, which powers ChatGPT.
On a related note, claiming OpenAI just scaled up existing work is ignoring OpenAI's expertise in utilizing human feedback. (3/9)
Are you wondering how large language models like ChatGPT and InstructGPT actually work?
One of the secret ingredients is RLHF - Reinforcement Learning from Human Feedback.
Let's dive into how RLHF works in 8 tweets!
Large language models (LLMs) are trained w/ self-supervised learning using next-token prediction, which actually makes them bad at instruction following. This example from OpenAI's blog shows how GPT-3 succeeds at next-token prediction but fails at instruction following. 1/8
What if we could use human-annotated data to improve our language model? One approach, known as supervised fine-tuning (SFT), is to take your pretrained LLM and fine-tune it with (prompt, human-written response) pairs. 2/8
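(Here is a minimal sketch of the SFT step using a small open model as a stand-in. The (prompt, response) pair is a toy example; a real setup would start from a much larger pretrained LLM, use many demonstrations, and typically mask the prompt tokens out of the loss.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; real SFT starts from a much larger pretrained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy (prompt, human-written response) pair.
pairs = [("Explain the moon landing to a 6 year old.",
          "Some people flew a rocket to the moon, walked around, and came back to tell us about it.")]

model.train()
for prompt, response in pairs:
    text = f"{prompt}\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Same next-token prediction loss as pretraining, but now on demonstrations
    # of the behavior we actually want.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```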
I will attempt to explain the basic idea of how diffusion models work!
... in only 15 tweets! 😲
Let's get started ↓
Diffusion models are *generative* models, which simply means: given some example datapoints (your training dataset), generate more like them.
For example, given cute dog images, generate more cute dog images! (1/15)
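(As a toy illustration of "generate more like it": fit a distribution to some datapoints, then sample new points from it. Real generative models are vastly richer than this one-dimensional Gaussian, but the fit-then-sample pattern is the same.)

```python
import numpy as np

# Pretend these numbers are our training examples ("cute dog images").
data = np.random.normal(loc=5.0, scale=2.0, size=1000)

# "Training": estimate the parameters of a simple model of the data.
mu, sigma = data.mean(), data.std()

# "Generation": sample new datapoints that look like the training data.
new_samples = np.random.normal(mu, sigma, size=10)
print(new_samples)
```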
There are many kinds of generative models. GANs (like the one powering thispersondoesnotexist.com) are generative models for images; GPT-3 is also a generative model, but for text. So just keep in mind that while we'll talk about images, the general principles can apply to other domains. (2/15)