I also tried some medical images! I started with some histopathology. I passed in an H&E image of prostate cancer and asked GPT-4 to describe it. It knew it was an H&E image of glandular tissue but was unable to identify it as low-grade prostate cancer.
Here I passed in an image of invasive lobular carcinoma with its characteristic single-file lines of tumor nuclei. Unfortunately, GPT-4 fails to notice this no matter how hard I try.
Here is an example of a glioblastoma (an aggressive brain tumor). It again has a characteristic feature that suggests the diagnosis (pseudopalisading necrosis), but GPT-4 fails to notice it. It does recognize the presence of what look like tumor nuclei.
This image shows H&E of basal cell carcinoma (skin cancer). GPT-4 notices that it is of skin but cannot identify the pathology.
Overall though, GPT-4 mostly refuses to provide anything similar to a diagnosis. Here is one such example with an X-ray image.
My conclusion on the multimodal side is that GPT-4 is an impressive first step towards multimodal medical understanding, but its understanding right now is fairly rudimentary, and there is a lot of room to improve here.
On the text side of things, however, the situation is different. In a recent paper from Microsoft Research, "Capabilities of GPT-4 on Medical Challenge Problems", GPT-4 obtains SOTA on USMLEs (medical student exams), significantly outperforming GPT-3.5.
Other benchmark datasets were tested as well, with GPT-4 again reaching SOTA for most of them.
This was all done without any sophisticated prompting techniques, as shown here:
One may worry the high performance is due to data contamination. Interestingly, this paper performed a memorization analysis, and its memorization detection didn't flag any of the tested USMLE questions (though that doesn't 100% confirm there was no memorization).
Plus the USMLE material is behind a paywall and is unlikely to be in the GPT-4 training set anyway.
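The thread doesn't say how the paper's memorization detection works, so this is only a generic sketch of what such a check could look like: show the model the first half of a question and see whether it reproduces the second half nearly verbatim. `complete` is a hypothetical stand-in for a model call, the 0.95 threshold is an arbitrary assumption, and `difflib`'s similarity ratio is swapped in for whatever metric the authors actually used.

```python
from difflib import SequenceMatcher
from typing import Callable


def looks_memorized(
    complete: Callable[[str], str],  # hypothetical: prompt -> model continuation
    text: str,
    threshold: float = 0.95,         # assumed cutoff, not taken from the paper
) -> bool:
    """Flag `text` if the model can regenerate its second half from its first half."""
    half = len(text) // 2
    prefix, held_out = text[:half], text[half:]

    # Only compare as many characters as the held-out half contains.
    continuation = complete(prefix)[: len(held_out)]

    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    similarity = SequenceMatcher(None, continuation, held_out).ratio()
    return similarity >= threshold
```

A pass on a check like this only shows that the model did not regurgitate the text from this particular prompt, which is why a negative result can't fully rule out memorization.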
Overall, it seems the medical understanding of text-only GPT-4 is significantly improved & multimodal GPT-4 has rudimentary understanding.
Many more experiments should be done to study GPT-4's medical knowledge/reasoning. Some previous studies using GPT-3 concluded that domain/task-specific fine-tuned models are better, and I wonder if the conclusion changes now with GPT-4.
A new startup, Inception Labs, has released Mercury Coder, "the first commercial-scale diffusion large language model"
It's 5-10x faster than current gen LLMs, providing high-quality responses at low costs.
And you can try it now!
The performance is similar to small frontier models while achieving a throughput of ~1000 tokens/sec... on H100s! Reaching this level of throughput for autoregressive LLMs typically requires specialized chips.
It's currently tied for second place on Copilot Arena!
Cleo was an account on Math Stack Exchange that was infamous for dropping the answer to the most difficult integrals with no explanation...
often mere minutes after the question was asked!!
For years, no one knew who Cleo was, UNTIL NOW!
People noticed that the same few people were interacting with Cleo (asking the questions Cleo answered, commenting, etc.), and a couple of them were only active at the same time as Cleo.
People wondered whether someone was controlling all these accounts as alts.
One of the accounts, Laila Podlesny, had an email address associated with it. By attempting to log into that Gmail account and seeing the backup recovery email, someone figured out that Vladimir Reshetnikov was in control of Laila Podlesny.
Based on other interactions from Vladimir on Math.SE, it seemed likely he controlled Cleo, Laila, and a couple of other accounts as well.
This is a diffusion model pipeline that goes beyond what AlphaFold2 did: predicting the structures of protein-molecule complexes containing DNA, RNA, ions, etc.
Google announces Med-Gemini, a family of Gemini models fine-tuned for medical tasks! 🔬
Achieves SOTA on 10 of the 14 benchmarks, spanning text, multimodal & long-context applications.
Surpasses GPT-4 on all benchmarks!
This paper is super exciting, let's dive in ↓
The team developed a variety of model variants. First let's talk about the models they developed for language tasks.
The fine-tuning dataset is quite similar to that of Med-PaLM 2, with one major difference:
self-training with search
The goal is to improve clinical reasoning and ability to use search results.
Synthetic chain-of-thought (CoT) examples, with and without search results in context, are generated; incorrect predictions are filtered out; the model is trained on the remaining CoT; and then the synthetic CoT is regenerated with the updated model.
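To make the loop concrete, here is a toy sketch of that generate, filter, train, regenerate cycle. Everything in it is a hypothetical stand-in (`ToyModel`, `toy_search`, `self_train`, and the exact-match filter are my assumptions, not the paper's code); a real setup would call an LLM and a search API instead.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str, str]  # (question, chain_of_thought, predicted_answer)


class ToyModel:
    """Hypothetical stand-in for an LLM: it 'learns' by memorizing kept answers."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}

    def generate(self, question: str, context: str) -> Tuple[str, str]:
        # Returns (chain_of_thought, predicted_answer).
        if question in self.known:
            return ("recalled from training", self.known[question])
        if context:
            return (f"search result says: {context}", context)
        return ("no idea, guessing", "unknown")

    def train(self, samples: List[Sample]) -> None:
        # "Fine-tuning" here is just storing the filtered answers.
        for question, _cot, answer in samples:
            self.known[question] = answer


def toy_search(question: str) -> str:
    # Hypothetical retrieval that only knows a single fact.
    facts = {"vitamin C deficiency causes?": "scurvy"}
    return facts.get(question, "")


def self_train(
    model: ToyModel,
    search: Callable[[str], str],
    dataset: List[Tuple[str, str]],  # (question, gold_answer) pairs
    rounds: int = 2,
) -> List[Sample]:
    kept: List[Sample] = []
    for _ in range(rounds):
        kept = []  # regenerate the synthetic CoT from scratch each round
        for question, gold in dataset:
            for use_search in (False, True):  # without and with search context
                context = search(question) if use_search else ""
                cot, pred = model.generate(question, context)
                if pred == gold:  # filter out incorrect predictions
                    kept.append((question, cot, pred))
        model.train(kept)  # train on the surviving CoT
    return kept
```

In this toy run, a question the model can only answer with search in round one becomes answerable without search in round two, which is the kind of improvement the loop is meant to produce.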