We believe in rigorous, careful evaluation. Physicians even preferred #MedPaLM2's long-form answers to answers from real 🇮🇳🇺🇸🇬🇧 physicians on 8 of 9 axes of quality, including medical accuracy (consensus w/ medical opinion) and reasoning, with lower likelihood of harm
Med-PaLM 2's advantage over Med-PaLM goes well beyond exam performance. To highlight the real-world importance of nuanced evaluation, we introduce a new dataset of "adversarial" questions designed specifically to probe LLM weaknesses, including on #HealthEquity
Lay raters also consistently rate Med-PaLM 2 as more helpful and find that it directly addresses the intent behind a medical question:
💡New paper - Large Language Models Encode Clinical Knowledge💡 Our work @GoogleHealth @GoogleAI @DeepMind advances the state of the art on 7 medical question-answering tasks, including achieving 67% on MedQA (USMLE qs), improving on prior work by >17%
Careful evaluation is key for LLMs in safety-critical settings. We pilot a framework for clinician and layperson evaluation of LLMs’ outputs. Deeper human inspection reveals gaps in comprehension + reasoning (2/n)
We address these gaps with instruction prompt tuning. We show that this helps align the resulting model, "Med-PaLM", better to the medical domain, with smaller gaps in reasoning, comprehension, safety and helpfulness
Our research @GoogleHealth @GoogleAI @DeepMind, published in Medical Image Analysis: goo.gle/31kUam7
Wise doctors know when they don’t know - medical AI should too. In dermatology this is critical, as many rare skin conditions occur too infrequently for AI to learn (1/n)
For AI researchers, detecting conditions a model has not seen in training is called “out-of-distribution (OOD) detection”. Doing this in medical AI is significantly harder than in most computer vision work, because the differences between rare + common diseases can be subtle
Using our large-scale pre-training advances and a novel "HOD" loss, we achieved an AUC of 0.83 on a new benchmark for this "near-out-of-distribution" detection challenge, which evaluates how well a dermatology AI system recognises a previously unseen condition.
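For readers less familiar with the metric, here is a minimal Python sketch of how an AUC for OOD detection can be computed. It is not the method from the paper: the OOD score used here (1 minus the highest predicted class probability) is a common baseline swapped in for illustration, not the "HOD" loss, and the names `ood_score` and `auroc` and the toy probabilities are all made up for this example.

```python
def ood_score(class_probs):
    """Assumed baseline OOD score: 1 minus the highest predicted class
    probability. Higher means the model is less sure the case belongs
    to any condition it was trained on."""
    return 1.0 - max(class_probs)


def auroc(scores_seen, scores_unseen):
    """AUC as the fraction of (unseen, seen) pairs in which the unseen
    case gets the higher OOD score. Ties count as half a win."""
    wins = 0.0
    for u in scores_unseen:
        for s in scores_seen:
            if u > s:
                wins += 1.0
            elif u == s:
                wins += 0.5
    return wins / (len(scores_seen) * len(scores_unseen))


# Toy predicted class probabilities (hypothetical, 3 trained conditions).
seen_cases = [[0.90, 0.05, 0.05], [0.70, 0.20, 0.10], [0.50, 0.30, 0.20]]
unseen_cases = [[0.40, 0.35, 0.25], [0.60, 0.30, 0.10], [0.34, 0.33, 0.33]]

seen_scores = [ood_score(p) for p in seen_cases]
unseen_scores = [ood_score(p) for p in unseen_cases]

# 8 of the 9 (unseen, seen) pairs are ranked correctly here.
auc = auroc(seen_scores, unseen_scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 means the score does no better than chance at separating seen from unseen conditions, and 1.0 means perfect separation.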