Microsoft claims their new AI framework diagnoses 4x better than doctors.
I'm a medical doctor and I actually read the paper. Here's my perspective on why this is both impressive AND misleading ... 🧵
What did they create? Two key innovations: 1. SDBench: A testing environment using 304 real medical mysteries from NEJM where AI starts with just "29yo woman with sore throat" and must decide what to ask/test next
2. MAI-DxO: An AI system that simulates 5 doctors working together as a team
How did they test the AI/Doctors?
They took 304 real cases from NEJM and turned them into an interactive game.
The setup:
Step 1: You (human doctor or AI) get a tiny intro like: "52-year-old man with fever and breathing problems." That's it. No test results, no detailed history - just like a patient walking into the ER.
Step 2: There's a "Gatekeeper" (another AI) that has the full case file but won't tell you anything unless you specifically ask.
Step 3: You can do three things: 1. Ask questions ("Any recent travel?" "Is there chest pain?") 2. Order tests ("CBC" "Chest X-ray" "CT scan") 3. Make your final diagnosis ("This is pneumonia")
Step 4: The Gatekeeper then answers the question. BUT it only reveals what you ask for. If you don't think to ask about travel history, you won't find out the patient just returned from a cave expedition (real case - histoplasmosis).
Step 5: Every test costs money (real US hospital prices). Every round of questions = $300 office visit.
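To make the SDBench setup concrete, here's a minimal Python sketch of the Gatekeeper loop. It assumes the case file can be stored as a simple question-to-answer lookup. The class names (`Gatekeeper`, `Encounter`), the test prices and the exact-match scoring are hypothetical placeholders, not Microsoft's code or real hospital prices. Only the $300-per-round visit fee comes from the setup described above.

```python
from dataclasses import dataclass, field

# Hypothetical prices for illustration only; SDBench uses real US hospital prices.
TEST_PRICES = {"CBC": 30, "Chest X-ray": 100, "CT scan": 1200}
VISIT_COST = 300  # every round of questions is billed as a $300 office visit


@dataclass
class Gatekeeper:
    """Holds the full case file but only reveals what is explicitly asked for."""
    case_file: dict[str, str]

    def reveal(self, request: str) -> str:
        return self.case_file.get(request, "No information.")


@dataclass
class Encounter:
    """Tracks what a doctor (human or AI) has asked, ordered and spent."""
    gatekeeper: Gatekeeper
    cost: int = 0
    log: list[tuple[str, str]] = field(default_factory=list)

    def ask(self, questions: list[str]) -> list[str]:
        self.cost += VISIT_COST  # one round, however many questions it contains
        answers = [self.gatekeeper.reveal(q) for q in questions]
        self.log.extend(zip(questions, answers))
        return answers

    def order_test(self, test: str) -> str:
        self.cost += TEST_PRICES[test]  # hypothetical price table above
        result = self.gatekeeper.reveal(test)
        self.log.append((test, result))
        return result

    def diagnose(self, guess: str, true_diagnosis: str) -> bool:
        # Crude exact-match scoring, standing in for the benchmark's real grading.
        return guess.strip().lower() == true_diagnosis.strip().lower()


# Hypothetical case: the travel history is only revealed if someone asks for it.
case = Gatekeeper({"Any recent travel?": "Just returned from a cave expedition."})
visit = Encounter(case)
answers = visit.ask(["Any recent travel?", "Is there chest pain?"])
xray = visit.order_test("Chest X-ray")
correct = visit.diagnose("Histoplasmosis", "histoplasmosis")
print(answers, xray, visit.cost, correct)
```

The key design choice is that the Gatekeeper never volunteers anything. If nobody asks about travel, the cave never comes up.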
MAI-DxO isn't a new model but a framework built on top of existing LLMs (ChatGPT, Claude, Gemini).
How does this framework work?
It asks the LLM to simulate a virtual panel of 5 specialised AI doctors:
Dr. Hypothesis (tracks diagnoses)
Dr. Test-Chooser (selects optimal tests)
Dr. Challenger (plays devil's advocate)
Dr. Stewardship (manages costs)
Dr. Checklist (quality control)
The five then argue it out among themselves to decide the best path forward.
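One way to picture the "virtual panel" in code is a single LLM prompted once per role, where each role sees what the earlier ones said. This is a hypothetical sketch, not MAI-DxO's actual prompts or orchestration. `panel_round` and the `llm` callable are illustrative names, and the role instructions are paraphrased from the list above.

```python
from typing import Callable

# Role instructions paraphrased from the thread; not the paper's actual prompts.
PANEL = {
    "Dr. Hypothesis": "Keep a ranked list of candidate diagnoses.",
    "Dr. Test-Chooser": "Pick the tests that best separate the leading diagnoses.",
    "Dr. Challenger": "Play devil's advocate and look for evidence against the favourite.",
    "Dr. Stewardship": "Flag expensive or low-value tests and suggest cheaper options.",
    "Dr. Checklist": "Check the panel's reasoning for errors and inconsistencies.",
}


def panel_round(case_so_far: str, llm: Callable[[str], str]) -> dict[str, str]:
    """One deliberation round: the same model is prompted once per role,
    and each role sees the case plus everything said before it."""
    transcript: dict[str, str] = {}
    for role, instructions in PANEL.items():
        discussion = "\n".join(f"{r}: {msg}" for r, msg in transcript.items())
        prompt = (
            f"You are {role}. {instructions}\n"
            f"Case so far:\n{case_so_far}\n"
            f"Panel discussion so far:\n{discussion or '(none yet)'}"
        )
        transcript[role] = llm(prompt)
    return transcript


# Stand-in for a real model call, so the sketch runs offline.
prompts_seen: list[str] = []


def echo_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"noted ({len(prompts_seen)})"


result = panel_round("29yo woman with sore throat", echo_llm)
for role, reply in result.items():
    print(f"{role}: {reply}")
```

In the benchmark, each round would end with one of the three actions from Step 3 (ask, test or diagnose). That final step is left out here.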
The results?
Accuracy:
Doctors: 20% (ouch)
Standard AI: 30-79%
MAI-DxO: 80-85.5%
Cost per case:
Doctors: $2,963
Standard AI (o3): $7,850
MAI-DxO: $2,397
On paper, the AI was 4x more accurate AND cheaper...
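The headline claims are simple ratios of the figures above. A quick Python check, using only the thread's numbers, shows where "4x" and "cheaper" come from:

```python
doctor_accuracy = 20.0             # % of cases doctors diagnosed correctly
mai_dxo_accuracy = (80.0, 85.5)    # % range reported for MAI-DxO

ratio_low = mai_dxo_accuracy[0] / doctor_accuracy    # 80 / 20
ratio_high = mai_dxo_accuracy[1] / doctor_accuracy   # 85.5 / 20

cost_per_case = {"Doctors": 2963, "Standard AI (o3)": 7850, "MAI-DxO": 2397}
savings = cost_per_case["Doctors"] - cost_per_case["MAI-DxO"]
savings_share = savings / cost_per_case["Doctors"]

print(f"Accuracy: {ratio_low:.1f}x to {ratio_high:.1f}x the doctors")
print(f"Saves ${savings} per case vs doctors ({savings_share:.0%})")
```

So "4x" is the low end of MAI-DxO's range divided by the doctors' 20%. "Cheaper" means about $566 less per case, roughly 19%.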
But I see five issues: 1. They used ZERO healthy patients
95% of sore throats are viral, yet this AI was only tested on incredibly rare diagnostic cases.
We don't know if it will order biopsies on every patient with a sore throat "just to rule out rhabdomyosarcoma."
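Why does testing only on rare cases matter? Base rates. Below is a generic Bayes' rule sketch with made-up numbers: 95% sensitivity and specificity, and a rare cancer present in 1 of every 10,000 sore throats. These are hypothetical illustrations, not figures from the paper, and MAI-DxO doesn't literally run a single yes/no test. The point is that a "suspicion rule" that looks great on an enriched case set produces mostly false alarms in an everyday clinic.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value, P(disease | positive result), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_alarms(specificity: float, prevalence: float, n_patients: int) -> float:
    """Expected number of disease-free patients flagged for further work-up."""
    return (1 - specificity) * (1 - prevalence) * n_patients


SENS, SPEC = 0.95, 0.95  # hypothetical, for illustration only
settings = {
    "Enriched case set (half have the disease)": 0.5,
    "Everyday sore throats (1 in 10,000)": 1 / 10_000,
}
for name, prevalence in settings.items():
    print(
        f"{name}: PPV = {ppv(SENS, SPEC, prevalence):.1%}, "
        f"false alarms per 10,000 = {false_alarms(SPEC, prevalence, 10_000):.0f}"
    )
```

The rule is identical in both settings, but in the clinic it flags roughly 500 healthy people for every one real case.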
2. "Cost-effective" ignores the human toll
Their costs only count test and visit fees, not:
- 2 weeks of anxiety waiting for biopsy results
- Radiation from "precautionary" CT scans (cancer risk!)
- Complications from unnecessary procedures
- Time off work
- Psychological trauma of false cancer scares
3. The physician comparison was rigged
Docs were banned from:
- Googling symptoms
- Consulting colleagues
- Using UpToDate/medical databases
- Calling specialists
That's not how we practice!!
It's like testing a chef who can't use recipes or taste their food.
4. The "Retrospective Oracle" Problem
These cases were already SOLVED and published.
Real medicine involves genuine uncertainty - sometimes the diagnosis is never found. Does the AI know when to stop investigating?
5. No "When to Stop" Testing
Great doctors know when NOT to test. This AI was never evaluated on:
"This headache is just stress"
"Let's wait and see"
"More tests will cause more harm than good"
The benchmark rewards finding zebras, not recognising horses.
Don't get me wrong - this tech is amazing, and I have no doubt I could be replaced in the not-so-near future.
But we need:
- Testing on actual patient populations (mostly healthy!)
- Measuring overdiagnosis harm
- Real-world physician comparisons
Final thought: We don't need AI that can diagnose every rare disease. We need AI that knows when to diagnose and when to reassure. That's the real art of medicine.
But what do you think?
If you liked this post, please follow me @DrDominicNg and retweet.
It takes me some time to read and write these posts so I'd love to get more people's thoughts on it!
CRISPR just scored its biggest win yet against Huntington's.
The secret? A delivery system called RIDE that sneaks into neurons, makes its edit, then vanishes within 72 hours.
Here's what happened 🧵
First - what is Huntington's?
Picture DNA as a sentence. In Huntington's, one word gets repeated too many times: CAG-CAG-CAG-CAG... This repetition produces a toxic form of the huntingtin protein that kills brain cells.
That leads to uncontrolled movements, emotional instability and progressive dementia.
Current medicine can only dull the symptoms - nothing stops the underlying problem.
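The "repeated word" picture translates directly into code. Below is a toy Python sketch that finds the longest uninterrupted CAG run in a DNA string. It then buckets that count using the commonly cited clinical cut-offs for the HTT gene: 26 or fewer repeats is normal, 27 to 35 is intermediate, 36 to 39 is reduced penetrance, and 40 or more is full penetrance. It's illustrative only. Real testing sizes the repeat in the lab rather than scanning text, and the example sequence is made up.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Number of repeats in the longest uninterrupted CAG-CAG-CAG... run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeats(n: int) -> str:
    """Bucket a CAG repeat count using commonly cited HTT cut-offs."""
    if n <= 26:
        return "normal"
    if n <= 35:
        return "intermediate"
    if n <= 39:
        return "reduced penetrance"
    return "full penetrance"


# Made-up sequence: 42 CAG repeats with unrelated flanking bases.
example = "ATGGCG" + "CAG" * 42 + "CCGCCA"
n = longest_cag_run(example)
print(n, classify_repeats(n))
```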
DeepMind just dropped a 106-page paper unveiling AlphaGenome.
This single model could completely redefine how we discover disease-causing mutations and drug targets.
This is massive. 🧵
The challenge?
>98% of human variants lie in non-coding DNA, where they exert INDIRECT regulatory effects on the proteins your body makes.
The problem with current tools: 1. Models either analyse long DNA sequences to capture distant interactions but produce blurry, low-resolution output, or provide sharp, single-base detail on a tiny snippet while missing the broader context.
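Here's a toy illustration of the "blurry output" half of that trade-off. If a model reports predictions in wide bins, a single-base effect gets averaged away. The track and the 128-base bin width below are arbitrary choices for illustration, not the settings of AlphaGenome or any specific model.

```python
def bin_average(track: list[float], bin_size: int) -> list[float]:
    """Average-pool a per-base signal into fixed-width bins (lower resolution)."""
    bins = []
    for start in range(0, len(track), bin_size):
        window = track[start:start + bin_size]
        bins.append(sum(window) / len(window))
    return bins


# Toy per-base track: 1,024 bases of background with one single-base effect.
track = [0.0] * 1024
track[500] = 1.0

coarse = bin_average(track, 128)  # 1,024 / 128 = 8 bins
peak_bin = max(range(len(coarse)), key=lambda i: coarse[i])
print(f"{len(coarse)} bins; peak in bin {peak_bin}, height {coarse[peak_bin]}")
```

At single-base resolution, the effect sits at full height on base 500. After binning, it is 1/128 as tall and smeared across bases 384 to 511.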
While everyone was obsessing over CRISPR, a small team just quietly published a paper in Science solving genetic medicine's biggest problem.
They created a system that can fix thousands of different mutations at once. Here's how they did it 🧵
Current gene editing 101: You inherit a disease-causing mutation → CRISPR-Cas9 targets that exact DNA sequence → cleaves both strands → the cell repairs it using a correct template. Already curing sickle cell. Already reversing genetic blindness. Already changing medicine.
But CRISPR has a massive blind spot: it needs to know EXACTLY which mutation to fix. Problem is, most genetic diseases aren't that simple. Cystic fibrosis? Over 2,000 different mutations. Same disease, different typos in every patient. Now what?
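Here's a rough sketch of what "targets that exact DNA sequence" means for the commonly used SpCas9 enzyme. It needs a 20-letter guide match sitting right next to an "NGG" motif (the PAM). The Python below makes two simplifications: it scans only the forward strand, and it counts a guide as matching only on a perfect hit. Real design tools also scan the reverse strand and score near-matches, and the sequence here is made up. The sketch also shows the blind spot: change one letter, and the guide built for the original sequence no longer matches.

```python
import re

SPACER_LEN = 20  # SpCas9 guides are typically 20 nucleotides


def find_cas9_sites(dna: str) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for each NGG PAM on the forward strand
    that has a full 20-nt protospacer immediately upstream (5') of it."""
    dna = dna.upper()
    sites = []
    # Zero-width lookahead so overlapping PAMs (e.g. in 'AGGG') are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", dna):
        pam_start = match.start()
        if pam_start >= SPACER_LEN:
            protospacer = dna[pam_start - SPACER_LEN:pam_start]
            sites.append((pam_start - SPACER_LEN, protospacer, match.group(1)))
    return sites


# Made-up sequence: a 20-nt target followed by an AGG PAM.
reference = "ACGTACGTACGTACGTACGT" + "AGG" + "TTT"
sites = find_cas9_sites(reference)
guide = sites[0][1]

# A different patient's "typo": one base changed inside the target (G -> T at index 10).
variant = reference[:10] + "T" + reference[11:]
print(sites)
print("guide matches reference:", guide in reference)
print("guide matches variant:", guide in variant)
```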