Dr. Dominic Ng

@DrDominicNg

Jun 30 • 13 tweets • 4 min read

Microsoft claims their new AI framework diagnoses 4x better than doctors.

I'm a medical doctor and I actually read the paper. Here's my perspective on why this is both impressive AND misleading ... 🧵

What did they create? Two key innovations:
1. SDBench: A testing environment using 304 real medical mysteries from NEJM where AI starts with just "29yo woman with sore throat" and must decide what to ask/test next

2. MAI-DxO: An AI system that simulates 5 doctors working together as a team

How did they test the AI/Doctors?
They took 304 real cases from NEJM and turned them into an interactive game.

The setup:
Step 1: You (human doctor or AI) get a tiny intro like: "52-year-old man with fever and breathing problems." That's it. No test results, no detailed history - just like a patient walking into the ER.

Step 2: There's a "Gatekeeper" (another AI) that has the full case file but won't tell you anything unless you specifically ask.

Step 3: You can do three things:
1. Ask questions ("Any recent travel?" "Is there chest pain?")
2. Order tests ("CBC" "Chest X-ray" "CT scan")
3. Make your final diagnosis ("This is pneumonia")

Step 4: The Gatekeeper then answers the question. BUT it only reveals what you ask for. If you don't think to ask about travel history, you won't find out the patient just returned from a cave expedition (real case - histoplasmosis).

Step 5: Every test costs money (real US hospital prices). Every round of questions = $300 office visit.
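The five steps above can be sketched as a small simulation. This is a minimal illustration, not the actual SDBench code: the `Gatekeeper` class, its method names, the chest X-ray price and finding, and the exact-match diagnosis check are all assumptions made for the example. Only the $300 charge per round of questions and the cave-expedition detail come from the thread.

```python
from dataclasses import dataclass

VISIT_COST = 300  # USD per round of questions, as stated in the thread


@dataclass
class Gatekeeper:
    """Holds the full case file and reveals only what is asked for."""
    findings: dict[str, str]     # hypothetical case file: query -> answer
    test_prices: dict[str, int]  # hypothetical price list in USD
    diagnosis: str
    spent: int = 0

    def ask(self, questions: list[str]) -> dict[str, str]:
        # One round of questions is billed as a single office visit.
        self.spent += VISIT_COST
        return {q: self.findings.get(q, "not recorded") for q in questions}

    def order(self, test: str) -> str:
        # Each test adds its own price to the running total.
        self.spent += self.test_prices[test]
        return self.findings.get(test, "unremarkable")

    def diagnose(self, guess: str) -> bool:
        # Simplification: exact string match instead of a graded judgment.
        return guess.lower() == self.diagnosis.lower()


# Hypothetical case: the X-ray finding and its price are made up.
case = Gatekeeper(
    findings={
        "recent travel": "returned from a cave expedition",
        "chest x-ray": "diffuse nodular infiltrates",
    },
    test_prices={"chest x-ray": 120},
    diagnosis="histoplasmosis",
)

answers = case.ask(["recent travel", "chest pain"])
xray = case.order("chest x-ray")
correct = case.diagnose("Histoplasmosis")
print(answers, xray, correct, case.spent)
```

The travel history only shows up because it was asked for. Drop that question and the key clue never reaches the diagnostician, while the bill keeps growing with every round.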

MAI-DxO isn't a new model but a framework built on top of existing LLMs (ChatGPT, Claude, Gemini).

How does this framework work?
It asks the LLM to simulate a virtual panel of 5 specialised AI doctors:
Dr. Hypothesis (tracks diagnoses)
Dr. Test-Chooser (selects optimal tests)
Dr. Challenger (plays devil's advocate)
Dr. Stewardship (manages costs)
Dr. Checklist (quality control)

They then argue it out among themselves to decide the best path forward.

The results?
📊 Accuracy:
Doctors: 20% (ouch)
Standard AI: 30-79%
MAI-DxO: 80-85.5%

💰 Cost per case:
Doctors: $2,963
Standard AI (o3): $7,850
MAI-DxO: $2,397

On paper, the AI was 4x more accurate AND cheaper...
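Here is roughly where the "4x" comes from, using only the figures quoted above. The variable names are my own and the percentages are simple ratios of the thread's numbers, not values taken from the paper.

```python
# Accuracy figures quoted in the thread, in percent.
doctor_acc = 20.0
mai_acc_low, mai_acc_high = 80.0, 85.5

# Cost per case quoted in the thread, in USD.
doctor_cost, o3_cost, mai_cost = 2963, 7850, 2397

ratio_low = mai_acc_low / doctor_acc    # 4.0
ratio_high = mai_acc_high / doctor_acc  # 4.275

saving_vs_doctors = doctor_cost - mai_cost  # 566
saving_vs_o3 = o3_cost - mai_cost           # 5453

print(f"Accuracy ratio: {ratio_low:.2f}x to {ratio_high:.2f}x")
print(f"Saving vs doctors: ${saving_vs_doctors} ({saving_vs_doctors / doctor_cost:.0%})")
print(f"Saving vs o3 alone: ${saving_vs_o3} ({saving_vs_o3 / o3_cost:.0%})")
```

So the headline ratio holds on these numbers, and the cost saving against doctors is about a fifth per case. Whether the comparison itself is fair is a separate question.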

But there are five issues I see:
1. They used ZERO healthy patients
95% of sore throats are viral and this AI was only tested on incredibly rare diagnostic cases.

We don't know if it will order biopsies on every patient with a sore throat "just to rule out rhabdomyosarcoma."

2. "Cost-effective" ignores the human toll
Their costs only count lab fees, not:
- 2 weeks of anxiety waiting for biopsy results
- Radiation from "precautionary" CT scans (cancer risk!)
- Complications from unnecessary procedures
- Time off work
- Psychological trauma of false cancer scares

3. The physician comparison was rigged
Docs were banned from:
❌ Googling symptoms
❌ Consulting colleagues
❌ Using UpToDate/medical databases
❌ Calling specialists

That's not how we practice!!
It's like testing a chef who can't use recipes or taste their food.

4. The "Retrospective Oracle" Problem
These cases were already SOLVED and published.

Real medicine involves genuine uncertainty - sometimes the diagnosis is never found. Does the AI know when to stop investigating?

5. No "When to Stop" Testing
Great doctors know when NOT to test. This AI was never evaluated on:

"This headache is just stress"
"Let's wait and see"
"More tests will cause more harm than good"

The benchmark rewards finding zebras, not recognising horses.
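The zebra-versus-horse point can be made concrete with a base-rate calculation. Every number below is a made-up assumption for illustration (a 1-in-1,000 prevalence, 85% sensitivity, 95% specificity). None of them come from the paper or from this thread.

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Share of positive calls that are true positives (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical clinic: 1 in 1,000 sore throats hides a rare serious disease.
clinic_ppv = positive_predictive_value(0.001, 0.85, 0.95)

# Benchmark-like setting: every case has a real diagnosis to find,
# so a false alarm on a healthy patient cannot happen.
benchmark_ppv = positive_predictive_value(1.0, 0.85, 0.95)

print(f"Clinic PPV: {clinic_ppv:.1%}")
print(f"Benchmark PPV: {benchmark_ppv:.1%}")
```

Under these assumed numbers, fewer than 2 in 100 alarms in the clinic would be real, while the same system looks flawless on a set where everyone is sick. That gap is exactly what a benchmark with zero healthy patients cannot measure.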

Don't get me wrong - this tech is amazing and I have no doubt I might be getting replaced in the not so near future.

But we need:
✓ Testing on actual patient populations (mostly healthy!)
✓ Measuring overdiagnosis harm
✓ Real-world physician comparisons

Final thought: We don't need AI that can diagnose every rare disease. We need AI that knows when to diagnose and when to reassure. That's the real art of medicine.

But what do you think?

If you liked this post please follow me @DrDominicNg and retweet.

It takes me some time to read and write these posts so I'd love to get more people's thoughts on it!

I've also just started a new newsletter on neuroscience:
brainhealthdecoded.substack.com/subscribe
