Alright, let's do this one last time. Predictions vs probabilities. What should we give doctors when we use #AI / #ML models for decision making or decision support?

#epitwitter

1/21
First, we need to ask: is there a difference?

This is a weird question, right? Of course there is! One is a categorical class prediction, the other is a continuous variable. Stats 101, amirite?

Well, no.

2/21
Let's set out the two ways that probabilities are supposed to be different from class predictions.

1) they are continuous, not categorical
2) they are probabilities, meaning the numbers reflect some truth about a patient group and are not arbitrary

Weeeeell...

3/21
If you give a doctor a number between 1 and 100, is there a difference between a 100 class categorical and a probability score? Prima facie to the user they are the same thing.

That is a bit cheeky though; the usual comparison is a probability vs a classifier with a small number of classes.

4/21
But if you humour me and accept the idea that 100- or 1000-class categoricals are how probabilities are presented to docs, it highlights an important problem: how many categories is reasonable? What does the actual evidence justify?

5/21
If you want 100 categories, you are suggesting that 1% changes are relevant and reliable.

They are not. For any model. Ever.

Take QRISK for example. Trained on 1.2 million pts, validated on 600k. The model's estimates of probability are consistently off by 5 to 10 percent.

6/21
This is called the calibration of the model. It is *always* a bit off. The type of off it is changes with new populations.

If you are making decisions based on small changes in predicted probability, you are making decisions based on statistical noise.

7/21
(This isn't even about confidence intervals. Those are yet another source of variation.)

7b/21
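To make the calibration point concrete, here is a minimal sketch (illustrative numbers only, not from any real model) of why a 1-point difference between two patients' predicted risks is swamped by a 5-10 point calibration drift of the kind seen when a score moves to a new population:

```python
# Sketch: why 1-point differences in predicted risk are not meaningful
# when calibration drift in a new population is on the order of 5-10
# points. All numbers here are illustrative assumptions.

def shifted_risk(predicted, drift):
    """Risk after an (unknown) calibration shift, clipped to [0, 1]."""
    return max(0.0, min(1.0, predicted + drift))

patient_a = 0.08   # model says 8%
patient_b = 0.09   # model says 9% -- a "1% difference"

# A plausible drift range for a score applied to a new population
for drift in (-0.05, 0.0, +0.05):
    a = shifted_risk(patient_a, drift)
    b = shifted_risk(patient_b, drift)
    print(f"drift {drift:+.2f}: A={a:.2f}, B={b:.2f}, gap={b - a:.2f}")

# The 1-point gap between A and B stays put, but either patient's true
# risk could sit anywhere in roughly a 10-point window: the fine-grained
# distinction is noise relative to the calibration uncertainty.
```

The point of the sketch is only scale: the uncertainty band is an order of magnitude wider than the distinction a 100-category presentation implies.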
Don't just take my word for it. There is plenty of literature on the unreliability of granular interpretations of risk scores, and on variability that is much higher than expected, especially for individuals rather than populations.

bmcmedicine.biomedcentral.com/articles/10.11…

journals.plos.org/plosone/articl…

8/21
So 100 categories is probably unreasonable. Maybe 20 categories is a reasonable balance of information vs noise?

Well, what about the range of the probabilities? Risk scores are all concentrated at the low probabilities. "You have a 100% chance of X," said no risk score ever.

9/21
QRISK, mentioned above, maxes out at around 20% probability of a "cardiovascular event" within ten yrs.

So in practice, you only have 20 noisy levels to play with in the first place!

So given the statistical noise, maybe we should cut that down further? 10 levels? 5 levels?

10/21
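If you buy the argument that only a handful of levels survive the noise, the presentation reduces to simple binning. A minimal sketch, with illustrative bin edges (not from any guideline):

```python
# Sketch: presenting risk as a few broad categories instead of a
# fine-grained percentage. The bin edges are illustrative assumptions,
# not taken from QRISK or any clinical guideline.

def risk_category(prob):
    """Map a predicted probability to one of a few broad levels."""
    if prob < 0.05:
        return "low"
    elif prob < 0.10:
        return "moderate"
    elif prob < 0.20:
        return "high"
    else:
        return "very high"

for p in (0.02, 0.08, 0.15, 0.25):
    print(f"{p:.0%} -> {risk_category(p)}")
```

Note the design choice: once the bins are this broad, a 5-point calibration drift moves few patients across a boundary, which is exactly the robustness being argued for.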
PS for those invoking "shared decision making" as if percentages are more intuitive for patients, the following conversations are super fun:

Doc: you are high risk for heart attacks, you need to start statins.
Them: oh no! What is my risk?
Doc: 8%!
Them:

11/21
(That isn't an attack on patients, btw. Doctors are notoriously terrible at understanding pre-test and post-test probabilities.

bbc.com/news/magazine-…)

11b/21
Maybe at this point it is worth considering how a risk score gets produced and is used?

You don't just make a risk score and say "here you are doctors, use this as you will".

12/21
Instead a risk score with validated calibration and discrimination is only the starting point. Now you have to ask "well, what should we use it for?"

**This involves more research**

Usually an expert working group will decide on a use case, and clinicians start using it.

13/21
Then after a while you check to see if it helped. An example would be with QRISK, offering statins to everyone with greater than 10% probability of event within 10 yrs.

Guess what? That got updated to everyone above 7.5% after some years. Because more evidence came in.

14/21
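That workflow, a validated score made actionable only through a guideline cut-off, which itself gets revised as evidence accrues, can be sketched like this (function name and structure are hypothetical, the 10% and 7.5% cut-offs are from the thread):

```python
# Sketch: a risk score only becomes clinically actionable via a
# validated threshold, and the threshold itself changes with evidence
# (e.g. the QRISK statin cut-off moving from 10% to 7.5%).
# The function name is hypothetical, for illustration only.

def offer_statins(ten_year_risk, threshold=0.075):
    """Apply a guideline cut-off to a predicted 10-year event risk."""
    return ten_year_risk >= threshold

patient = 0.08  # predicted 8% 10-year risk

# Under the old 10% cut-off this patient is not offered statins;
# under the updated 7.5% cut-off they are.
print(offer_statins(patient, threshold=0.10))
print(offer_statins(patient, threshold=0.075))
```

The substantive point is that the threshold is the clinically validated object here, not the raw probability: updating evidence changed the cut-off, not how individual doctors weigh percentages.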
Turns out the cut-offs and categories we use in practice are informed by real-world evidence or consensus expert opinion. Weighing raw probabilities like a math whiz, by contrast, comes down to idiosyncratic choices by individuals.

Very idiosyncratic.

academic.oup.com/qjmed/article/…

15/21
Now, don't get me wrong. Doctors have a huge degree of autonomy for a reason. They are allowed to be idiosyncratic. They can misuse probabilities and risk scores. They can operate with a rusty spoon if they want.

They just have to bear the responsibility.

16/21
But they have to be honest with themselves and their patients. That isn't evidence-based medicine. Which, again, is fine. Most of what we do isn't evidence based.

But probabilities, particularly when presented to patients, seem *so damn science* that they are convincing.

17/21
I've spoken about automation bias and cognitive load and similar concepts before, but apparently this isn't convincing to many people, who think we are rational actors who can accurately weight information (like... what? People think this?)

I won't cover this again.

18/21
Some links though (as a warning, the literature is complex, contradictory, and reliant on very simple clinical decision systems, since more complex ML/AI hasn't been widely used and tested in this way).

journals.sagepub.com/doi/10.1177/00…
qualitysafety.bmj.com/content/24/7/4… ncbi.nlm.nih.gov/pmc/articles/P…

19/21
But the overall point is that probabilities claim we know things, at a fine-grained level, that we don't have evidence for.

Categorisation instead allows us to validate specific decision thresholds in real world practice, and broad classes are less affected by variability.

20/21
I don't know. I'm pretty over this discussion 😓

Please don't start talking about boating and weather apps again. I seriously doubt most sailors are making fine grained choices based on anything more than a few broad categorical predictions.

21/21
Keep Current with Luke Oakden-Rayner