Docs are ROCs: A simple fix for a methodologically indefensible practice in medical AI studies.
Widely used methods to compare doctors to #AI models systematically underestimate doctors, making the AI look better than it is! We propose a solution.
The most common method to estimate average human performance in #medical AI is to average sensitivity and specificity as if they are independent. They aren't though - they are inversely correlated on a curve.
The average points will *always* be inside the curve.
2/7
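To make the averaging problem concrete, here is a minimal sketch. It assumes five hypothetical readers who all sit on the same concave ROC curve (an equal-variance binormal curve, chosen purely for illustration) and differ only in how aggressively they call disease. The reader values and the `separation` parameter are made up, not taken from any study.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def roc_tpr(fpr, separation=1.5):
    """Sensitivity at a given false positive rate on an equal-variance
    binormal ROC curve (an assumed, illustrative curve shape)."""
    return STD_NORMAL.cdf(separation + STD_NORMAL.inv_cdf(fpr))

# Hypothetical readers: identical skill, different decision thresholds.
reader_fprs = [0.02, 0.05, 0.15, 0.30, 0.50]
reader_tprs = [roc_tpr(f) for f in reader_fprs]

# The common practice: average the two rates separately.
avg_fpr = mean(reader_fprs)
avg_tpr = mean(reader_tprs)

# What the shared curve actually achieves at that same false positive rate.
curve_tpr_at_avg_fpr = roc_tpr(avg_fpr)
shortfall = curve_tpr_at_avg_fpr - avg_tpr

print(f"averaged reader: FPR={avg_fpr:.3f}, sensitivity={avg_tpr:.3f}")
print(f"curve at same FPR: sensitivity={curve_tpr_at_avg_fpr:.3f}")
print(f"sensitivity lost to averaging: {shortfall:.3f}")
```

Every reader is on the curve, yet the "average reader" lands below it. That is what happens whenever you average points along a concave curve.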
The only solution currently is to force doctors to rate images using confidence scores. While this works well in the few tasks where these scales are used in clinical practice, what does it mean to say you are 6/10 confident that there is a lung nodule?
3/7
Most clinical tasks have 2 (or 3) decision options.
Treat or don't. Biopsy or not.
Forcing doctors to do things that aren't part of their clinical practice is a terrible way to test their performance. We think if a task is binary, test the doctors that way.
4/7
So we suggest a simple off-the-shelf method: SROC analysis. Widely used in the meta-analysis of diagnostic accuracy, SROC is a well understood and validated way to summarise performance across diagnostic experiments.
For AI-human comparisons, each reader is an experiment.
5/7
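As a rough illustration of the idea, the sketch below fits one classic SROC model (the Moses-Littenberg logit regression) to a handful of reader operating points. This is an assumption for illustration only: it may not be the SROC variant used in the paper, and the reader values and function names (`fit_sroc`, `sroc_sensitivity`) are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def fit_sroc(points):
    """Moses-Littenberg fit. points: (sensitivity, specificity) pairs,
    one per reader, each strictly between 0 and 1.
    Regresses D = logit(TPR) - logit(FPR) on S = logit(TPR) + logit(FPR)."""
    d, s = [], []
    for sens, spec in points:
        fpr = 1 - spec
        d.append(logit(sens) - logit(fpr))
        s.append(logit(sens) + logit(fpr))
    s_mean = sum(s) / len(s)
    d_mean = sum(d) / len(d)
    sxy = sum((si - s_mean) * (di - d_mean) for si, di in zip(s, d))
    sxx = sum((si - s_mean) ** 2 for si in s)
    slope = sxy / sxx
    intercept = d_mean - slope * s_mean
    return intercept, slope

def sroc_sensitivity(fpr, intercept, slope):
    """Sensitivity on the fitted summary curve at a given false positive rate.
    Follows from D = intercept + slope * S solved for logit(TPR)."""
    return expit((intercept + (1 + slope) * logit(fpr)) / (1 - slope))

# Hypothetical readers, each treated as one "experiment".
readers = [(0.60, 0.95), (0.70, 0.90), (0.80, 0.80), (0.88, 0.65), (0.93, 0.50)]
intercept, slope = fit_sroc(readers)

for fpr in (0.1, 0.2, 0.3):
    sens = sroc_sensitivity(fpr, intercept, slope)
    print(f"FPR={fpr:.1f} -> summary sensitivity={sens:.3f}")
```

The output is a summary curve for the readers rather than a single averaged point, so it can be drawn on the same axes as the model's ROC curve.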
We show how it works by re-evaluating several famous medical AI papers, for example Esteva et al on melanoma (below).
We think this is something everyone can do, and will improve the quality of reporting for AI vs human medical studies.
Check out the blog for more details.
6/7
As a quick final note: this doesn't only apply to medical AI studies. We often use similar methods in the radiology literature when we try to determine the accuracy of a test. The SROC approach applies equally well in normal diagnostic research.
7/7
PS better mention @PalmerLyle who coauthored the paper with me, had the original idea, and inspired my favourite self made gif ever.
Alright, let's do this one last time. Predictions vs probabilities. What should we give doctors when we use #AI / #ML models for decision making or decision support?
This discussion was getting long, so I thought I'd lay out my thoughts on a common argument: should models produce probabilities or decisions? I.e. "32% chance of cancer" vs "do a biopsy".
I favour the latter, because IMO it is both more useful and... more honest. IMO:
I personally suspect the biggest problem is automation bias, which is where the human over-relies on the model output.
Similar to self driving cars where jumping to complete automation appears to be safer than partial automation.
But interestingly (and perhaps counter-intuitively) this could also mean that "blind" ensembling (where the human gets no AI input, and the human and AI opinions are combined algorithmically) might be better than showing the doctor what the AI thinks.
@weina_jin The weird thing about CV in AI is that you don't actually end up with a single model. You end up with k different models and sets of hyperparameters.
It allows an estimate of generalisation for a *group* of models, but that is still a step removed from a deployable system.
1/ While this will play well (and get cited a lot) among the anti-#deeplearning holdouts, I was left a bit underwhelmed. I wanted to find some interesting edge cases where DL is not working (so we can work out solutions), but instead got a set of pretty unreasonable comparisons
2/ The deep learning models are tiny (4 conv layers) with justification that it works for MNIST. Everything works for MNIST! Linear regression works for MNIST!
We know in complex images deeper and more complex is vastly better, and does less overfitting!
3/ The linear and non-deep models are not "apples to apples" either though. This isn't deep learning vs simple models, it is deep learning vs incredibly complex feature engineering built up over decades of research.
Well, here is the 6 months later follow up on @Annals_Oncology paper by Haenssle et al, "Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists."
The paper claims "Most dermatologists were outperformed by the CNN", a bold statement. The relevant part of the paper is pictured.
I raised several concerns in those tweets:
1) they compared two different metrics (ROC-AUC vs ROC area) as if they were the same 2) they used average human performance 3) they seemed to cheat when picking an operating point for the model