Artificial neural networks can make impressively accurate predictions, but we must look at these models skeptically, especially in medicine.
This #PrePrint finds that many #AI systems designed to recognize #COVID19 on CXRs are learning shortcuts, not signal. bit.ly/3c72lUo
A🧵
1/
The authors found that these apparently impressive ANNs generalized poorly (i.e., performance was much worse on an external validation set than on the training set).
Compare the red vs. green ROC curves: the AUC drops from 0.99 to 0.7! Yikes! (A toy illustration of this check is sketched below.)
2/
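Here's a minimal, self-contained Python sketch of that failure mode (all data synthetic and hypothetical, not from the preprint): a classifier leaning on a source-specific "shortcut" feature looks near-perfect on internal data but collapses on an external set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000

# "Internal" data: column 0 is a source-specific shortcut that tracks the
# label almost perfectly; column 1 carries a weak genuinely predictive signal.
y_int = rng.integers(0, 2, n)
X_int = rng.normal(size=(n, 5))
X_int[:, 0] = y_int + rng.normal(scale=0.1, size=n)   # the shortcut
X_int[:, 1] += 0.5 * y_int                            # weak real signal

# "External" data from another source: same weak signal, no shortcut.
y_ext = rng.integers(0, 2, n)
X_ext = rng.normal(size=(n, 5))
X_ext[:, 1] += 0.5 * y_ext

model = LogisticRegression(max_iter=1000).fit(X_int, y_int)
print("internal AUC:", roc_auc_score(y_int, model.predict_proba(X_int)[:, 1]))
print("external AUC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
# Expect near-perfect internal AUC but a sharp drop externally, mirroring
# the 0.99 -> 0.7 collapse in the preprint's ROC curves.
```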
There’s a reason for this: they used one dataset for all their positive images and a separate dataset for all their negative images.
This invites confounding, because the model can pick up on any number of differences between the CXRs that aren’t clinically meaningful. (Toy demo below.)
3/
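To see why the single-source-per-class design is so dangerous, here's a tiny toy demo (purely hypothetical numbers): when every positive comes from one site and every negative from another, label and source are perfectly correlated, so any site-specific artifact becomes a "perfect" predictor.

```python
import numpy as np

# Hypothetical dataset assembly mirroring the flawed design:
# every COVID+ image from hospital A, every COVID- image from hospital B.
labels  = np.array([1] * 500 + [0] * 500)       # 1 = COVID+, 0 = COVID-
sources = np.array(["A"] * 500 + ["B"] * 500)   # acquisition site

# Label and source agree on every example, so any site-specific artifact
# (laterality markers, text overlays, contrast) predicts the label perfectly.
print(np.mean((sources == "A") == (labels == 1)))  # 1.0 -> total confounding
```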
Turns out the neural networks weren't actually looking at the chest in the CXR; they were ‘cheating’ by looking at laterality markers and other parts of the images that differed between the two datasets.
Look at the saliency maps (the pixels the model deems most important; a sketch of one common way to compute them follows this tweet):
4/
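For readers curious how maps like these are produced, here's a hedged PyTorch sketch of plain gradient saliency, one common technique (the preprint's exact method may differ; the model and input below are untrained stand-ins):

```python
import torch
import torchvision.models as models

# Untrained stand-in for a CXR classifier (assumes torchvision >= 0.13
# for the `weights=` argument).
model = models.resnet18(weights=None)
model.eval()

# Stand-in for a preprocessed chest X-ray, with gradients enabled.
x = torch.randn(1, 3, 224, 224, requires_grad=True)

score = model(x)[0].max()   # score of the highest-scoring class
score.backward()            # gradient of that score w.r.t. input pixels

# Saliency = per-pixel gradient magnitude (max over color channels).
# Bright regions are the pixels the prediction is most sensitive to;
# in the preprint these lit up markers and image borders, not the lungs.
saliency = x.grad.abs().max(dim=1)[0].squeeze()
print(saliency.shape)  # torch.Size([224, 224])
```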
The AI ‘learned’ that the specific laterality markers used in one hospital go with the COVID19+ cases & a different marker used in another hospital goes with the COVID19- cases.
This marker-cheating phenomenon was also described for pneumonia (PNA) diagnosis in 2018:
arxiv.org/abs/1807.00431
5/
3 things we should do:
🤔be skeptical: an AUC of 0.99 for detecting COVID on CXR is simply too good to be true
💾use better data: draw both + & - examples from the same sources for training, and validate on a separate, external dataset (see the sketch after this list)
👩⚕️involve medical experts in developing AI models so they can spot these biases
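A hedged sketch of one way to implement the "use better data" advice: a leave-one-site-out split, where the validation site is never seen during training, so a model that memorizes site-specific artifacts can't score well (the sites and arrays below are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.rand(6, 10)                 # placeholder image features
y = np.array([1, 0, 1, 0, 1, 0])          # + and - examples from EVERY site
sites = np.array(["A", "A", "B", "B", "C", "C"])

# Each fold trains on two sites and validates on the held-out third.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sites):
    print("train sites:", sorted(set(sites[train_idx])),
          "-> test site:", set(sites[test_idx]))
```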