1) they compared two different metrics as if they were the same thing
2) they used average human performance
3) they seemed to cheat when picking an operating point for the model
Each of these biases the comparison in favour of the model.
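To make this concrete, here is a minimal sketch with entirely simulated readers and model scores (none of these numbers come from the paper): it contrasts the "average human point versus model ROC curve" comparison with a fairer per-reader comparison at matched specificity.

```python
# A minimal sketch of the comparison problem, with entirely made-up numbers
# (readers, labels and model scores are simulated; nothing comes from the paper).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Simulated test set: labels plus model scores that separate the classes reasonably well.
y = rng.integers(0, 2, size=2000)
scores = 0.35 * y + rng.normal(0.5, 0.15, size=2000)
fpr, tpr, _ = roc_curve(y, scores)

def model_sens_at(spec):
    """Model sensitivity read off its ROC curve at a given specificity."""
    return np.interp(1 - spec, fpr, tpr)

# Hypothetical readers, each sitting at a different sensitivity/specificity trade-off.
readers = np.array([[0.95, 0.70],
                    [0.85, 0.85],
                    [0.70, 0.95]])  # columns: sensitivity, specificity

# Problems 1+2: collapse the readers into one "average human" point and compare it
# with the model's whole ROC curve. Averaging across different trade-offs puts the
# point below the readers' own interpolated curve, which flatters the model.
avg_sens, avg_spec = readers.mean(axis=0)
order = np.argsort(readers[:, 1])
readers_curve_sens = np.interp(avg_spec, readers[order, 1], readers[order, 0])
print(f"average reader point:  sens {avg_sens:.3f} at spec {avg_spec:.3f}")
print(f"readers' curve there:  sens {readers_curve_sens:.3f}")
print(f"model at that spec:    sens {model_sens_at(avg_spec):.3f}")

# Fairer: compare each reader with the model at that reader's own specificity.
for sens, spec in readers:
    better = "reader" if sens > model_sens_at(spec) else "model"
    print(f"reader ({sens:.2f}, {spec:.2f}) vs model sens {model_sens_at(spec):.2f} -> {better} ahead")
```

With readers spread across different trade-offs, the averaged point sits below the readers' own interpolated curve, so the "average human" is a weaker opponent than the reader data actually supports.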
But I also understand that journals have processes to follow, so I did what they asked.
This boggles my mind.
Is there any world where this does not need to be fixed within a week of being pointed out? No letters to the editor needed.
None of the beginners/trainees "outperformed" the model.
Of the non-specialised group, only 1 out of 30 outperformed it.
I suspected, but wasn't certain, that they had cheated to find the model's specificity. By cheating, I mean using the test results to select the operating point. This is a big no-no, because in clinical practice you don't have these results in advance.
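For what it's worth, here is a minimal sketch of the difference, again with simulated data and a made-up 90% specificity target rather than anything from the paper: the legitimate version picks the threshold on data that would be available before deployment, while the "cheat" tunes it on the very test set being reported.

```python
# Simulated illustration of operating-point selection; nothing here is from the paper.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

def simulate(n):
    """Fake labels and model scores for one data split."""
    y = rng.integers(0, 2, size=n)
    s = 0.35 * y + rng.normal(0.5, 0.15, size=n)
    return y, s

y_val, s_val = simulate(500)    # data you could legitimately tune on
y_test, s_test = simulate(500)  # held-out test set you report results from

def threshold_for_specificity(y, s, target_spec=0.90):
    """Lowest threshold whose specificity on (y, s) still meets the target."""
    fpr, _, thresholds = roc_curve(y, s)
    return thresholds[fpr <= 1 - target_spec][-1]

def sens_spec(y, s, threshold):
    """Sensitivity and specificity of the thresholded scores."""
    pred = s >= threshold
    sens = (pred & (y == 1)).mean() / (y == 1).mean()
    spec = (~pred & (y == 0)).mean() / (y == 0).mean()
    return sens, spec

# Legitimate: choose the operating point on validation data, report it on the test set.
t_honest = threshold_for_specificity(y_val, s_val)
# The "cheat": choose the operating point using the test results themselves.
t_cheat = threshold_for_specificity(y_test, s_test)

print("threshold picked on validation: ", sens_spec(y_test, s_test, t_honest))
print("threshold tuned on the test set:", sens_spec(y_test, s_test, t_cheat))
```

By construction, the second threshold lands exactly on the most favourable point of the test ROC curve that meets the specificity target; the first carries the usual validation-to-test slippage, which is what real deployment looks like.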
IMO, the six-month-long process has done nothing useful except provide a paywalled citation to the original paper. A paper that should have been fixed in peer review remains unchanged and is lauded as a top paper of the year.
Instead, I just feel like the time I spent was wasted.