Proprietary prediction models are widely implemented in health care. Let's talk about why they exist and whether we can (or should) move away from them.
Let's start with a poll. I'll return to this soon with a story about the slow death of an open model.
Why are they used at all?
All are partially true (will explain) but C is right. Proprietary models are used bc EHRs suffer from a last-mile problem. While scientists debate whether models should even be made available (nature.com/articles/s4158…) the truth is that we don’t have many ways to implement models.
(Check out the @nature letter to @Google informing them about Colab)
So, the reason that hundreds of US hospitals use proprietary prediction models today has to do with the fact that these are EHR vendor-developed models and are thus easiest to implement in the EHR itself.
But I said all of the answers were partially right. How is that possible?
It's worth considering what a proprietary model actually is.
Which definition best captures the characteristics of a proprietary model?
Is it a model whose...
...variables are not known?
...form/coefficients are not known?
...performance is not known?
...use requires $?
D is most correct but others can be true. EHR vendors usually provide information on which variables (+/- actual coefficients) and info on model validation performed at other institutions. If implemented, vendors will even calculate local performance.
But can you trust it?
Having read a dozen+ proprietary model briefs, I can say the quality of validation (and the underlying assumptions) is highly variable. Also, some vendors are more aggressive than others re subjecting models to peer review. But vendors do privately share validation info with hospitals.
Now for a story.
Let's talk about APACHE, a series of models that help ICUs assess whether their ICU mortality is better or worse than expected based on patient severity. The story comes straight from its developer (Dr. William Knaus).
APACHE was invented in 1978 in response to the unexpected death of a young patient. APACHE I was developed in 582 pts and published in 1981. It was tested in France, Spain, and Finland. It was somewhat complex (requiring 33 physiologic measurements), which limited adoption.
APACHE II reduced the complexity of APACHE (used only 12 physiologic measurements) and adoption was rapid.
The 1st problem was that carrying out this international effort to standardize quality measurement was expensive. A company was formed and $ was raised from venture capital.
The 2nd problem was that poor performers doubted the accuracy of APACHE II.
Solution? APACHE III.
APACHE III improved the AUC from 0.86 (APACHE II) to 0.90. It also addressed issues specific to surgery, trauma, comatose status, etc.
But unlike APACHE II, APACHE III was proprietary.
And it cost money, which led to an investigation re: misuse of funds.
Many ICU physicians were also not pleased with the prospect of paying for the score.
When he explained the cost required to run the company and calculate the scores, Dr. Knaus was told to go "get more grants," which wasn't really an option.
...then APACHE got bought by Cerner.
Cerner is one of the two largest EHR vendors in the US (alongside Epic). Since ICUs generally found APACHE III useful but didn't want to pay for it, it seems ideal that they got bailed out by Cerner, right?
Kind of like how Microsoft bailed out GitHub?
...so what did Cerner do?
Cerner unveiled.... *drumroll*
APACHE IV!
Features:
- better calibrated than APACHE III
- more complex than APACHE II/III
"Also we recommend APACHE II no longer be used..."
So how complex was it?
APACHE IV is so complex that centers often perform manual chart validation to confirm that the elements going into the model are accurate.
Also, conveniently, APACHE IV isn't integrated with the Epic EHR (hmm wonder why?)
Meanwhile, in non-proprietary land, the SAPS-3 model tried to resurrect the simplicity of APACHE II -> simple, but not as good (ncbi.nlm.nih.gov/pmc/articles/P…)
Also, Epic introduced a proprietary ICU mortality prediction model that appears to emulate APACHE IV and is easy to integrate.
So which would you use?
- a complex proprietary model owned by Cerner (APACHE IV)
- a simple prediction model (MPM-3) also owned by Cerner
- the proprietary Epic ICU mortality model
- the non-proprietary SAPS-3 (performed worse in an independent validation)
- outdated APACHE II
Dr. Knaus, inventor of APACHE, has this to say in a footnote on the MDCalc page for APACHE II (mdcalc.com/apache-ii-score):
"In retrospect, if we had known the future was going to be as limited in the development of health IT, I think we would've said, let's stay with APACHE II."
If you work in an ICU, I'd love to know:
What does your ICU actually use to measure how well it is doing in terms of expected vs. observed mortality?
So what's the moral of the story?
Proprietary models are here to stay (for now), but we urgently need to adopt mechanisms to disseminate and operationalize open-source models in the EHR. This is available in some EHRs but not all, and it's completely different for each EHR.
Closing thoughts (1/2): We can d/l our patient records today bc of the Blue Button and @myopennotes initiatives.
Closing thoughts (2/2): I'll go further and say that we need an OpenModel initiative that allows prediction models to interface in a consistent manner with all EHRs. Not just PMML (a model format) but communication standards.
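To make the PMML half of that concrete, here's a minimal sketch (my own illustration, not part of any existing initiative) of exporting an open model to PMML with the sklearn2pmml package; the variables, data, and file name are all made up:

```python
# Minimal sketch: export an open prediction model to PMML so that anything
# that speaks PMML could, in principle, score it. Assumes the sklearn2pmml
# package (which needs Java for the conversion step); all variables are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Hypothetical training data: a few physiologic variables and a mortality label
df = pd.DataFrame({
    "age":        [67, 54, 81, 45],
    "heart_rate": [110, 82, 95, 70],
    "creatinine": [2.1, 0.9, 1.4, 1.0],
    "died":       [1, 0, 1, 0],
})

pipeline = PMMLPipeline([("classifier", LogisticRegression())])
pipeline.fit(df[["age", "heart_rate", "creatinine"]], df["died"])

# Serialize the fitted pipeline to a PMML file another system could import.
sklearn2pmml(pipeline, "icu_mortality_model.pmml")
```

The format part is the (relatively) easy part; the missing piece the thread is pointing at is a standard way for the EHR to call such a model and get a score back.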
Without it, the future consists mostly of proprietary models.
I had the opportunity to work with @AndrewLBeam and @bnallamo on an editorial sharing views on the reporting of ML models in this month’s @CircOutcomes. First, read the guidelines by Stevens et al. Our editorial addresses what they can and can’t fix.
Everyone (including Stevens et al.) acknowledges that larger efforts to address this problem are underway, including the @TRIPODStatement’s TRIPOD-ML and CONSORT-AI for trials.
The elephant in the room: What is clinical ML and what is not? (an age-old debate)
What’s in a name clearly matters. Calling something “statistics” vs “ML” affects how the work is viewed (esp by clinical readers) and (unfairly) influences editorial decisions.
How can readers focus on the methods when they can’t assign a clear taxonomy to the methods?
Since my TL is filled with love letters to regression, let's talk about the beauty of random forests. Now maybe you don't like random forests, or don't use them, or are afraid of using them due to copyright infringement.
Let's play a game of: It's Just a Random Forest (IJARF).
Decision trees: random forest or not?
Definitely a single-tree random forest with mtry set to the number of predictors and pruning enabled.
IJARF.
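A quick sanity check (my sketch, not from the thread) in scikit-learn: with a single tree, no bootstrap resampling, and mtry set to all predictors, a "random forest" makes the same calls as a plain decision tree.

```python
# Sketch: a "random forest" with one tree, no bootstrap resampling, and
# mtry = all predictors collapses into an ordinary decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
one_tree_forest = RandomForestClassifier(
    n_estimators=1,     # a single tree
    bootstrap=False,    # fit on the full sample, not a bootstrap resample
    max_features=None,  # consider every predictor at every split (mtry = p)
    random_state=0,
).fit(X, y)

# Fraction of identical predictions (1.0 here, since both trees are grown fully)
print((tree.predict(X) == one_tree_forest.predict(X)).mean())
```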
Boosted decision trees: random forest or not?
Well, if you weight the random forest trees differently, keep their depth shallow, maximize mtry, and grow them sequentially to minimize residual error, then a GBDT is just a type of random forest.
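If the "grow them sequentially to minimize residual error" part is the unfamiliar bit, here's a toy from-scratch boosting loop for regression (a sketch with arbitrary settings and squared-error loss, not how any production GBDT library does it):

```python
# Toy gradient boosting: shallow trees fit one after another to the residuals
# of the current ensemble, each contribution shrunk by a learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.1
n_trees = 100
prediction = np.full(len(y), y.mean())  # start from the mean outcome
trees = []

for _ in range(n_trees):
    residuals = y - prediction                      # what the ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # shallow tree
    prediction += learning_rate * tree.predict(X)   # sequential, weighted update
    trees.append(tree)

print("training RMSE:", np.sqrt(np.mean((y - prediction) ** 2)))
```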
Why did we do this? What does it mean? Is the Epic deterioration index useful in COVID-19? (Thread)
Shortly after the first @umichmedicine COVID-19 patient was admitted in March, we saw rapid growth in the # of admitted patients with COVID-19. A COVID-19-specific unit was opened (uofmhealth.org/news/archive/2…) and projections of hospital capacity looked dire:
.@umichmedicine was considering opening a field hospital (michigandaily.com/section/news-b…) and a very real question arose of how we would make decisions about which COVID-19 patients would be appropriate for transfer to a field hospital. Ideally, such patients would not require ICU care.
I’ll be giving a talk on implementing predictive models at @HDAA_Official on Oct 23 in Ann Arbor. Here’s the Twitter version.
Model developers have been taught to carefully think thru development/validation/calibration. This talk is not about that. It’s about what comes after...
But before we move onto implementation, let’s think thru what model discrimination and calibration are:
- discrimination: how well can you distinguish higher from lower risk people?
- calibration: how close are the predicted probabilities to reality?
... with that in mind ...
Which of the following statements is true?
A. It’s possible to have good discrimination but poor calibration.
B. It’s possible to have good calibration but poor discrimination.
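(Both are true, by the way.) Statement A is easy to demonstrate with a sketch of my own, not from the talk: apply a monotonic transform to a reasonably calibrated model's probabilities. Patient ranking, and hence AUC, is unchanged, but the probabilities no longer mean what they say.

```python
# Sketch: a monotonic transform of predicted probabilities preserves
# discrimination (ROC AUC) but destroys calibration (here, Brier score).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

p = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
p_bad = p ** 4  # same ranking of patients, very different probabilities

print("AUC   original vs distorted:", roc_auc_score(y_te, p), roc_auc_score(y_te, p_bad))
print("Brier original vs distorted:", brier_score_loss(y_te, p), brier_score_loss(y_te, p_bad))
```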
The DeepMind team (now “Google Health”) developed a model to “continuously predict” AKI within a 48-hr window with an AUC of 92% in a VA population, published in @nature.
Did DeepMind do the impossible? What can we learn from this? A step-by-step guide.
The 2016 @CJASN paper used logistic regression and the 2018 paper used GBMs.
The 2016 CJASN paper is particularly relevant because it was also modeled on a national VA population. Although the two papers used different modeling approaches, one key similarity is how the data are prepared: using a discrete-time survival method.
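If "discrete-time survival method" is unfamiliar: the data-prep idea is to turn each patient into one row per time window, labeled for whether the event happens within the prediction horizon of that window. A minimal pandas sketch (the 24h windows, 48h horizon, and column names are all illustrative, not taken from either paper):

```python
# Sketch of discrete-time survival data prep: one row per patient per time
# window, labeled 1 if the event (e.g., AKI) occurs within the next 48 hours.
import pandas as pd

WINDOW_HOURS = 24    # how often a prediction is made
HORIZON_HOURS = 48   # how far ahead the model predicts

# Hypothetical input: one row per patient, with follow-up and event times.
patients = pd.DataFrame({
    "patient_id":     [1, 2],
    "followup_hours": [96, 72],
    "event_hours":    [60, None],  # hours until AKI; None = never observed
})

rows = []
for p in patients.itertuples():
    t = 0
    while t < p.followup_hours:
        has_event = pd.notna(p.event_hours)
        rows.append({
            "patient_id": p.patient_id,
            "window_start": t,
            "aki_within_48h": int(has_event and t < p.event_hours <= t + HORIZON_HOURS),
        })
        if has_event and p.event_hours <= t + WINDOW_HOURS:
            break  # stop generating windows once the event has occurred
        t += WINDOW_HOURS

person_period = pd.DataFrame(rows)
print(person_period)
```

Each row then gets the patient's time-varying features as of window_start, and a binary classifier trained on these rows approximates the discrete-time hazard.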