For a certain type of cancer there are two treatments: A and B. From randomized trials it is known that A leads to better overall survival. There is no known variation in treatment effect among patient subgroups.
(2/)
Treatment A takes longer than B and has more side effects, so it’s only recommended for patients with a >10% chance of surviving 1 year. In current practice, the 1-year survival is estimated using covariates X.
(3/)
Some researchers think that biomarker Z also predicts overall survival and build an awesome prediction model for survival using X and Z. Big success! On several big external datasets, survival is better predicted using X and Z than using X alone. Time to take this to the clinic!
(4/)
However, because they conducted a prediction study and not a conditional average treatment effect study, the researchers do not notice that Z is associated with worse survival on average but, at the same time, with a higher relative efficacy of treatment A.
(5/)
Because Z is a new biomarker, treatments were historically assigned independently of Z, and for the same X, patients with higher Z have worse survival. For some patients with high Z, the old model (using X only) would predict a >10% survival probability, while the new model (using X and Z) predicts <10%.
(6/)
Based on the new model, these ‘cross-over’ patients would no longer be treated with A, even though A is even more advantageous for them than for other patients. Ultimately, using the new prediction rule leads to WORSE clinical outcomes, despite ACCURATE predictions.
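To make the mechanism concrete, here is a minimal simulation sketch in Python (not from the original thread). All specifics are invented for illustration: the coefficients, the 30% prevalence of Z, and the way historical treatment depended on X. Only the qualitative structure comes from the story above: Z worsens survival but raises the relative benefit of A, and historical treatment was independent of Z.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500_000

x = rng.normal(size=n)            # known covariates X (one-dimensional for simplicity)
z = rng.binomial(1, 0.3, size=n)  # new biomarker Z (e.g. aggressive tumour), independent of X

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def p_survive(x, z, a):
    # Hypothetical truth: Z lowers survival, but the benefit of treatment A (a=1) is larger when Z=1.
    return sigmoid(-0.5 + x - 3.5 * z + a * (1.0 + 2.5 * z))

# Historical practice: receiving A depended on X (fitter patients treated more often),
# was independent of Z, and was not fully deterministic.
pi_a = sigmoid(1.0 + x)

def p_observed(x, z):
    # Survival as observed under historical practice, marginalized over treatment received.
    return pi_a * p_survive(x, z, 1) + (1 - pi_a) * p_survive(x, z, 0)

p_new_model = p_observed(x, z)                                 # perfectly accurate model using X and Z
p_old_model = 0.7 * p_observed(x, 0) + 0.3 * p_observed(x, 1)  # perfectly accurate model using X only

# Both rules: recommend A when predicted 1-year survival exceeds 10%.
a_old_rule = (p_old_model > 0.10).astype(float)
a_new_rule = (p_new_model > 0.10).astype(float)

# Cross-over patients: recommended A by the old rule but not by the new rule.
cross = (a_old_rule == 1) & (a_new_rule == 0)
print(f"cross-over fraction: {cross.mean():.1%}, of whom Z=1: {z[cross].mean():.0%}")
print(f"their survival with A: {p_survive(x, z, 1)[cross].mean():.2f}, "
      f"without A: {p_survive(x, z, 0)[cross].mean():.3f}")

# Expected survival of the whole population if treatment follows each rule.
print(f"mean survival, old rule (X only):  {p_survive(x, z, a_old_rule).mean():.4f}")
print(f"mean survival, new rule (X and Z): {p_survive(x, z, a_new_rule).mean():.4f}")
```

Note that both prediction models in this sketch are perfectly accurate for observed survival under historical practice; the harm comes entirely from how the predictions are turned into treatment decisions.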
(7/)
So what's wrong here? The heart of the problem is that the prediction model is the right answer to the wrong question.
(8/)
The prediction model answers: what is the probability of survival, given X and Z? The question driving treatment decisions is: “would treatment A lead to a >10% chance of 1-year survival, given what we know (X and Z) about this patient?”
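In symbols (my notation, not the thread's): the new model estimates P(Y = 1 | X, Z) as observed under historical treatment practice, whereas the decision needs P(Y = 1 | X, Z, do(treatment = A)) > 0.10.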
(9/)
Without appropriate appreciation of the causal nature of the decision that the prediction model is trying to inform, things can go terribly wrong! So what should be done?
1. The causal assumptions underlying the decisions that a prediction model is meant to inform should be made explicit.
(10/)
2. If possible, the actual question (“what’s the probability of outcome Y if we give treatment A or B, given that we know X and Z?”) should be targeted, although this admittedly is not an easy question to answer, as it requires causal assumptions and/or RCT data.
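As a sketch of what targeting the actual question could look like, again under the invented data-generating process from the simulation above: simulate RCT data, fit an outcome model that includes treatment and a treatment-by-Z interaction, and base the recommendation on predicted survival under treatment A. The sample size, model form, and sklearn-based fit are all illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(size=n)
z = rng.binomial(1, 0.3, size=n)
a = rng.binomial(1, 0.5, size=n)       # randomized treatment assignment (RCT)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def p_survive(x, z, a):
    # Same hypothetical truth as in the earlier sketch.
    return sigmoid(-0.5 + x - 3.5 * z + a * (1.0 + 2.5 * z))

y = rng.binomial(1, p_survive(x, z, a))  # observed 1-year survival in the trial

def design(x, z, a):
    # Main effects plus a treatment-by-biomarker interaction.
    return np.column_stack([x, z, a, a * z])

outcome_model = LogisticRegression(max_iter=1000).fit(design(x, z, a), y)

# Rule targeting the actual question: recommend A when predicted survival *if given A* exceeds 10%.
p_if_treated_a = outcome_model.predict_proba(design(x, z, np.ones(n)))[:, 1]
a_causal_rule = (p_if_treated_a > 0.10).astype(float)

print(f"A recommended for {a_causal_rule.mean():.1%} of all patients, "
      f"and for {a_causal_rule[z == 1].mean():.1%} of Z=1 patients")
print(f"mean survival, treatment-aware rule: {p_survive(x, z, a_causal_rule).mean():.4f}")
```

In this toy setup the treatment-aware rule keeps recommending A to nearly all high-Z patients, and its population survival should come out no worse than under the original X-only rule, unlike the ‘better’ prediction model above.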
(11/)
Note: this is not an unreasonable scenario for cancer, as aggressive / fast-growing (= Z) tumors frequently respond better to treatments like chemotherapy or radiotherapy, e.g. non-seminoma vs. seminoma testicular cancer.
Post-script: I thought this example was interesting enough to share. It’s not meant as an attack on prediction research, though I do think the causal dimension of prediction is often under-appreciated. Curious to hear your thoughts about this! @MaartenvSmeden @f2harrell @eliasbareinboim