The presenter puts up a slide showing “random forest variable importance.” You know the one...
The sideways bar plot.
Says “only showing the top 20 variables here...” to highlight the high-dimensional power of random forests.
The slide is awkwardly wide-screen. Everyone squints.
A clinician in the front row exclaims, “Wow, that makes so much sense!”
Silence.
Then someone asks, “What do the lengths of the bars mean?”
The presenter starts to answer when someone else butts in, “Does the fact that they are pointing the same direction mean anything?”
The audience stares at the presenter expectantly. Surely there will be a deep explanation.
“The bar length is relative... the amplitude doesn’t have any interpretable meaning. But look at the top 3 variables. Ain’t that something?”
The clinicians exhale and whisper in reverence.
Ask yourself:
- What *do* the bars mean?
- Does the ordering of the bars actually matter?
- Is “random forest variable importance” one thing, or many?
- Does the way you construct the trees matter?
Like/RT and I’ll return to this in a bit with a nice paper looking at all of these.
Before we answer the question of what random forest variable importance means, let’s take a look at a random forest variable importance plot created on simulated data. Which is the most important variable?
So which is it?
WRONG!!!
Now what if I told you that these were the variable importance scores assigned by a random forest where the 5 predictors were *simply RANDOM numbers*?
Great, now I have your attention.
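A minimal sketch of the kind of null simulation described here (my assumed setup, not the author’s actual code): five predictors that are pure noise, an outcome that ignores all of them, and a variable importance plot that looks confident anyway.

```r
library(randomForest)
set.seed(42)
n <- 500
x <- as.data.frame(replicate(5, rnorm(n)))  # 5 predictors of pure noise
names(x) <- paste0("X", 1:5)
y <- factor(rbinom(n, 1, 0.5))              # outcome unrelated to every predictor
rf <- randomForest(x, y, ntree = 500)
varImpPlot(rf)  # still produces a tidy, ordered bar plot
```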
Fine, you’re thinking to yourself: that was cheating. What about a situation where only a single variable is correlated with the outcome in simulated data and the other variables are just noise?
Can you find the 1 important variable?
Which one is the most important?
WRONG AGAIN. It’s X2. Again, this is simulated data, constructed so that X2 is the only variable related to the outcome.
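A sketch of the kind of setup that produces this failure, loosely following the Strobl et al. design discussed later in the thread (again my assumptions, not the original code): X2 is a binary predictor that truly drives the outcome, while the noise predictors vary in cardinality. Gini importance tends to favor the high-cardinality noise.

```r
library(randomForest)
set.seed(7)
n <- 500
x <- data.frame(
  X1 = rnorm(n),                              # continuous noise
  X2 = factor(rbinom(n, 1, 0.5)),             # binary; the only real signal
  X3 = factor(sample(4,  n, replace = TRUE)), # 4-level noise
  X4 = factor(sample(10, n, replace = TRUE)), # 10-level noise
  X5 = factor(sample(20, n, replace = TRUE))  # 20-level noise
)
y  <- factor(rbinom(n, 1, ifelse(x$X2 == "1", 0.65, 0.35)))
rf <- randomForest(x, y, ntree = 500)
importance(rf)  # MeanDecreaseGini often ranks the noisy variables above X2
```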
But why?! Why has the random forest failed us? It can’t possibly get worse, can it?
So let’s talk about what went wrong. First, you need to know that there are multiple ways to calculate random forest variable importance.
A few common ones:
1. Number of splits per variable
2. Gini importance
3. Unscaled permutation importance
4. Scaled permutation importance
Which is your go-to RF variable importance measure?
Let’s start with a quick refresher. RF is a set of CART trees where each tree is made up of nodes containing variable splits. Each tree is built from a bootstrap of the data, and a set of F variables is randomly selected at each node.
After evaluating a set of splits from these F variables, the “best” split is chosen. How? We’ll come back to that.
On a side note, you may be thinking to yourself “Why are you calling it F? Isn’t that mtry?”
In his original paper, Breiman refers to the number of predictors as *m*. When Breiman’s code was ported to the #rstats randomForest package, *F* was named mtry, and it’s been mtry ever since.
Why mtry? Because that’s the number of m’s to try. And in his paper, Breiman tried multiple m’s.
Now, assuming that these splits are good, we could just look at how often variables are used in the splitting process. More splits = more importance, right?
Bwhahahahhah no.... This is the null example where all predictors are random numbers.
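If you want to poke at measure (1) yourself, the randomForest package exposes split counts via varUsed(). A minimal sketch, reusing the null-simulation forest (rf) from the earlier snippet:

```r
library(randomForest)
# rf: the forest fit to the 5 pure-noise predictors in the earlier sketch
counts <- varUsed(rf, by.tree = FALSE, count = TRUE)  # splits per variable
names(counts) <- rownames(importance(rf))
sort(counts, decreasing = TRUE)  # split counts alone don't reflect signal
```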
In summary:
Breiman random forest (with bootstrapping): FAIL
Breiman RF with subsampling: FAIL
Conditional inference forests (CIF) with bootstrapping: FAIL
CIF with subsampling: oooh, *success*
So are you using conditional inference forests? Umm probably not.
Why not? Because their implementation in #rstats is SLOW. And remember, it’s not that CIF predicts much better than RF; it simply identifies variable importance better in this example. What about when X2 is the only important variable?
Again, conditional inference forests knock it out of the park when looking at the number of splits per variable, especially when bootstrapping is replaced by subsampling (or “subagging,” as some authors like to say, a term we shall never repeat again).
Why does subsampling work better? Strobl et al explain this by referring to others’ work showing that subsampling generally works better under weaker assumptions. One other advantage of subsampling? SPEED 🏎 🔥
By how much? Subsampling is 25% faster in the Strobl paper for randomForest and nearly twice as fast for cforest. What are the defaults in #rstats?
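For reference, a sketch of those long-standing defaults (x, y, and d here are placeholders, not objects from the thread):

```r
# randomForest draws bootstrap samples by default (replace = TRUE)
randomForest::randomForest(x, y, replace = TRUE)

# party::cforest with cforest_unbiased() uses subsampling by default
# (replace = FALSE, fraction = 0.632)
party::cforest(y ~ ., data = d, controls = party::cforest_unbiased())
```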
Ok, let’s get back on track. What about Gini importance? We know that RFs use Gini “impurity” to choose splits. Why not use it to determine variable importance? Well, remember our opening examples that were all failures? Those were Gini var imp measures.
First of all, what is Gini impurity? It’s a *super easy* thing to calculate: f(1-f) summed over all classes, where f is the proportion of each class.
If your outcome is binary and 30% of people experience an event, then *without a model*, your Gini impurity is 0.3(1-0.3) + 0.7(1-0.7). Those two terms are equal, so you could also just compute one and double it. It’s 0.42.
Another way to calculate this is 1 - 0.3^2 - 0.7^2, but I prefer the first formula because it more easily generalizes to multinomial situations.
Then, you find a perfect split that sends all cases to the left and all non-cases to the right. The Gini impurity of the left child = 1(1-1) + 0(1-0) and of the right child = 0(1-0) + 1(1-1). Summing the left and right children, your new Gini impurity is 0! Woohoo!
So is the Gini importance of that split = 0?
No. It’s equal to 0.42 (value before split) minus 0 (value after split), so impurity reduction (or Gini “gain”) is 0.42. You choose the splits with the highest reduction in Gini impurity.
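That arithmetic, as a tiny sketch (my own toy function, not package code):

```r
# Gini impurity: sum of f * (1 - f) over the class proportions f
gini <- function(f) sum(f * (1 - f))

parent <- gini(c(0.3, 0.7))  # 0.42: impurity with no model
left   <- gini(c(1, 0))      # perfect split: left child is pure, 0
right  <- gini(c(0, 1))      # right child is pure too, 0
parent - (left + right)      # Gini "gain" of the split: 0.42
# (implementations weight the children by node size; both children
#  are 0 here, so the weighting doesn't change this example)
```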
You repeat this process for each variable across all nodes where that variable was used for splitting *and* across all trees. Doing this, you arrive at some large number. That’s the huge importance number you commonly see in your output.
This is often rescaled so the largest value is 1, which is what’s typically included in your output. And it’s often rescaled again so the quantities add up to 100%, so you can rattle off the “percent importance” of a variable.
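Those rescalings are one-liners (a sketch; rf is any fitted randomForest object):

```r
imp <- randomForest::importance(rf)[, "MeanDecreaseGini"]
imp / max(imp)        # rescaled so the largest value is 1
100 * imp / sum(imp)  # rescaled to sum to 100: "percent importance"
```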
So Gini “gain” is calculated just one way, right? Not quite. In this LECTURE (!) by Jerome Friedman, he shows how Breiman’s randomForest pkg uses equation 44, whereas other implementations (eg @h2oai) use the squared version (45). projecteuclid.org/download/pdf_1…
Gini “gain” seems like a natural candidate for identifying important variables, since it’s what the tree “uses” when choosing variables in the first place! Alas, Gini scores can be gamed. High-cardinality categorical variables and continuous variables are able to attain higher Gini gain.
This is why the random number made its way to the top of our variable importance plot here. Because it was random, it essentially had a unique value for each patient.
Before we move onto permutation importance, let me ask you a computational question. Does your computer have enough computational power to model a single CART decision tree consisting of 5 variables and 400,000 rows?
I once worked with a student who did *not* have the computational power to do it. The rpart #rstats package crashed his laptop. So I gave him access to my VM with 48 GB RAM. Crashed again. What do you think happened?
Turns out he left the row identifier in as one of the columns. It was coded as a categorical variable. Because rpart exhaustively searches for splits, it had to consider 400,000 choose 2 (79,999,800,000) splits when it got to the row identifier. That’s bad news.
If he had used the scikit-learn implementation in Python, he would’ve gotten an error because in the year 2020 sklearn RF *still* can’t handle categorical variables.
But one-hot encoding that would’ve created *checks notes* 400,000 dummy variables, so yeah, the lesson is: never include your row identifier variable as a predictor.
Lastly, what is permutation importance in an RF? Well, it’s slightly complicated. Remember how you built bootstrapped or subsampled trees? Well, the remainder of your data for each tree is the out-of-bag (oob) data.
To calculate permutation importance, you first calculate the oob misclassification error. Then you permute variables one at a time. When a variable is important, permuting it wreaks havoc. Otherwise, nothing much changes.
Subtracting the original error from the mean permuted error for each variable in the oob data gives you the variable importance. This is called the unscaled permutation importance.
Since the permuted error may vary quite a bit across trees, Breiman argued that this value should be divided by its standard error to “scale” it. In fact, this is the default setting in the randomForest #rstats pkg.
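In code (a sketch on a built-in dataset, not the thread’s simulation), both flavors are available from the same fit:

```r
library(randomForest)
# importance = TRUE is required to get permutation importance at fit time
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf, type = 1, scale = TRUE)   # scaled permutation importance (default)
importance(rf, type = 1, scale = FALSE)  # unscaled permutation importance
```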
Both types of permutation importance seem to work well. Here’s an example of scaled permutation importance when X2 is the only important variable.
So, we should just go with permutation importance right? Well, remember how you chose to use RFs because they are great at handling multicollinearity? It’s about to bite you in the backside. Why?
Because when you permute a variable that is correlated with another variable, it *looks* as though that variable isn’t important, because the correlated variable carries similar information. This is both a pro and a con, but mostly a con when dealing with 100s of predictors.
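A minimal sketch of that dilution effect (my assumed setup): X1 drives the outcome and X1b is a near-copy of X1, so permuting either one leaves its twin to carry the signal.

```r
library(randomForest)
set.seed(3)
n   <- 500
X1  <- rnorm(n)
X1b <- X1 + rnorm(n, sd = 0.2)  # nearly collinear with X1
X3  <- rnorm(n)                 # independent noise
y   <- factor(rbinom(n, 1, plogis(2 * X1)))  # only X1 matters
rf  <- randomForest(data.frame(X1, X1b, X3), y, importance = TRUE)
importance(rf, type = 1)  # X1 and X1b share (and dilute) the credit
```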
So you’ve made it this far. Wow. Where to go next? Read about the minimal depth and maximal subtree variable importance methods from the illustrious Hemant Ishwaran (author of the randomForestSRC package).
But most importantly, think a bit more critically the next time you see one of these plots. That “most important variable” might just be a random number. /Fin
I had the opportunity to work with @AndrewLBeam and @bnallamo on an editorial sharing views on the reporting of ML models in this month’s @CircOutcomes. First, read the guidelines by Stevens et al. Our editorial addresses what they can and can’t fix.
First, everyone (including Stevens et al.) acknowledges that larger efforts to address this problem are underway, including the @TRIPODStatement’s TRIPOD-ML and CONSORT-AI for trials.
The elephant in the room is: What is clinical ML and what is not? (an age-old debate)
What’s in a name clearly matters. Calling something “statistics” vs “ML” affects how the work is viewed (especially by clinical readers) and (unfairly) influences editorial decisions.
How can readers focus on the methods when they can’t assign a clear taxonomy to the methods?
Since my TL is filled with love letters to regression, let's talk about the beauty of random forests. Now maybe you don't like random forests, or don't use them, or are afraid of using them due to copyright infringement.
Let's play a game of: It's Just a Random Forest (IJARF).
Decision trees: random forest or not?
Definitely a single-tree random forest with mtry set to the number of predictors and pruning enabled.
IJARF.
Boosted decision trees: random forest or not?
Well, if you weight the random forest trees differently, keep their depth shallow, maximize mtry, and grow them sequentially to minimize residual error, then a GBDT is just a type of random forest.
Why did we do this? What does it mean? Is the Epic deterioration index useful in COVID-19? (Thread)
Shortly after the first @umichmedicine COVID-19 patient was admitted in March, we saw rapid growth in the # of admitted patients with COVID-19. A COVID-19-specific unit was opened (uofmhealth.org/news/archive/2…) and projections of hospital capacity looked dire:
.@umichmedicine was considering opening a field hospital (michigandaily.com/section/news-b…) and a very real question arose of how we would make decisions about which COVID-19 patients would be appropriate for transfer to a field hospital. Ideally, such patients would not require ICU care.
I’ll be giving a talk on implementing predictive models at @HDAA_Official on Oct 23 in Ann Arbor. Here’s the Twitter version.
Model developers have been taught to carefully think through development/validation/calibration. This talk is not about that. It’s about what comes after...
But before we move onto implementation, let’s think through what model discrimination and calibration are:
- discrimination: how well can you distinguish higher from lower risk people?
- calibration: how close are the predicted probabilities to reality?
... with that in mind ...
Which of the following statements is true?
A. It’s possible to have good discrimination but poor calibration.
B. It’s possible to have good calibration but poor discrimination.
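Spoiler: both are possible. Here’s a base-R sketch of statement A on simulated data (my assumed setup): doubling every log-odds preserves the ranking, so discrimination survives while calibration falls apart.

```r
set.seed(1)
n    <- 10000
lp   <- rnorm(n)                  # true linear predictor
y    <- rbinom(n, 1, plogis(lp))  # outcomes generated from the true risks
pred <- plogis(2 * lp)            # model output: log-odds doubled

# AUC via the rank-sum identity; the ranking is untouched, so it stays high
r  <- rank(pred)
n1 <- sum(y); n0 <- sum(1 - y)
(sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Calibration by deciles: predicted vs observed risk drift apart in the tails
dec <- cut(pred, quantile(pred, 0:10 / 10), include.lowest = TRUE)
cbind(predicted = tapply(pred, dec, mean), observed = tapply(y, dec, mean))
```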
The DeepMind team (now “Google Health”) developed a model to “continuously predict” AKI within a 48-hr window with an AUC of 92% in a VA population, published in @nature.
Did DeepMind do the impossible? What can we learn from this? A step-by-step guide.
The 2016 @CJASN paper used logistic regression and the 2018 paper used GBMs.
The 2016 CJASN paper is particularly relevant because it was also modeled on a national VA population. Although the two papers used different modeling approaches, one key similarity is in how the data are prepared: using a discrete-time survival method.
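The thread doesn’t show the data prep, but a generic discrete-time (person-period) expansion looks something like this sketch on toy data (all names here are made up):

```r
# Each patient contributes one row per time interval at risk; the event
# indicator fires only in the interval where the event actually occurs.
pts <- data.frame(id = 1:3, followup = c(2, 3, 1), event = c(1, 0, 1))
pp  <- do.call(rbind, lapply(seq_len(nrow(pts)), function(i) {
  t <- seq_len(pts$followup[i])
  data.frame(id       = pts$id[i],
             interval = t,
             y = as.integer(pts$event[i] == 1 & t == pts$followup[i]))
}))
pp  # person-period data: feed y to a logistic regression or a GBM
```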
I recently had a manuscript accepted that was outlined, written, and revised entirely in Google Docs. Am writing this thread to address Qs I had during the process in case they are of use to others.
The elephant in the room: how to handle references?
This one is easy: use the PaperPile add-on for Google Docs (not the Chrome extension). It’s free and you can handle citation styles for pretty much every journal.
1. Search paper
2. Click “Cite”
3. Click “Update bib” to format
Do you have a collaborator who likes to make edits in Word?
1. Save the GDoc as a .docx file and share
2. Reimport the revised Word doc with tracked changes into GDocs (as a new file)
3. Accept/reject suggestions in GDocs
Citations still work! (because PaperPile references are roundtrip-safe)