David Robinson
Jul 21, 2021
NEW BLOG POST: Machine learning in a hurry: what I've learned from the #SLICED ML competition

varianceexplained.org/r/sliced-ml/ #rstats (Image: meme)
You can vote for me to go to the playoffs of #SLICED here! forms.gle/1JNC31EAbxTAaf…

To summarize what I've learned so far from #SLICED 🧵
1. The tidymodels #rstats package is a really powerful "pit of success" for machine learning 🔥 tidymodels.org

Perhaps the most important innovation is the combination of recipes (feature engineering) and models into an encapsulated workflow. (Continued)
We know we have to separate train/test data for model fitting. But feature engineering steps like

* normalization
* imputing missing values
* dimensionality reduction

*also* have to be trained! Applying this cleaning to the training+test data together can cause data leakage! 😱
The recipes package solves this

Instead of preprocessing the data yourself, you specify a recipe for cleaning, such that it can be prepared on any training set & applied to any test set

Makes cross validation easy!

(Source: tidymodels.org/start/recipes/) #tidymodels #rstats
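
To make the "prepare on training, apply to test" idea concrete, here is a minimal sketch in plain Python rather than R. It is not the recipes API: `prep` and `bake` are hypothetical helper functions that only borrow the recipes vocabulary, and the data is made up. The point is that the imputation value and the scaling parameters are learned from the training rows alone, then reused unchanged on the test rows.

```python
from statistics import mean, median, pstdev

def prep(train_col):
    """Learn imputation and scaling parameters from the training column only."""
    observed = [x for x in train_col if x is not None]
    fill = median(observed)                      # value used to impute missing entries
    filled = [fill if x is None else x for x in train_col]
    center = mean(filled)
    scale = pstdev(filled) or 1.0                # guard against a constant column
    return {"fill": fill, "center": center, "scale": scale}

def bake(params, col):
    """Apply already-learned parameters to any column (train or test)."""
    filled = [params["fill"] if x is None else x for x in col]
    return [(x - params["center"]) / params["scale"] for x in filled]

# Made-up data: None marks a missing value.
train = [1.0, 2.0, None, 3.0]
test = [None, 100.0]

params = prep(train)                 # nothing from `test` is seen here
train_ready = bake(params, train)
test_ready = bake(params, test)      # test rows reuse the training parameters
```

In cross validation the same pattern would be repeated per fold: call `prep` on that fold's training rows, then `bake` both parts, so the held-out rows never influence the preprocessing.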
2. Gradient boosted trees are absolutely OP for Kaggle competitions on tabular data.

My background is statistics, so I really do like interpretable models where I know the generative process, understand the coefficients, and can calculate p-values

But xgboost PLAYS TO WIN
I've noticed xgboost suffers when the data has many low-value features (e.g. a sparse categorical variable), at least for modest values of learn rate / # of trees.

I think weak features are basically "diluting" the strong ones.

Anyone know of research on this problem?
3. You can often improve on xgboost by stacking it with another model!

Every model has some error. But if 2 models have *different* errors, then averaging them can sometimes make an even better model.

The stacks 📦 is the tidymodels way to do this: github.com/tidymodels/sta… (Source: https://github.com/tidymodels/stacks)
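
A tiny illustration of why averaging can help when two models make different errors. This is a sketch in plain Python with made-up numbers, using a fixed equal-weight average; it is an assumption chosen for simplicity and is not how the stacks package is implemented (a stacking approach would typically learn the weights from held-out predictions).

```python
from math import sqrt

def rmse(pred, truth):
    """Root mean squared error between predictions and true values."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

# Made-up numbers purely for illustration.
truth   = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]   # errors: +2, -2, +3, -3
model_b = [ 9.0, 22.0, 28.0, 42.0]   # errors: -1, +2, -2, +2

# Equal-weight average of the two models' predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(model_a, truth), rmse(model_b, truth), rmse(blend, truth))
```

Here the two models' errors mostly point in opposite directions, so they partly cancel and the blend has a lower RMSE than either model alone. If the errors were strongly correlated, averaging would help much less.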
My screencast on #SLICED ep5, predicting Airbnb prices, is an example of model stacking. I stacked:

* an xgboost model on a few numeric features
* a LASSO model on thousands of text features

to make a model that easily beat either by itself #tidymodels
4. If I make it to the #SLICED playoffs, I need to learn how to meme.

@kierisi, @StatsInTheWild, and others have been crushing me in the chat voting portion of the competition, just because my screencasts didn't include "memes" or a "sense of humor" or "any personality at all"
I like to think it's never too late to learn.

So if you want to see me struggle to be relatable in the #SLICED playoffs, don't forget to vote! forms.gle/1JNC31EAbxTAaf…
