Bojan Tunguz
Feb 17 · 27 tweets · 6 min read
A few weeks ago I came across a tweet by a prominent ML/AI developer and researcher that promoted a new post about the use of transformer-based neural networks for tabular data classification.

keras.io/examples/struc…

I took a look. Here is what I found. A 🧵 👇 1/27
The post was on Keras’ official site, and it seemed like a good opportunity to learn how to build transformers with Keras, something that I’ve been meaning to do for a while. However, one part of the post and the tweet bothered me. 2/27
It claimed that the model matched “the performance of tree-based ensemble models.” As those who know me well know, I am pretty bullish on tree-based ensemble models, 3/27
and have not been convinced at all that the improvements in NN-based models for tabular data over the past few years have really caught up. So I was skeptical, to say the least, but did not have the time to delve deeper into this. 4/27
I finally had more time to take a look last week, and it turns out that my skepticism was justified. And then some. 5/27
Let’s take a look at what the problem is and what dataset we are using. The dataset is the United States Census Income Dataset. The total dataset contains about 48,000 rows, with 5 numerical features and 9 categorical features. 6/27
The dataset is split into a training set of about 32,000 rows and a test set of about 16,000 rows. Here is what the head of the training data looks like: 7/27
We are asked to predict the income bracket. The two income brackets are those making less than or equal to $50K a year, and those who make more than that. This is a classic binary classification problem. 8/27
The first thing we should do is convert the income bracket column to a numerical 0/1 column. The next smart thing to do is to take a look at the target distribution. 9/27
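Here is a minimal sketch of those two steps, assuming the data is loaded straight from the standard UCI distribution (the column names and URLs below are the usual UCI ones; the Keras example may prepare the data slightly differently):

```python
import pandas as pd

CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket",
]

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"
train_df = pd.read_csv(base + "adult.data", header=None, names=CSV_COLUMNS,
                       skipinitialspace=True)
# the test file has a junk first row and a trailing "." on the labels
test_df = pd.read_csv(base + "adult.test", header=None, names=CSV_COLUMNS,
                      skipinitialspace=True, skiprows=1)
test_df["income_bracket"] = test_df["income_bracket"].str.rstrip(".")

# binarize the target: 1 for ">50K", 0 for "<=50K"
y_train = (train_df["income_bracket"] == ">50K").astype(int)
y_test = (test_df["income_bracket"] == ">50K").astype(int)

# class balance: roughly three quarters of the rows fall in the <=50K bracket
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```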
It turns out that our target is imbalanced. Not terribly imbalanced compared to most classification problems I’ve come across, but imbalanced nonetheless. This is not a major problem, but it should guide us in the choice of metric and our modeling approach. 10/27
This is when my alarm bells started going off. In the post the author chose to work with accuracy as the metric. In this particular case just predicting all zeros would give us 76.4% accuracy on the test dataset, which is not far below the 84.5% accuracy of the top transformer-based model. 11/27
My First Model — Logistic Regression with Only Numerical Features + gender

This is a very simple model with no preprocessing, using just the original features and sklearn’s LogisticRegression: 12/27
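Here is a rough reconstruction of that model (it reuses train_df/test_df and y_train/y_test from the loading sketch above; the exact feature list may differ slightly from my original notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

num_cols = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]

X_train = train_df[num_cols].copy()
X_test = test_df[num_cols].copy()

# gender as a single 0/1 indicator column
X_train["gender"] = (train_df["gender"] == "Male").astype(int)
X_test["gender"] = (test_df["gender"] == "Male").astype(int)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(accuracy_score(y_test, lr.predict(X_test)))
```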
So yeah, we get 82.2% accuracy, which is on par with the pretty involved MLP that the post used! 13/27
Then I started one-hot encoding categorical features. With the addition of just two more — race and relationship — I was able to match a sophisticated Transformer model with only a simple Logistic Regression! 14/27
After one-hot encoding most of the categorical features, I was able to get 85.3% accuracy with just a single Logistic Regression. That is almost a whole percentage point higher than the Transformer in the example! 15/27
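Roughly like this (again reusing the DataFrames from the loading sketch; exactly which categorical columns went into the 85.3% run is approximated here):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

num_cols = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
cat_cols = ["workclass", "education", "marital_status", "occupation",
            "relationship", "race", "gender"]

# one-hot encode train and test together so the dummy columns line up
full = pd.concat([train_df, test_df], keys=["train", "test"])
full_ohe = pd.get_dummies(full[num_cols + cat_cols], columns=cat_cols)
X_train_ohe, X_test_ohe = full_ohe.loc["train"], full_ohe.loc["test"]

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_ohe, y_train)
print(lr.score(X_test_ohe, y_test))
```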
After that, I decided to reach for the "big guns." I used just a simple HistGradientBoostingClassifier from sklearn with label encoding. Got 87.3% accuracy! 16/27
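A sketch of that run, with hyperparameters left at their defaults (the stable HistGradientBoostingClassifier import assumes sklearn ≥ 1.0):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

num_cols = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
cat_cols = ["workclass", "education", "marital_status", "occupation",
            "relationship", "race", "gender", "native_country"]

# simple label (ordinal) encoding of the categoricals
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_le = train_df[num_cols].join(
    pd.DataFrame(enc.fit_transform(train_df[cat_cols]), columns=cat_cols, index=train_df.index))
X_test_le = test_df[num_cols].join(
    pd.DataFrame(enc.transform(test_df[cat_cols]), columns=cat_cols, index=test_df.index))

hgb = HistGradientBoostingClassifier()
hgb.fit(X_train_le, y_train)
print(hgb.score(X_test_le, y_test))
```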
Finally — XGBoost. In this example XGBoost did not do much better than a simple HGBC, but nonetheless managed to get a small bump to 87.4%: 17/27
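Something along these lines, reusing the same label-encoded features (the parameters here are illustrative, not the exact ones from my run):

```python
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# reuses X_train_le / X_test_le from the HistGradientBoosting snippet
xgb = XGBClassifier(n_estimators=500, learning_rate=0.05, eval_metric="logloss")
xgb.fit(X_train_le, y_train)
print(accuracy_score(y_test, xgb.predict(X_test_le)))
```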
One More Sanity Check — sklearn MLP. It turns out that even the humble sklearn MLP is up to the task, although it does not beat the logistic regression by much: 18/27
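A sketch of that sanity check (the architecture here is a guess; since MLPs are scale-sensitive, the one-hot features are standardized first):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# reuses the one-hot-encoded X_train_ohe / X_test_ohe from the logistic regression snippet
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0),
)
mlp.fit(X_train_ohe, y_train)
print(mlp.score(X_test_ohe, y_test))
```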
There are a few lessons here.

1. My intention with this post and this work is not to trash Keras, Transformers, or the author of the post. They are all great and important parts of the DS/ML/AI community. 19/27
2. Trust but verify. The whole DS/ML/AI community is wonderful in that most developers and researchers make their code and datasets publicly available. This radical transparency is an amazing asset, and it is such a dramatic departure from traditional research. 20/27
3. Make sure you do some EDA. This can help you understand your data, and guide you in how to formulate your problem. 21/27
4. Build a simple linear baseline. That model will go a long way in telling you how much info is there in the raw data, and how much extra info your subsequent, more sophisticated nonlinear models will be able to uncover. 22/27
5. When it comes to tabular data, Gradient Boosted Trees rule. They are simple to use, straightforward to train, and have a great amount of predictive power. Using more sophisticated models can help, but there should really be an overwhelmingly strong case in their favor. 23/27
6. There have been many claims that advanced NNs for tabular data problems perform as well as, or allegedly outperform, GBTs. However, it is very hard to assess the overall veracity of those claims. 24/27
When Kaggle was still primarily about tabular data problems, those claims could have been subjected to a swift and brutal test in competitive Kaggle competitions. These days we don’t have such a robust verification mechanism. 25/27
Thank you for making it this far. This thread is also a post on Medium:

medium.com/@tunguz/about-… 26/27
And all of my code can be found in the following GitHub repo:

github.com/tunguz/UCI_Cen… 27/27
