Frank Hutter
Jan 8 · 19 tweets · 6 min read
The data science revolution is getting closer. TabPFN v2 is published in Nature: on tabular classification with up to 10k data points and 500 features, TabPFN takes 2.8 s and on average outperforms all other methods, even when they are tuned for up to 4 hours 🧵1/19 nature.com/articles/s4158…
2/19 Two years ago, I tweeted about TabPFN v1 “This may revolutionize data science”. I meant this as “This line of work may revolutionize tabular ML”. We’re now a step closer. Like every model, TabPFN v2 will have failure modes, but it brings us closer to that promise.
3/19 TabPFN v1 was an eye-opener about the potential of in-context learning for classification, but it had many failure modes. With v2, we improve classification and extend the capabilities to regression (where, in 4.8 seconds, it beats baselines tuned for 4 hours).
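If you want to try this yourself: the released package exposes scikit-learn-style estimators (per the project README; treat the exact import path as an assumption). A minimal sketch, with a scikit-learn stand-in so the snippet runs even where `tabpfn` is not installed:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

try:
    # The published package, if installed (see the PriorLabs GitHub repo).
    from tabpfn import TabPFNClassifier as Classifier
except ImportError:
    # Stand-in so the sketch runs anywhere; swap back to TabPFNClassifier.
    from sklearn.ensemble import RandomForestClassifier as Classifier

# 569 rows x 30 features: comfortably inside TabPFN's size envelope.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = Classifier()                 # no hyperparameter tuning needed for TabPFN
clf.fit(X_train, y_train)          # for TabPFN: context setup, not gradient training
proba = clf.predict_proba(X_test)  # predictive distribution per test row
acc = float((proba.argmax(axis=1) == y_test).mean())
```

The point of the interface: `fit` stores the training table as context, and a single forward pass yields class probabilities for all test rows.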
4/19 How does TabPFN work? It is trained on 130 million synthetic tabular prediction datasets to perform in-context learning and to output predictive distributions. Each dataset is one meta-datapoint for training the TabPFN weights with SGD.
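To make “each dataset is one meta-datapoint” concrete, here is a toy sketch of sampling a single synthetic classification dataset (my illustration only — the actual prior in the paper is far richer than a noisy linear mechanism). TabPFN’s weights are trained across millions of such draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_dataset(n_rows=128, n_features=8, n_classes=3):
    """Sample one toy synthetic classification dataset: a hidden random
    linear mechanism plus noise, with labels from quantile bins of the
    latent score. One such dataset = one meta-datapoint for SGD."""
    X = rng.standard_normal((n_rows, n_features))
    w = rng.standard_normal(n_features)               # hidden mechanism
    score = X @ w + 0.1 * rng.standard_normal(n_rows)
    edges = np.quantile(score, np.linspace(0, 1, n_classes + 1)[1:-1])
    y = np.digitize(score, edges)                     # labels in 0..n_classes-1
    return X, y

X, y = sample_synthetic_dataset()
```

During meta-training, the model sees (X, y) with some labels masked, and the loss is how well it predicts the masked labels in-context — no per-dataset gradient steps at inference time.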
5/19 In contrast to TabPFN v1, we now natively support categorical features: TabPFN v2 performs just as well on datasets with and without them.
6/19 In contrast to TabPFN v1, we now natively support missing values: TabPFN v2 performs just as well on datasets with and without them.
7/19 In contrast to TabPFN v1, we now natively support uninformative features. While these throw off standard neural nets (see the MLP in the figure), TabPFN v2 handles them naturally.
8/19 TabPFN v2 handles outliers well. Note how dramatically standard MLPs are affected by them.
9/19 TabPFN v2 performs as well with half the data as the next-best baseline classifier (CatBoost) does with all of it. This could be huge for applications where data is very scarce (rare diseases, clinical studies, etc.).
10/19 How does TabPFN v2 fare against ensembles of tuned models? We compared it to the state-of-the-art AutoML system AutoGluon 1.0. Standard TabPFN already outperforms AutoGluon on classification, and ensembling multiple TabPFNs, TabPFN v2 (PHE), is even better.
11/19 For regression, standard TabPFN v2 performs on par with AutoGluon, but TabPFN v2 (PHE) still performs better.
12/19 Qualitative experiments complement the quantitative ones: TabPFN v2 models simple functions more faithfully than baseline methods. The orange points indicate the training data; the blue points are the predictions.
13/19 TabPFN v2 has many foundation-model properties. Most importantly, it can be fine-tuned to a specific type of dataset (here: sine curves), just like Llama can be fine-tuned to your text data.
14/19 TabPFN v2 still has some downsides. While it is very fast to train and does not require hyperparameter tuning, inference is slow. There are many ways to handle this better in the future, and we’re actively working on it.
15/19 TabPFN v2 is also only made for datasets up to 10k data points and 500 features. It may do well for some larger datasets, but scaling it up further has not been our focus so far. We’re actively working on this now.
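Given that stated envelope, a simple pre-flight check before handing a table to the model can save surprises. The thresholds below just restate the limits from this thread; the helper name is my own:

```python
import numpy as np

# Limits quoted in this thread for TabPFN v2's intended operating range.
MAX_ROWS, MAX_FEATURES = 10_000, 500

def fits_tabpfn_envelope(X) -> bool:
    """True if the table is within TabPFN v2's recommended size envelope."""
    n_rows, n_features = X.shape
    return n_rows <= MAX_ROWS and n_features <= MAX_FEATURES

small = np.zeros((5_000, 100))    # within the envelope
large = np.zeros((50_000, 100))   # too many rows
```

For larger tables, one would subsample rows or select features before calling the model, at some cost in accuracy.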
16/19 We are releasing TabPFN under an open license: a derivative of the Apache 2 license with a single modification, adding an enhanced attribution requirement inspired by the Llama 3 license: github.com/PriorLabs/TabP…
17/19 We are also releasing an API to allow computation to happen on our GPUs rather than yours. If you are happy to share your data with us, please use this to help us improve our model: github.com/PriorLabs/tabp…
18/19 Finally, we are committed to building an ecosystem around TabPFN and have created a repository for community contributions. We’re excited about your contributions ❤️ github.com/PriorLabs/tabp…
19/19 Overall, we are super excited about TabPFN and hope that it is useful for you, too. We are eager to hear your feedback on our Discord channel. Let’s build something great together 🚀 discord.com/invite/VJRuU3b…


