Coder & Research director @inria
►Data, Health, & Computer science
►Python coder, (co)founder of @scikit_learn, joblib, @probabl_ai
►Art: @artgael
â–şPhysics PhD
Jun 3 • 14 tweets • 5 min read
✨ #ICML2024 accepted! CARTE: Pretraining and Transfer for Tabular Learning
Why this is a jump forward for tabular deep learning 🤯, a stepping stone for tabular foundation models 🎉, and a study with much teaching on learning on real tables 👇
1/13 arxiv.org/abs/2402.16785
Teaser: the contribution leads to sizeable improvements compared to many strong baselines, across 51 datasets.
We worked really hard on the baselines, testing many, some being new combinations of tools (with many lessons on neural networks and handling categories)
2/13
Jan 30, 2023 • 10 tweets • 5 min read
⚠️A widespread confusion: calibration of predictors, as measured by expected calibration error, does not fully guarantee that the predictor outputs the true probabilities P(y|X):
A predictor may be overconfident on some individuals and underconfident on others
🧵
1/10
The question is: do confidence scores of predictors correspond to actually controlled probabilities?
This question matters for decisions balancing harm-benefit tradeoffs, e.g. in medicine, and has motivated characterizing the calibration of predictors.
2/10
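The point above can be made concrete with a toy construction (synthetic data, not from the paper): a predictor that scores everyone 0.7 where half the individuals have a true positive rate of 0.9 and the other half 0.5. The binned calibration error is near zero, yet the predictor is underconfident on one subgroup and overconfident on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One flat confidence score for everyone
scores = np.full(n, 0.7)
# Two hidden subgroups of equal size
group = rng.integers(0, 2, size=n)
# Group 0: true P(y=1) = 0.9 (predictor underconfident)
# Group 1: true P(y=1) = 0.5 (predictor overconfident)
p_true = np.where(group == 0, 0.9, 0.5)
y = rng.random(n) < p_true

# Expected calibration error: here all scores fall in one bin at 0.7,
# where the observed positive rate averages out to ~0.7
ece = abs(y.mean() - scores.mean())
print(f"overall ECE:  {ece:.3f}")                          # close to 0
print(f"group 0 gap: {y[group == 0].mean() - 0.7:+.2f}")   # ~ +0.20
print(f"group 1 gap: {y[group == 1].mean() - 0.7:+.2f}")   # ~ -0.20
```

The mixing of the two subgroups inside one score bin is exactly what a marginal calibration metric cannot see.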
Jan 17, 2023 • 5 tweets • 3 min read
Our benchmark of tree-based models vs deep learning for tabular data: final version.
TL;DR: on a small compute budget, @scikit_learn's HistGradientBoosting is best. With finer tuning of hyperparams, XGBoost brings a gain (here n ranges from 3,000 to 10,000)
For large datasets (> 10,000), the picture differs slightly:
▸ Classification: deep learning brings benefits for limited compute power
▸ Regression: XGBoost always outperforms @scikit_learn's HistGradientBoosting
(these are relative units)
Oct 13, 2022 • 4 tweets • 2 min read
I make about 3200€ net / month (2760€ after tax) + yearly bonus ~ 6000€ (correcting previous tweet).
I'm a research director (tenured prof equivalent), 13 years after PhD, with (I think) a good track record.
Why I think that my salary is not too low👇
First, 80% of French workers earn less than I do. So I am, all in all, privileged insee.fr/fr/statistique…
(we typically get these numbers wrong, so a reality check is useful) 2/3
Jul 19, 2022 • 11 tweets • 4 min read
⚡️Preprint: Why do tree-based models still outperform deep learning on tabular data?
We give solid evidence that, on tabular data, achieving good prediction is easier with tree methods than deep learning (even modern architectures) and explore why hal.archives-ouvertes.fr/hal-03723551
1/9
We make explicit what differentiates tabular data from signals (heterogeneity of columns) and select 45 open datasets, defining a standard benchmark.
We study average performance as a function of hyperparameter tuning budget: tree methods give best performance with less tuning.
2/9
May 4, 2021 • 5 tweets • 3 min read
New paper in @PLOSCompBiol: Extracting representations of cognition across neuroimaging studies improves brain decoding
A deep model, for transfer learning, decoding a completely new study via universal latent representations of cognition journals.plos.org/ploscompbiol/a… 1/5
The challenge we address is that small cognitive neuroimaging studies address precise cognitive questions but typically suffer from low statistical power.
We accumulate data across studies to improve the statistical performance of decoding in each study. Smaller studies benefit most.
2/5
Mar 5, 2021 • 8 tweets • 3 min read
New preprint: Accounting for Variance in Machine Learning Benchmarks
Led by @bouthilx and @Mila_Quebec friends
We show that ML benchmarks contain multiple sources of uncontrolled variation, not only weight inits. We propose a procedure for reliable conclusions 1/8 arxiv.org/abs/2103.03098
Data split and hyper-parameter selection (even with fancy hyper-parameter optimization) appear as the leading source of arbitrary variations in ML benchmarks, beyond random weight init.
These must be sampled to give empirical evidence on algorithm comparisons that generalizes 2/8
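A minimal illustration of the split-induced variance (toy data and model, not the paper's protocol): keeping the model and the data fixed, resampling only the train/test split already moves the reported score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Same data, same model: vary only the random train/test split
split_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(f"accuracy: {np.mean(split_scores):.3f} +/- {np.std(split_scores):.3f}")
```

A benchmark that fixes one split reports a single draw from this distribution; sampling splits (and hyperparameter-search seeds) is what turns a point estimate into evidence.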
Goal: predict with various missing mechanisms
Thread 1/5
The intuition: as features go missing, the best predictor must use the covariances between features to compensate, adjusting the slopes on the observed features.
Classic approach: fitting a probabilistic model with EM.
Its limitations: it requires a model of the missing mechanism & becomes intractable for large p 2/5
Feb 1, 2020 • 8 tweets • 3 min read
Even for science and medical applications, I am becoming weary of fine statistical modeling efforts, and believe that we should standardize on a handful of powerful and robust methods.
Given two sets of observations, how do we know if they are drawn from the same distribution? Short answer in the thread…
For instance, do McDonald’s and KFC use different logic to position restaurants? Difficult question! We have access to data points, but not the underlying generative mechanism, governed by marketing strategies.
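One standard answer to the question posed above is a two-sample test; a minimal sketch with the Kolmogorov–Smirnov test on 1D toy samples (the synthetic "restaurant feature" here is an illustrative assumption, and the thread's own recommended method may well differ):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-ins for one 1D feature of each chain's locations,
# e.g. distance to the city center
a = rng.normal(0.0, 1.0, size=1_000)
b = rng.normal(0.5, 1.0, size=1_000)  # shifted distribution

# Two-sample KS test: null hypothesis = same underlying distribution
stat, p_value = ks_2samp(a, b)
print(f"KS statistic {stat:.2f}, p-value {p_value:.4f}")
```

A small p-value rejects "same distribution" without ever modeling the generative mechanism, which is exactly the setting of the question: data points in hand, marketing strategies unknown.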