Goal: predict under various missingness mechanisms
Thread 1/5
The intuition: as features go missing, the best predictor must use the covariances between features to adjust the slopes of the observed features (see the sketch after this tweet).
Classic approach: fitting a probabilistic model with EM.
Its limitations: it requires a model of the missingness mechanism & becomes intractable for large p (many features) 2/5
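To make the covariance intuition concrete, here is a hedged sketch of the optimal (Bayes) predictor in a linear-Gaussian setting; the notation (β, μ, Σ, the obs/mis index sets) is mine, not from the thread, and the closed form assumes Gaussian data and a mechanism such as MCAR:

```latex
% Linear model: y = \beta^\top x + \varepsilon, with x \sim \mathcal{N}(\mu, \Sigma).
% Split x into observed (obs) and missing (mis) coordinates. Conditioning on
% the observed features gives the Bayes predictor:
\[
\mathbb{E}[y \mid x_{\mathrm{obs}}]
  = \beta_{\mathrm{obs}}^{\top} x_{\mathrm{obs}}
  + \beta_{\mathrm{mis}}^{\top}\!\left(
      \mu_{\mathrm{mis}}
      + \Sigma_{\mathrm{mis},\mathrm{obs}}\,
        \Sigma_{\mathrm{obs},\mathrm{obs}}^{-1}
        \left(x_{\mathrm{obs}} - \mu_{\mathrm{obs}}\right)
    \right).
\]
% The matrix \Sigma_{\mathrm{mis},\mathrm{obs}} \Sigma_{\mathrm{obs},\mathrm{obs}}^{-1}
% is exactly what shifts the slopes of the observed features: the
% "compensation via covariances" described above. Note that
% \Sigma_{\mathrm{obs},\mathrm{obs}}^{-1} changes with each missingness
% pattern, so there are up to 2^p patterns to handle.
```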
Our approach: write down the optimal predictor under various assumptions, then approximate it with a differentiable composition of functions: a neural network.
This theory leads us to introduce a new non-linearity: multiplication by the missingness mask at each layer (minimal sketch below) 3/5
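A minimal PyTorch sketch of that mask non-linearity, assuming zero-imputation of the input and a mask that is 1 where a feature is observed; class and variable names are mine, and this illustrates the idea rather than the exact published architecture:

```python
import torch
import torch.nn as nn

class MaskBlock(nn.Module):
    """One layer: a linear map, then elementwise multiplication by the
    missingness mask (the new non-linearity from the thread)."""
    def __init__(self, n_features):
        super().__init__()
        self.linear = nn.Linear(n_features, n_features, bias=False)

    def forward(self, h, mask):
        # mask is 1.0 where the feature is observed, 0.0 where missing
        return self.linear(h) * mask

class MaskedNet(nn.Module):
    """Stack of mask blocks followed by a linear read-out."""
    def __init__(self, n_features, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MaskBlock(n_features) for _ in range(depth)]
        )
        self.out = nn.Linear(n_features, 1)

    def forward(self, x, mask):
        h = torch.nan_to_num(x) * mask  # zero-impute the NaNs, then mask
        for block in self.blocks:
            h = block(h, mask)
        return self.out(h)
```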
This non-linearity has a much better approximation capability than wide or deep MLPs, in theory and in practice
(our previous work showed that wide ReLU MLPs are consistent with missing values proceedings.mlr.press/v108/morvan20a… ) 4/5
These approximations are good for multiple missing-value mechanisms, including missing _not_ at random, unlike EM or imputation (which don't scale to many features).
The trick: differentiable programming to optimize a predictor function well suited to missing values (training sketch below)
5/5
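To illustrate the differentiable-programming point, a hypothetical end-to-end training loop for the MaskedNet sketched above (synthetic data, my own hyperparameters):

```python
import torch
import torch.nn.functional as F

# Synthetic regression data with roughly 30% of entries missing
torch.manual_seed(0)
x = torch.randn(256, 10)
x[torch.rand_like(x) < 0.3] = float("nan")
y = torch.randn(256, 1)

mask = (~torch.isnan(x)).float()
model = MaskedNet(n_features=10, depth=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# The whole predictor, mask handling included, is differentiable,
# so we simply minimize the prediction loss by gradient descent.
for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(model(x, mask), y)
    loss.backward()
    opt.step()
```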
This work will be presented in French on Tuesday 08/12 over lunch
Even for science and medical applications, I am becoming weary of fine statistical modeling efforts, and believe that we should standardize on a handful of powerful and robust methods.
Given two sets of observations, how do we know if they are drawn from the same distribution? Short answer in the thread…
For instance, do McDonald’s and KFC use different logic to position restaurants? Difficult question! We have access to data points, but not to the underlying generative mechanism, which is governed by marketing strategies.
To capture the information in the spatial proximity of data points, kernel mean embeddings are useful. They are intuitively related to Kernel Density Estimates.
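A hedged Python sketch of both ideas: the empirical kernel mean embedding (which, with an RBF kernel, is a kernel density estimate up to normalization) and the resulting two-sample statistic, the Maximum Mean Discrepancy (MMD). The toy data, the bandwidth gamma, and all function names are my own:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mean_embedding(X, T, gamma=1.0):
    """Empirical kernel mean embedding of sample X, evaluated at points T:
    mu_X(t) = mean_i k(x_i, t). With an RBF kernel this is, up to a
    normalization constant, a kernel density estimate of X at T."""
    return rbf_kernel(X, T, gamma=gamma).mean(axis=0)

def mmd2(X, Y, gamma=1.0):
    """Squared MMD = ||mu_X - mu_Y||^2 in the kernel's feature space,
    estimated here with the (biased) V-statistic from two samples."""
    return (rbf_kernel(X, X, gamma=gamma).mean()
            + rbf_kernel(Y, Y, gamma=gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma=gamma).mean())

# Toy stand-in for two chains' restaurant coordinates
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))   # chain A locations
Y = rng.normal(0.5, 1.0, size=(300, 2))   # chain B locations
print(mmd2(X, Y))  # larger values suggest different spatial distributions
```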