What are Bayesian neural network posteriors really like? With high-fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more. arxiv.org/abs/2104.14421
With @Pavel_Izmailov, @sharadvikram, and Matthew D. Hoffman. 1/10
We show that Bayesian neural networks reassuringly provide good generalization, outperforming deep ensembles, standard training, and many approximate inference procedures, even with a single chain. 2/10
However, we find that BNNs are surprisingly poor at OOD generalization, even worse than SGD, despite the popularity of approximate inference in this setting, and the relatively good performance of BNNs for OOD detection. 3/10
Even though deep ensembles are often talked about as a "non-Bayesian" alternative to standard approximate inference, we find they approximate the HMC predictive distribution better than MFVI, and about as well as standard SGLD. 4/10
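As a rough illustration of how one might quantify fidelity to the HMC predictive distribution (an illustrative metric, not necessarily the paper's exact evaluation protocol), one can compare per-example predictive probabilities with total variation distance and top-1 agreement:

```python
import numpy as np

def predictive_fidelity(p_ref, p_approx):
    """Compare two sets of per-example predictive distributions.

    p_ref, p_approx: arrays of shape (num_examples, num_classes), rows summing
    to 1 (e.g. HMC vs. an approximate method). Returns the mean total variation
    distance and the top-1 agreement.
    """
    tv = 0.5 * np.abs(p_ref - p_approx).sum(axis=1).mean()
    agree = (p_ref.argmax(axis=1) == p_approx.argmax(axis=1)).mean()
    return tv, agree

# Toy usage: 3 examples, 2 classes.
p_hmc = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
p_vi  = np.array([[0.7, 0.3], [0.4, 0.6], [0.45, 0.55]])
print(predictive_fidelity(p_hmc, p_vi))
```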
There has been much attention lately on "cold posteriors" in BDL, where the posterior raised to a power 1/T with T<1 can lead to better results. We see little evidence for a general cold posterior effect; what effect there is appears to be largely an artifact of data augmentation. 5/10
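For concreteness, a "cold" posterior simply rescales the log joint by 1/T. A minimal sketch of the tempered log density (not the paper's sampler implementation):

```python
def tempered_log_posterior(log_likelihood, log_prior, T=1.0):
    """Unnormalized log density of the posterior raised to the power 1/T.

    T = 1 recovers the usual Bayes posterior; T < 1 gives a "cold" (sharper)
    posterior, T > 1 a "warm" (flatter) one.
    """
    return (log_likelihood + log_prior) / T

# Example: a colder temperature sharpens the density around high-probability weights.
print(tempered_log_posterior(-10.0, -2.0, T=1.0))   # -12.0
print(tempered_log_posterior(-10.0, -2.0, T=0.5))   # -24.0
```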
We explored Gaussian, mixture of Gaussian, and heavy-tailed logistic priors, which performed similarly, although the heavy-tailed priors did slightly better. We also found performance relatively insensitive to the scale of the Gaussian prior... 6/10
...these results highlight the relative importance of the architecture compared to the distribution over weights in defining the induced prior over functions. Indeed, other work shows that even standard Gaussian priors have many useful properties: arxiv.org/abs/2002.08791. 7/10
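To make the prior families concrete, here is a hedged sketch of the three weight-prior log densities; the scales below are illustrative, not the settings used in the paper:

```python
import numpy as np
from scipy import stats

w = np.linspace(-5, 5, 1001)          # a grid of weight values

# Gaussian prior, N(0, alpha^2)
log_gauss = stats.norm.logpdf(w, loc=0.0, scale=1.0)

# Two-component mixture of Gaussians (narrow + wide component)
log_mog = np.logaddexp(np.log(0.5) + stats.norm.logpdf(w, scale=0.3),
                       np.log(0.5) + stats.norm.logpdf(w, scale=2.0))

# Logistic prior: heavier (exponential rather than Gaussian) tails
log_logistic = stats.logistic.logpdf(w, loc=0.0, scale=1.0)

# The tails differ most for large |w|:
print(log_gauss[-1], log_mog[-1], log_logistic[-1])
```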
We present many other results, including mixing in function space vs. weight space, posterior geometry and mode connecting paths, single chain vs. multi-chain...! 8/10
Many of the results, both positive and negative for BDL, are contrary to conventional wisdom. 9/10
We worked hard to obtain these HMC samples, which we plan to release as a public resource, as a reference for evaluating more practical alternatives to HMC, and for researchers to explore their own questions around approximate inference in BDL. 10/10
Naively using LLMs like GPT-3 for time series extrapolation can fail out of the box because of suboptimal tokenization and preprocessing. We show that if we tokenize numbers to individual digits, LLMs really shine! 2/7
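A minimal sketch of the kind of digit-level encoding meant here; details such as rescaling, fixed precision, and separator choice depend on the target tokenizer and follow the paper, so treat this as illustrative:

```python
def encode_series(values, precision=2, sep=" , "):
    """Encode a numeric time series as a string whose digits become
    individual tokens for typical LLM tokenizers (spaces between digits)."""
    encoded = []
    for v in values:
        s = f"{v:.{precision}f}"
        encoded.append(" ".join(s))          # e.g. "1 2 . 3 4"
    return sep.join(encoded)

print(encode_series([12.34, 5.6, 780.0]))
# -> "1 2 . 3 4 , 5 . 6 0 , 7 8 0 . 0 0"
```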
We also show that language models are surprisingly natural probabilistic models of continuous data, acting like hierarchical softmax distributions over numbers when tokenized into individual digits. This allows them to fit challenging distributions common in time series. 3/7
Last year at ICML, we presented marginal likelihood pathologies in model selection and hyperparameter learning. We now have a 60-page JMLR extension featuring: 1) should we be comforted by connections with PAC-Bayes? 2) approximations; 3) architecture search.
To recap, the marginal likelihood answers the question "how likely is my prior to generate the training data?", which is fundamentally different from "will my trained model provide good generalization?", leading to many discrepancies. 2/16
In short, the log marginal likelihood (LML) can underfit, overfit, and heavily penalize diffuse priors that provide good generalization. The decomposition of the LML into a sum of log p(D_i | D_{<i}) suggests a partial remedy, the conditional LML (CLML), removing the first terms. 3/16
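In code, the decomposition and its conditional variant are just full and partial sums of the per-datum conditional log likelihoods. A sketch; how many initial terms to drop (m below) and how to order the data are design choices discussed in the paper:

```python
import numpy as np

def lml_and_clml(cond_log_liks, m):
    """cond_log_liks[i] = log p(D_i | D_{<i}, M) for some data ordering.

    LML  = sum over all terms.
    CLML = sum over terms i >= m, i.e. condition on the first m points
    rather than scoring them under the prior.
    """
    cond_log_liks = np.asarray(cond_log_liks)
    lml = cond_log_liks.sum()
    clml = cond_log_liks[m:].sum()
    return lml, clml

print(lml_and_clml([-2.3, -1.1, -0.7, -0.5, -0.4], m=2))
```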
The marginal likelihood (evidence) provides an elegant approach to hypothesis testing and hyperparameter learning, but it has fascinating limits as a proxy for generalization, along with resolutions.
The search for scientific truth is elusive. How do we select between theories which are entirely consistent with any data we observe? The marginal likelihood p(D|M) -- the probability we would generate our observations from our prior model -- provides a compelling approach. 2/23
MacKay's book, Ch. 28, makes a nice case: a simple model can't generate many datasets, but since p(D|M) is a normalized probability density, it gives high probability to the data it can generate. For a given dataset, the most constrained model wins, encoding "Occam's razor". 3/23
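A tiny worked example of this Occam effect (my own toy illustration, not from the thread): compare a constrained coin model that fixes theta = 0.5 with a flexible model that places a uniform prior over theta. For near-balanced data, the constrained model attains higher evidence because it does not spread probability over datasets it was never meant to explain:

```python
import numpy as np
from math import comb
from scipy.special import betaln

n, k = 10, 5                       # 10 flips, 5 heads: data the simple model can generate

# Simple model M1: theta fixed at 0.5
log_evidence_m1 = np.log(comb(n, k)) + n * np.log(0.5)

# Flexible model M2: theta ~ Uniform(0, 1), so p(D|M2) = C(n,k) * B(k+1, n-k+1)
log_evidence_m2 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1)

print(log_evidence_m1, log_evidence_m2)   # M1 wins for this balanced dataset
```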
Suppose for instance there are dead pixels in an image. The weights attached to these pixels don’t affect the predictions, and so MAP (regularized optimization) drives them to zero. A BMA instead samples these weights from the prior... 2/5
...For in-distribution test data this behaviour doesn’t hurt generalization. But now suppose for example some corruption is added to the image. Now the BMA is activating connections that should be dead, hurting OOD generalization! 3/5
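A hypothetical toy version of this dead-pixel argument in a linear model (my illustration, not the paper's experiment): the dead feature is zero in training, so MAP drives its weight to zero, while posterior samples leave it at prior scale; a corrupted test input then perturbs the BMA predictions but not the MAP prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: feature 0 is informative, feature 1 is a "dead pixel" (always 0).
X = np.stack([rng.normal(size=100), np.zeros(100)], axis=1)
w_true = np.array([2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

# MAP with an L2 penalty (Gaussian prior): the dead weight is driven to 0.
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Crude "BMA": sample the dead weight from the prior (the data never constrains it).
w_samples = np.stack([np.full(500, w_map[0]), rng.normal(0.0, 1.0, 500)], axis=1)

# Corrupted test input: the dead pixel is now active.
x_corrupt = np.array([1.0, 3.0])
print("MAP prediction:", x_corrupt @ w_map)
print("BMA prediction mean/std:", (w_samples @ x_corrupt).mean(), (w_samples @ x_corrupt).std())
```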
Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher.
We decouple our understanding of good fidelity (high student-teacher agreement) from good student generalization. 2/10
The conventional narrative is that knowledge distillation "distills knowledge" from a big teacher to a small student through the information in soft labels. However, in actuality, the student is often not much more like the teacher than an independently trained model! 3/10
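Fidelity here can be made precise with a simple agreement metric between teacher and student predictions; one illustrative measurement, not necessarily the paper's exact protocol:

```python
import numpy as np

def top1_agreement(teacher_logits, student_logits):
    """Fraction of examples where student and teacher make the same top-1 prediction."""
    return (teacher_logits.argmax(axis=1) == student_logits.argmax(axis=1)).mean()

def accuracy(logits, labels):
    return (logits.argmax(axis=1) == labels).mean()

# Toy usage with random logits: a student can match the teacher's accuracy while
# agreeing with it barely more often than an independently trained model would.
rng = np.random.default_rng(1)
teacher = rng.normal(size=(1000, 10))
student = rng.normal(size=(1000, 10))
labels = rng.integers(0, 10, size=1000)
print(top1_agreement(teacher, student), accuracy(student, labels))
```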
This is a real problem with the way machine learning is often taught: ML seems like a disjoint laundry list of methods and topics to memorize. But in actuality the material is deeply unified... 1/8
From a probabilistic perspective, whether we are doing supervised, semi-supervised, or unsupervised learning, forming our training objective involves starting with an observation model, turning it into a likelihood, introducing a prior, and then taking our log posterior. 2/8
Our negative log posterior factorizes as -log p(w|D) = -log p(D|w) - log p(w) + c, where 'w' are parameters we want to estimate, and 'D' is the data. For regression with Gaussian noise, our negative log likelihood is squared error. Laplace noise? We get absolute error. 3/8
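As a quick numerical check of that correspondence (a sketch assuming unit noise scale), the Gaussian negative log likelihood is squared error plus a constant, and the Laplace negative log likelihood is absolute error plus a constant:

```python
import numpy as np

y, y_hat = 3.0, 2.2          # observation and model prediction
sigma, b = 1.0, 1.0          # Gaussian std and Laplace scale (assumed unit here)

nll_gaussian = 0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * (y - y_hat)**2 / sigma**2
nll_laplace  = np.log(2 * b) + np.abs(y - y_hat) / b

# Up to additive constants (and the 1/(2 sigma^2) factor), these are exactly
# squared error and absolute error in (y - y_hat).
print(nll_gaussian, nll_laplace)
```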