9 Dec, 10 tweets, 2 min read
In practice, standard "deep ensembles" of independently trained models provide a relatively compelling Bayesian model average. This point is often overlooked because we are used to viewing Bayesian methods as sampling from some (approximate) posterior... 1/10
...to form a model average, via simple Monte Carlo. But if we instead directly consider what we ultimately want to compute, the integral corresponding to the marginal predictive distribution (the predictive distribution not conditioning on weights)... 2/10
...then deep ensembles are in practice a _better_ approximation to the Bayesian model average than methods that are conventionally accepted as Bayesian (such as Laplace, variational methods with a Gaussian posterior, etc.). 3/10
This isn't just an issue of semantics, but a practically and conceptually important realization. It makes sense to view the Bayesian integration problem in DL as an active learning problem under severe computational constraints, and that's what deep ensembles are doing. 4/10
If you have to approximate p(y|D) = \int p(y|w) p(w|D) dw by querying a handful of points in weight space, you wouldn't even want exact samples from the posterior p(w|D)! You would care about getting high density (and typical) points with functional variability. 5/10
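The point about querying only a handful of weight settings can be made concrete with a tiny sketch. This is an illustrative toy, not the paper's code: the "ensemble members" below are just random logit functions standing in for networks trained from different random initializations, and `softmax` is the usual predictive head. The key move is averaging the *predictive distributions* p(y|x, w_i), not the weights or logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for M independently trained networks: random linear
# logit functions. In practice each would be a net trained from
# a different random initialization (a deep ensemble).
M, num_classes = 5, 3
x = rng.normal(size=(10, 4))                  # a batch of 10 inputs
members = [rng.normal(size=(4, num_classes)) for _ in range(M)]

# Equally weighted "poor man's BMA": average the predictive
# distributions p(y | x, w_i) over the ensemble members.
probs = np.stack([softmax(x @ W) for W in members])  # (M, 10, C)
bma_pred = probs.mean(axis=0)                        # (10, C)
```

Each member contributes a full distribution over classes, so the average is itself a valid distribution; with members sitting in different high-density modes, this is a crude but effective approximation to the BMA integral.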
Would these points be equally weighted in a BMA? Roughly, yes: the posterior density at each mode is typically about the same. And do ensembles even have to be equally weighted? No. 6/10
Do BMAs for neural networks contain an infinite number of points? If we do the integral exactly, yes. But in practice, not even close! In practice, conventional deep BNN methods are taking about a dozen samples from a unimodal approximate posterior. 7/10
Under standard computational constraints, deep ensembles will get you a better approximation of the BMA than most conventional Bayesian approaches. You could do better if you ran a good MCMC method for a really really long time. 8/10
Are ensembles and Bayesian methods always the same? No. Ensembling methods can enrich the hypothesis space, whereas Bayesian methods assume one correct hypothesis. But this distinction doesn’t apply for "deep ensembles", which don't enrich the hypothesis space. 9/10
We discuss these topics, and many others, in “Bayesian Deep Learning and a Probabilistic Perspective of Generalization" (arxiv.org/pdf/2002.08791…). See you at NeurIPS on Th! neurips.cc/virtual/2020/p… 10/10


# More from @andrewgwils

27 Oct
We can greatly simplify Hamiltonian and Lagrangian neural nets by working in Cartesian coordinates with explicit constraints, leading to dramatic performance improvements! Our #NeurIPS2020 paper: arxiv.org/abs/2010.13581
with @m_finzi, @KAlexanderWang. 1/5
Complex dynamics can be described more simply with higher levels of abstraction. For example, a trajectory can be found by solving a differential equation. The differential equation can in turn be derived by a simpler Hamiltonian or Lagrangian, which is easier to model. 2/5
We can move further up the hierarchy of abstraction by working in Cartesian coordinates and explicitly representing constraints with Lagrange multipliers, for constrained Hamiltonian and Lagrangian neural networks (CHNNs and CLNNs) that face a much easier learning problem. 3/5
26 May
Effective dimension compares favourably to popular path-norm and PAC-Bayes flatness measures, including double descent and width-depth trade-offs! We have just posted this new result in section 7 of our paper on posterior contraction in BDL: arxiv.org/abs/2003.02139. 1/16
The plots are most interpretable for comparing models of similar train loss (e.g. above the green partition). N_eff(Hess) = effective dimension of the Hessian at convergence. 2/16
Both path-norm and PAC-Bayes flatness variants perform well in the recent fantastic generalization measures paper of Jiang et al. (2019): arxiv.org/abs/1912.02178. 3/16
21 Feb
Our new paper "Bayesian Deep Learning and a Probabilistic Perspective of Generalization": arxiv.org/abs/2002.08791. Includes (1) benefits of BMA; (2) BMA <-> Deep Ensembles; (3) new methods; (4) BNN priors; (5) generalization in DL; (6) tempering in BDL. With @Pavel_Izmailov. 1/19
Since neural nets can fit images with noisy labels, it has been suggested we should rethink generalization. But this behaviour is understandable from a probabilistic perspective: we want to support any possible solution, but also have good inductive biases. 2/19
The inductive biases determine what solutions are a priori likely. Indeed, we show this seemingly mysterious behaviour is not unique to neural nets: GPs with RBF kernels can perfectly fit noisy CIFAR, but also generalize on the noise free problem. 3/19
27 Dec 19
Bayesian methods are *especially* compelling for deep neural networks. The key distinguishing property of a Bayesian approach is marginalization instead of optimization, not the prior, or Bayes rule. This difference will be greatest for underspecified models like DNNs. 1/18
In particular, the predictive distribution we often want to find is p(y|x,D) = \int p(y|x,w) p(w|D) dw. 'y' is an output, 'x' an input, 'w' the weights, and D the data. This is not a controversial equation, it is simply the sum and product rules of probability. 2/18
Rather than betting everything on a single hypothesis, we want to use every setting of parameters, weighted by posterior probabilities. This procedure is known as a Bayesian model average (BMA). 3/18
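The BMA integral above can be checked end-to-end in a toy model where the posterior is known in closed form. This is my own illustrative example, not from the paper: a Bernoulli likelihood with a Beta(1, 1) prior, where the exact BMA predictive is the posterior mean and simple Monte Carlo over posterior samples recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy where the BMA is exact: Bernoulli likelihood, Beta(1, 1)
# prior. Observing k heads in n flips gives the posterior
# p(theta | D) = Beta(1 + k, 1 + n - k).
k, n = 7, 10
a, b = 1 + k, 1 + (n - k)

# Exact BMA: p(y = 1 | D) = E[theta | D] = a / (a + b).
exact = a / (a + b)

# Simple Monte Carlo BMA: average p(y = 1 | theta_s) over
# posterior samples theta_s ~ p(theta | D), i.e. marginalize
# over parameters rather than optimizing a single theta.
samples = rng.beta(a, b, size=100_000)
mc = samples.mean()
```

Here every sampled parameter gets a vote weighted by how often the posterior produces it; in deep learning the posterior is intractable, which is what forces the "few well-chosen points" approximations discussed above.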