Bayesian methods are *especially* compelling for deep neural networks. The key distinguishing property of a Bayesian approach is marginalization instead of optimization, not the prior or Bayes' rule. This difference will be greatest for underspecified models like DNNs. 1/18
In particular, the predictive distribution we often want to find is p(y|x,D) = \int p(y|x,w) p(w|D) dw. 'y' is an output, 'x' an input, 'w' the weights, and D the data. This is not a controversial equation, it is simply the sum and product rules of probability. 2/18
Rather than betting everything on a single hypothesis, we want to use every setting of parameters, weighted by posterior probabilities. This procedure is known as a Bayesian model average (BMA). 3/18
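A minimal sketch of how this BMA predictive is typically approximated in practice, via Monte Carlo samples from an (approximate) posterior; the `model` interface and `posterior_weight_samples` here are hypothetical placeholders, not something prescribed in this thread:

```python
import torch

def bma_predict(model, posterior_weight_samples, x):
    r"""Monte Carlo estimate of p(y|x,D) = \int p(y|x,w) p(w|D) dw.

    Each element of posterior_weight_samples is one draw w_k ~ p(w|D);
    we average the per-sample predictive distributions p(y|x,w_k).
    """
    probs = []
    for w in posterior_weight_samples:
        model.load_state_dict(w)              # load one posterior sample into the network
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)     # averaged predictive = BMA estimate
```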
Classical training can be viewed as approximate Bayesian inference, where the approximate posterior is a delta function centred at the max likelihood or MAP setting of the parameters, p(w|D) ≈ \delta(w=w_{MAP}). Thus many alternatives, albeit imperfect, can be preferable. 4/18
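Written out, substituting this delta-function posterior into the predictive integral from 2/18 collapses the model average onto a single plug-in prediction (a standard identity, spelled out here for clarity):

```latex
p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw
\approx \int p(y \mid x, w)\, \delta(w - w_{\mathrm{MAP}})\, dw
= p(y \mid x, w_{\mathrm{MAP}})
```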
The difference will depend on how concentrated the posterior p(w|D) becomes. If it is sharply peaked, there may be almost no difference between classical and Bayesian approaches. However, DNNs are underspecified by the data, and will thus have diffuse likelihoods p(D|w). 5/18
Not only are the likelihoods diffuse, but NNs are capable of expressing many compelling and different representations, corresponding to different parameters. This is exactly when we *most* want to do a BMA. We will get a rich ensemble, for better accuracy and calibration. 6/18
The recent success of deep ensembles is not discouraging, but indeed great motivation for BDL. These ensembles are formed from the *same DNN architecture*, with weights trained from different random initializations. It *is* approximate BMA, using weights with high likelihood and functional diversity. 7/18
Rather than a single point mass, we are now using multiple point masses in good locations, which will be a much better approximation of the BMA integral we are trying to solve. 8/18
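A minimal sketch of that view of deep ensembles, assuming hypothetical `make_model` and `train_map` helpers: each independently trained solution acts as one point mass, and the predictive distributions are averaged with equal weight:

```python
import torch

def deep_ensemble_predict(make_model, train_map, data, x, n_members=5):
    """Approximate the BMA with several point masses: networks trained
    independently from random initializations, predictions averaged."""
    probs = []
    for _ in range(n_members):
        model = make_model()        # fresh random initialization
        train_map(model, data)      # standard (MAP-style) training
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)   # equal-weight mixture of point masses
```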
Regarding priors, the prior that matters is the prior in function space, not parameter space. In the case of a Gaussian process, a vague prior (a white noise prior) would be disastrous, because it is a prior directly in function space. 9/18
However, when we combine a vague prior over parameters with a structured functional form such as a CNN, it induces a structured prior in function space. Indeed, the inductive biases and equivariance constraints in such models are why they work well in classical settings. 10/18
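One way to see the induced function-space prior, sketched with a small hypothetical CNN: draw weights from a vague Gaussian prior and look at the resulting random functions, which inherit convolutional structure (locality, translation equivariance) even though the parameter prior itself is unstructured:

```python
import torch
import torch.nn as nn

# A small CNN: the architecture, not the vague N(0, sigma^2) prior on its
# weights, is what gives the induced prior over functions its structure.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

def sample_prior_functions(net, x, n_samples=8, sigma=1.0):
    """Draw parameters from a vague Gaussian prior and evaluate the
    resulting random functions on inputs x."""
    draws = []
    with torch.no_grad():
        for _ in range(n_samples):
            for p in net.parameters():
                p.normal_(0.0, sigma)       # w ~ N(0, sigma^2): vague in parameter space
            draws.append(net(x).clone())    # one sample from the induced prior over functions
    return torch.stack(draws)

x = torch.randn(4, 1, 28, 28)               # e.g. a batch of image-shaped inputs
prior_function_draws = sample_prior_functions(cnn, x)
```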
Vague priors over parameters are often also a reasonable description of our a priori subjective beliefs, and typically much better than entirely ignoring epistemic uncertainty, which leads to worse performance and miscalibration. 11/18
There are many examples where flat priors combined with *marginalization* sidestep pathologies of max likelihood. Priors without marginalization are simply regularization, but Bayesian methods are not about regularization (MacKay, Ch 28). 12/18
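A textbook illustration of that point (an aside, not an example from this thread): for a Gaussian likelihood with known variance and a flat prior on the mean, marginalization widens the predictive relative to the maximum-likelihood plug-in, correcting exactly the kind of overconfidence plug-in estimates produce:

```latex
% y_1,\dots,y_n \sim \mathcal{N}(\mu, \sigma^2), \ \sigma^2 known, flat prior p(\mu) \propto 1:
p(\mu \mid D) = \mathcal{N}\big(\bar{y}, \sigma^2/n\big), \qquad
p(y^\ast \mid D) = \int \mathcal{N}(y^\ast \mid \mu, \sigma^2)\, p(\mu \mid D)\, d\mu
                 = \mathcal{N}\big(\bar{y}, \sigma^2(1 + 1/n)\big),
% versus the plug-in \mathcal{N}(\bar{y}, \sigma^2), which understates the predictive variance.
```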
And there is a whole field of research devoted to approximate Bayesian methods with uninformative priors over *parameters* (but not functions). It is well-motivated, marginalization is still compelling, and the results are often better than regularized optimization. 13/18
By accounting for epistemic uncertainty through uninformative *parameter* (but not function) priors, we, as a community, have developed BDL methods with improved calibration, reliable predictive distributions, and improved accuracy. 14/18
Of course, we can always make better assumptions — Bayesian or not. We should strive to build more interpretable parameter priors. And we should build better posterior approximations. Ensembles are a promising step in this direction. 15/18
But we should not undermine the progress we are making so far. Bayesian inference is especially compelling for DNNs. BDL is gaining visibility because we are making progress, with good and increasingly scalable practical results. We shouldn’t discourage these efforts. 16/18
If we are shying away from an approximate Bayesian approach because of some challenge or imperfection, we should always ask, “what’s the alternative”? The alternative may indeed be a more impoverished representation of the predictive distribution we want to compute. 17/18
And we should not back away from challenges. I present some of these views in my talk at the #NeurIPS2019 Bayesian Deep Learning Workshop, starting at 6m40s: slideslive.com/38921875/bayes… 18/18