There is a lot of often-overlooked evidence that standard p(w) = N(0, a*I) priors combined with a NN f(x,w) induce a distribution over functions p(f(x)) with useful properties!... 1/15
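A minimal sketch of what sampling from this induced p(f(x)) looks like in practice, using a toy MLP rather than the architectures discussed in the thread: draw w ~ N(0, a*I), evaluate f(x, w), repeat. (Real setups typically scale the per-layer variance by fan-in; that detail is omitted here, and `sample_prior_function` is my own illustrative helper, not code from the paper.)

```python
# Toy sketch: sample functions from the prior induced by p(w) = N(0, a*I).
import torch
import torch.nn as nn

def sample_prior_function(net, a=1.0):
    """Overwrite every parameter of `net` with a draw from N(0, a*I)."""
    with torch.no_grad():
        for p in net.parameters():
            p.copy_(a ** 0.5 * torch.randn_like(p))
    return net

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(100, 2)  # a batch of inputs

# Each weight sample w gives one function sample f(., w); stacking them gives
# draws from the induced prior over functions p(f(x)).
with torch.no_grad():
    fs = torch.stack([sample_prior_function(net, a=1.0)(x) for _ in range(10)])
print(fs.shape)  # torch.Size([10, 100, 1])
```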
The deep image prior shows this p(f(x)) captures low-level image statistics useful for image denoising, super-resolution, and inpainting. The rethinking generalization paper shows pre-processing data with a randomly initialized CNN can dramatically boost performance. 2/15
We show that the induced p(f(x)) has a reasonable correlation function, such that visually similar images are more correlated a priori. Moreover, the flatness arguments for SGD generalization suggest that good solutions take up a large volume in the corresponding posteriors. 3/15
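To make the correlation claim concrete, here is a small Monte Carlo estimate of the prior correlation between f(x1) and f(x2), reusing `sample_prior_function` and `net` from the sketch above. The actual experiments use pairs of images and real architectures; this uses nearby vs. distant toy inputs as a stand-in for "visually similar" vs. dissimilar images.

```python
# Estimate the prior correlation between function values at two inputs by
# averaging over weight samples w ~ N(0, a*I).
def prior_correlation(net, x1, x2, a=1.0, n_samples=2000):
    f1, f2 = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            sample_prior_function(net, a)
            f1.append(net(x1).item())
            f2.append(net(x2).item())
    samples = torch.tensor([f1, f2])          # shape (2, n_samples)
    return torch.corrcoef(samples)[0, 1].item()

x1 = torch.randn(1, 2)
x_near, x_far = x1 + 0.05 * torch.randn(1, 2), torch.randn(1, 2)
# Nearby ("similar") inputs should be more correlated under the prior than distant ones.
print(prior_correlation(net, x1, x_near), prior_correlation(net, x1, x_far))
```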
We can also quantify this intuition and show that this prior leads to a marginal likelihood that favours structured image datasets over noisy image datasets, even if the network is able to perfectly fit the noisy datasets. 4/15
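For illustration only (this is not the estimator used in the paper), a naive simple Monte Carlo estimate of log p(D) under the prior shows what is being compared: the hypothetical `naive_log_marginal_likelihood` below averages the likelihood over prior weight samples. It is far too high-variance to be useful at scale, but it makes the quantity concrete. It assumes `sample_prior_function` from the first sketch.

```python
# Naive simple MC estimate: log p(D) ~ log[(1/S) * sum_s p(D | w_s)], w_s ~ N(0, a*I).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def naive_log_marginal_likelihood(clf, x, y, a=1.0, n_samples=2000):
    log_liks = []
    with torch.no_grad():
        for _ in range(n_samples):
            sample_prior_function(clf, a)
            log_liks.append(-F.cross_entropy(clf(x), y, reduction="sum"))
    return (torch.logsumexp(torch.stack(log_liks), dim=0) - math.log(n_samples)).item()

clf = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(200, 2)
y_structured = (x[:, 0] > 0).long()       # labels that depend on the inputs
y_noisy = torch.randint(0, 10, (200,))    # labels that don't

# Under a reasonable prior, the structured labelling should receive the higher value,
# even though a trained network could fit either one.
print(naive_log_marginal_likelihood(clf, x, y_structured),
      naive_log_marginal_likelihood(clf, x, y_noisy))
```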
We also show these priors lead to posteriors that can alleviate double descent and provide significant performance gains! There are indeed *many* results showing the Bayesian model average (BMA) improves performance. Part of the skepticism about priors stems from the misconception that BDL doesn't work. 5/15
What about the result showing that samples from p(f(x)) assign nearly all data to one class? We show that is an artifact of choosing a bad signal variance 'a' in the N(0,a*I) prior, such that the softmax saturates. The 'a' is easy to tune, correcting this behaviour. 6/15
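A quick way to see the saturation effect on a toy classifier (again reusing `clf`, `x`, and `sample_prior_function` from the sketches above; this is not the paper's experiment): with 'a' too large, the prior logits are huge and each prior sample puts nearly all of its mass on one class, while a moderate 'a' keeps prior predictive samples diffuse.

```python
# How the signal variance 'a' in p(w) = N(0, a*I) controls softmax saturation
# under the prior.
import torch.nn.functional as F

for a in (0.1, 1.0, 100.0):
    sample_prior_function(clf, a)
    with torch.no_grad():
        probs = F.softmax(clf(x), dim=-1)
    # Mean probability of the argmax class: ~0.1 means diffuse prior predictions
    # over 10 classes; ~1.0 means the softmax has saturated to one-hot samples.
    print(a, probs.max(dim=-1).values.mean().item())
```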
This is also a very soft prior bias, which is quickly modulated by data. Even after observing a little data, we see the bias quickly goes away in the posterior. We also see the actual predictive distribution (prev plot, row 3), even in the misspecified prior, is reasonable! 7/15
These nice results are intuitive. Many of the function-space properties of the prior, such as translation equivariance, are controlled by the architecture design. Designing a better prior would largely amount to architecture engineering. 8/15
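As a concrete example of an inductive bias that comes from the architecture alone: a randomly initialized convolution is already translation equivariant, so every draw from the induced function-space prior respects shifts before seeing any data. A minimal check, using circular padding and a circular shift so the identity holds exactly:

```python
# Translation equivariance of a random conv layer: convolving a shifted image
# equals shifting the convolved image (exactly, with circular padding/shifts).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
img = torch.randn(1, 1, 32, 32)
shift = lambda t: torch.roll(t, shifts=5, dims=-1)  # shift 5 pixels to the right

with torch.no_grad():
    out_shift_then_conv = conv(shift(img))
    out_conv_then_shift = shift(conv(img))
print(torch.allclose(out_shift_then_conv, out_conv_then_shift, atol=1e-5))  # True
```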
But what about cold posteriors? Is it troubling that we sometimes improve results with T<1? While interesting, this is not necessarily bad news for BDL. There are many reasons this can happen, even with a well-specified prior and likelihood, Sec 8 (arxiv.org/pdf/2002.08791…). 9/15
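For readers outside the BDL literature: T here refers to posterior tempering. Under one common convention (tempering the full posterior; some papers temper only the likelihood, so treat this as one assumed definition), a sampler targets the density sketched below, with T = 1 recovering standard Bayes and T < 1 giving a "cold" posterior.

```python
# Unnormalized log density of a tempered ("cold" if T < 1) posterior:
# p_T(w | D) proportional to exp(-U(w) / T), with U(w) = -log p(D | w) - log p(w).
def tempered_log_posterior(log_likelihood: float, log_prior: float, T: float = 1.0) -> float:
    return (log_likelihood + log_prior) / T
```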
Can the priors be improved? Certainly. Architecture design would be a key avenue for improvement. In some cases, I am puzzled by the choice of a p(w) that induces a p(f(x)) behaving like a GP with a standard kernel. Are we throwing the baby out with the bathwater? 10/15
Why do this? Don't we already have GPs with standard kernels? NNs are a distinct model class precisely because they have useful complementary inductive biases. While the standard p(f(x)) from p(w)=N(0,a*I) may be hard to interpret, that doesn't make it bad. 11/15
A reason for GP-like priors could be an asymptotic computational advantage over regular GPs. But we have many methods that directly address GP scalability. Maybe a closed-form expression for posterior samples? But that is not how these priors are usually motivated. 12/15
Sometimes Masegosa's nice work (arxiv.org/abs/1912.08335) is used by others to claim we need better weight space priors. But misspecification in that paper is about not having enough support! Changing p(w) from N(0,a*I) is not going to help enlarge prior support! 13/15
I'm supportive of work trying to improve weight-space priors. But let's be careful not to uncritically adopt an overly pessimistic narrative because it appears "sober" and is a convenient rationalization for paper writing. Something doesn't need to be bad for us to make it better. 14/15
In short, there are many reasons to be optimistic! @Pavel_Izmailov and I discuss these reasons, and many other points, in our paper "Bayesian Deep Learning and a Probabilistic Perspective of Generalization": arxiv.org/abs/2002.08791. 15/15
Translation equivariance figure from Christian Wolf's blog, which also contains a nice animation: chriswolfvision.medium.com/what-is-transl…
Figure on 2/15 from the "Deep Image Prior" by Ulyanov, Vedaldi, Lempitsky: arxiv.org/abs/1711.10925
