Tweet

https://twitter.com/dabeaz/status/1398625259708993538

More from @andrewgwils

Andrew Gordon Wilson

@andrewgwils

30 Apr

@Pavel_Izmailov

What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.
arxiv.org/abs/2104.14421
With @Pavel_Izmailov, @sharadvikram, and Matthew D. Hoffman. 1/10

We show that Bayesian neural networks reassuringly provide good generalization, outperforming deep ensembles, standard training, and many approximate inference procedures, even with a single chain. 2/10

However, we find that BNNs are surprisingly poor at OOD generalization, even worse than SGD, despite the popularity of approximate inference in this setting, and the relatively good performance of BNNs for OOD detection. 3/10

Read 10 tweets

Andrew Gordon Wilson

@andrewgwils

29 Dec 20

There is a lot of often overlooked evidence that standard p(w) = N(0, a*I) priors combined with a NN f(x,w) induce a distribution over functions p(f(x)) with useful properties!... 1/15

The deep image prior shows this p(f(x)) captures low-level image statistics useful for image denoising, super-resolution, and inpainting. The rethinking generalization paper shows pre-processing data with a randomly initialized CNN can dramatically boost performance. 2/15

We show that the induced p(f(x)) has a reasonable correlation function, such that visually similar images are more correlated a priori. Moreover, the flatness arguments for SGD generalization suggest that good solutions take up a large volume in the corresponding posteriors. 3/15

Read 17 tweets

Andrew Gordon Wilson

@andrewgwils

9 Dec 20

In practice, standard "deep ensembles" of independently trained models provides a relatively compelling Bayesian model average. This point is often overlooked because we are used to viewing Bayesian methods as sampling from some (approximate) posterior... 1/10

...to form a model average, via simple Monte Carlo. But if we instead directly consider what we ultimately want to compute, the integral corresponding to the marginal predictive distribution (the predictive distribution not conditioning on weights)... 2/10

...then deep ensembles are in practice a _better_ approximation to the Bayesian model average than methods that are conventionally accepted as Bayesian (such as Laplace, variational methods with a Gaussian posterior, etc.). 3/10

Read 10 tweets

Andrew Gordon Wilson

@andrewgwils

27 Oct 20

@m_finzi

We can greatly simplify Hamiltonian and Lagrangian neural nets by working in Cartesian coordinates with explicit constraints, leading to dramatic performance improvements! Our #NeurIPS2020 paper: arxiv.org/abs/2010.13581
with @m_finzi, @KAlexanderWang. 1/5

Complex dynamics can be described more simply with higher levels of abstraction. For example, a trajectory can be found by solving a differential equation. The differential equation can in turn be derived by a simpler Hamiltonian or Lagrangian, which is easier to model. 2/5

We can move further up the hierarchy of abstraction by working in Cartesian coordinates and explicitly representing constraints with Lagrange multipliers, for constrained Hamiltonian and Lagrangian neural networks (CHNNs and CLNNs) that face a much easier learning problem. 3/5

Read 5 tweets

Andrew Gordon Wilson

@andrewgwils

26 May 20

Effective dimension compares favourably to popular path-norm and PAC-Bayes flatness measures, including double descent and width-depth trade-offs! We have just posted this new result in section 7 of our paper on posterior contraction in BDL: arxiv.org/abs/2003.02139. 1/16

The plots are most interpretable for comparing models of similar train loss (e.g. above the green partition). N_eff(Hess) = effective dimension of the Hessian at convergence. 2/16

Both path-norm and PAC-Bayes flatness variants perform well in the recent fantastic generalization measures paper of Jiang et. al (2019): arxiv.org/abs/1912.02178.
3/16

Read 16 tweets

Andrew Gordon Wilson

@andrewgwils

21 Feb 20

@Pavel_Izmailov

Our new paper "Bayesian Deep Learning and a Probabilistic Perspective of Generalization": arxiv.org/abs/2002.08791. Includes (1) benefits of BMA; (2) BMA <-> Deep Ensembles; (3) new methods; (4) BNN priors; (5) generalization in DL; (6) tempering in BDL. With @Pavel_Izmailov. 1/19

Since neural nets can fit images with noisy labels, it has been suggested we should rethink generalization. But this behaviour is understandable from a probabilistic perspective: we want to support any possible solution, but also have good inductive biases. 2/19

The inductive biases determine what solutions are a priori likely. Indeed, we show this seemingly mysterious behaviour is not unique to neural nets: GPs with RBF kernels can perfectly fit noisy CIFAR, but also generalize on the noise free problem. 3/19

Read 20 tweets

Share this page!

Andrew Gordon Wilson

Try unrolling a thread yourself!

More from @andrewgwils

Andrew Gordon Wilson

Andrew Gordon Wilson

Andrew Gordon Wilson

Andrew Gordon Wilson

Andrew Gordon Wilson

Andrew Gordon Wilson

Did Thread Reader help you today?

Like this author's thread?