This is a real problem with the way machine learning is often taught: ML seems like a disjoint laundry list of methods and topics to memorize. But in actuality the material is deeply unified... 1/8

From a probabilistic perspective, whether we are doing supervised, semi-supervised, or unsupervised learning, forming our training objective involves starting with an observation model, turning it into a likelihood, introducing a prior, and then taking our log posterior. 2/8

Our negative log posterior decomposes as -log p(w|D) = -log p(D|w) - log p(w) + c, where 'w' are the parameters we want to estimate, 'D' is the data, and 'c' is a constant that does not depend on w. For regression with Gaussian noise, our negative log likelihood is squared error (up to scaling and an additive constant). Laplace noise? We get absolute error. 3/8
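
To make the correspondence concrete, here is a minimal sketch in plain Python. It assumes i.i.d. noise, residuals r_i = y_i - f(x_i, w), and a Gaussian prior p(w) = N(0, alpha*I); the function names and the parameters `sigma`, `scale`, and `alpha` are illustrative choices, not something taken from the thread.

```python
import math


def gaussian_nll(residuals, sigma=1.0):
    """-log p(D|w) under i.i.d. Gaussian noise with std sigma (assumed)."""
    n = len(residuals)
    const = n * 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    # The w-dependent part is the sum of squared errors, scaled by 1/(2 sigma^2).
    return sum(r * r for r in residuals) / (2.0 * sigma ** 2) + const


def laplace_nll(residuals, scale=1.0):
    """-log p(D|w) under i.i.d. Laplace noise with the given scale (assumed)."""
    n = len(residuals)
    const = n * math.log(2.0 * scale)
    # The w-dependent part is the sum of absolute errors, scaled by 1/scale.
    return sum(abs(r) for r in residuals) / scale + const


def gaussian_neg_log_prior(w, alpha=1.0):
    """-log p(w) for p(w) = N(0, alpha * I): an L2 penalty plus a constant."""
    d = len(w)
    const = d * 0.5 * math.log(2.0 * math.pi * alpha)
    return sum(wi * wi for wi in w) / (2.0 * alpha) + const


def neg_log_posterior(residuals, w, sigma=1.0, alpha=1.0):
    """-log p(w|D) up to the constant c (the log evidence, independent of w)."""
    return gaussian_nll(residuals, sigma) + gaussian_neg_log_prior(w, alpha)
```

Minimizing `neg_log_posterior` over w is then the same as minimizing squared error plus an L2 (weight decay) penalty, since every other term is constant in w.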

What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.

arxiv.org/abs/2104.14421

With @Pavel_Izmailov, @sharadvikram, and Matthew D. Hoffman. 1/10

There is a lot of often overlooked evidence that standard p(w) = N(0, a*I) priors combined with a NN f(x,w) induce a distribution over functions p(f(x)) with useful properties!... 1/15
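
One way to look at the induced distribution over functions is to draw weights from the prior and evaluate the network. The sketch below is only an illustration: it assumes a one-hidden-layer tanh network with scalar input, treats 'a' as the prior variance, and uses a hypothetical helper name `sample_prior_function`.

```python
import math
import random


def sample_prior_function(hidden=16, a=1.0, rng=None):
    """Draw w ~ N(0, a * I) and return the function x -> f(x, w).

    The architecture (one tanh hidden layer, scalar input and output)
    is an assumption made for illustration.
    """
    rng = rng or random.Random()
    std = math.sqrt(a)  # 'a' is the variance, so the std is sqrt(a)
    w1 = [rng.gauss(0.0, std) for _ in range(hidden)]
    b1 = [rng.gauss(0.0, std) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, std) for _ in range(hidden)]
    b2 = rng.gauss(0.0, std)

    def f(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * h[j] for j in range(hidden)) + b2

    return f


# Each draw of w gives one sample from p(f(x)), evaluated on a small grid.
rng = random.Random(0)
xs = [-1.0, 0.0, 1.0]
draws = [[f(x) for x in xs]
         for f in (sample_prior_function(rng=rng) for _ in range(5))]
```

Plotting many such draws over a dense grid of x values shows what kinds of functions the prior favours before any data is seen.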

In practice, standard "deep ensembles" of independently trained models provide a relatively compelling Bayesian model average. This point is often overlooked because we are used to viewing Bayesian methods as sampling from some (approximate) posterior... 1/10

...to form a model average, via simple Monte Carlo. But if we instead directly consider what we ultimately want to compute, the integral corresponding to the marginal predictive distribution (the predictive distribution not conditioning on weights)... 2/10

...then deep ensembles are in practice a _better_ approximation to the Bayesian model average than methods that are conventionally accepted as Bayesian (such as Laplace, variational methods with a Gaussian posterior, etc.). 3/10
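
As a rough sketch of the model average in question: the marginal predictive p(y|x,D) = ∫ p(y|x,w) p(w|D) dw is approximated by (1/M) Σ_m p(y|x,w_m), where each w_m is one independently trained ensemble member. The logits below are made-up numbers for a single input, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predictive(member_logits):
    """Average the members' class probabilities (not their logits).

    member_logits: one list of class logits per ensemble member,
    all for the same input x.
    """
    probs = [softmax(logits) for logits in member_logits]
    n_members = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_members for c in range(n_classes)]


# Hypothetical logits from three independently trained models.
member_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.0], [1.0, 1.0, 1.0]]
bma = ensemble_predictive(member_logits)
```

Averaging probabilities rather than logits is what matches the integral: each member contributes its own predictive distribution p(y|x,w_m) with equal weight.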

We can greatly simplify Hamiltonian and Lagrangian neural nets by working in Cartesian coordinates with explicit constraints, leading to dramatic performance improvements! Our #NeurIPS2020 paper: arxiv.org/abs/2010.13581

with @m_finzi, @KAlexanderWang. 1/5

Effective dimension compares favourably to popular path-norm and PAC-Bayes flatness measures, including in tracking double descent and width-depth trade-offs! We have just posted this new result in section 7 of our paper on posterior contraction in BDL: arxiv.org/abs/2003.02139. 1/16

The plots are most interpretable for comparing models of similar train loss (e.g. above the green partition). N_eff(Hess) = effective dimension of the Hessian at convergence. 2/16
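
The thread does not spell out the formula for N_eff. The sketch below assumes the common definition N_eff(H, z) = Σ_i λ_i / (λ_i + z), where λ_i are the eigenvalues of the Hessian H and z > 0 is a regularization constant; the function name and the example eigenvalues are illustrative.

```python
def effective_dimension(eigenvalues, z=1.0):
    """N_eff = sum_i lam_i / (lam_i + z), under the assumed definition.

    eigenvalues: Hessian eigenvalues at convergence (assumed non-negative).
    z: positive regularization constant (assumed, e.g. set by the prior).
    """
    if z <= 0:
        raise ValueError("z must be positive")
    return sum(lam / (lam + z) for lam in eigenvalues)


# Hypothetical spectrum: two well-determined directions, two nearly flat ones.
n_eff = effective_dimension([100.0, 10.0, 0.01, 0.0], z=1.0)
```

Directions with λ_i much larger than z each count as roughly one dimension, while directions with λ_i much smaller than z count as roughly zero, so N_eff can be far below the raw parameter count.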

Both path-norm and PAC-Bayes flatness variants perform well in the recent fantastic generalization measures paper of Jiang et al. (2019): arxiv.org/abs/1912.02178. 3/16
