Andrew Gordon Wilson
Machine Learning Professor
Oct 13, 2023 7 tweets 3 min read
LLMs aren't just next-word predictors; they are also compelling zero-shot time series forecasters! Our new NeurIPS paper:

w/ @gruver_nate, @m_finzi, @ShikaiQiu
1/7 arxiv.org/abs/2310.07820
Naively using LLMs like GPT-3 for time series extrapolation can fail out of the box because of suboptimal tokenization and preprocessing. We show that if we tokenize numbers to individual digits, LLMs really shine!
2/7
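A rough sketch of the digit-level encoding idea (my own illustration; the exact scaling, precision, and separators in the paper may differ):

```python
# Minimal sketch: encode a numeric series so the tokenizer sees individual
# digits rather than arbitrary multi-digit chunks. Illustrative only; the
# exact scaling, precision, and separators used in the paper may differ.

def encode_series(values, precision=2):
    """[1.23, 4.5] -> '1 2 3 , 4 5 0': fixed precision, digits space-separated,
    time steps separated by ' , '."""
    tokens = []
    for v in values:
        digits = f"{v:.{precision}f}".replace(".", "")   # drop the decimal point
        tokens.append(" ".join(digits))
    return " , ".join(tokens)

def decode_series(text, precision=2):
    """Invert encode_series: rejoin digits and reinsert the decimal point."""
    out = []
    for chunk in text.split(","):
        s = "".join(chunk.split())
        out.append(float(s[:-precision] + "." + s[-precision:]))
    return out

history = [1.23, 4.50, 12.07]
prompt = encode_series(history)
print(prompt)                    # '1 2 3 , 4 5 0 , 1 2 0 7'
print(decode_series(prompt))     # [1.23, 4.5, 12.07]
# The encoded history serves as the LLM prompt; sampled continuations are
# decoded back into numbers to form probabilistic forecasts.
```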
Jul 6, 2023 16 tweets 6 min read
Last year at ICML, we presented marginal likelihood pathologies in model selection and hyperparameter learning. We now have a 60-page JMLR extension featuring: 1) should we be comforted by connections with PAC-Bayes? 2) approximations; 3) architecture search.

1/16 arxiv.org/abs/2202.11678
To recap, the marginal likelihood answers the question "how likely is my prior to generate the training data?", which is fundamentally different from "will my trained model provide good generalization?", leading to many discrepancies.
2/16
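To make "how likely is my prior to generate the training data?" concrete, here is a toy sketch (my own, not from the paper) for Bayesian linear regression, where the marginal likelihood is available in closed form and can be used for hyperparameter learning:

```python
# Toy sketch: log marginal likelihood of Bayesian linear regression,
# y = X w + noise, with prior w ~ N(0, a^2 I) and noise ~ N(0, s^2 I).
# Marginalizing out w gives y ~ N(0, a^2 X X^T + s^2 I), so log p(D|M)
# measures how likely the *prior* model is to generate the observed y.
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, prior_scale_a, noise_scale_s):
    n = X.shape[0]
    cov = prior_scale_a**2 * (X @ X.T) + noise_scale_s**2 * np.eye(n)
    return multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Hyperparameter learning: choose the prior scale under which the data
# are most probable a priori.
for a in [0.01, 0.1, 1.0, 10.0]:
    print(a, log_marginal_likelihood(X, y, a, noise_scale_s=0.1))
```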
Feb 24, 2022 23 tweets 8 min read
The marginal likelihood (evidence) provides an elegant approach to hypothesis testing and hyperparameter learning, but it has fascinating limits as a generalization proxy, with resolutions.

arxiv.org/abs/2202.11678

w/ @LotfiSanae, @Pavel_Izmailov, @g_benton_, @micahgoldblum 1/23 The search for scientific truth is elusive. How do we select between theories which are entirely consistent with any data we observe? The marginal likelihood p(D|M) -- the probability we would generate our observations from our prior model -- provides a compelling approach. 2/23
Jun 23, 2021 5 tweets 2 min read
Despite its popularity in the covariate shift setting, Bayesian model averaging can surprisingly hurt OOD generalization! arxiv.org/abs/2106.11905 1/5 Suppose for instance there are dead pixels in an image. The weights attached to these pixels don’t affect the predictions, and so MAP (regularized optimization) drives them to zero. A BMA instead samples these weights from the prior... 2/5
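A toy illustration of the dead-pixel mechanism (my own sketch, not the paper's experiments), using Bayesian linear regression where the posterior is exact:

```python
# Feature 1 is informative; feature 2 is a "dead pixel", always zero at
# train time. MAP shrinks its weight to zero, but the posterior over that
# weight stays at the prior, which matters once the pixel comes alive
# under covariate shift.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_live = rng.normal(size=n)
X_train = np.column_stack([x_live, np.zeros(n)])        # dead second feature
y = 2.0 * x_live + 0.1 * rng.normal(size=n)

s2, a2 = 0.1**2, 1.0**2                                 # noise / prior variances
A = X_train.T @ X_train / s2 + np.eye(2) / a2
post_cov = np.linalg.inv(A)
post_mean = post_cov @ X_train.T @ y / s2               # equals the MAP solution

print(post_mean)    # dead-feature weight is ~0, as MAP/regularization gives
print(post_cov)     # but its posterior variance is still the prior variance a2

x_test = np.array([1.0, 3.0])                           # the dead pixel lights up
pred_var = s2 + x_test @ post_cov @ x_test
print(pred_var)     # the BMA now injects prior-driven variation through that
                    # weight into its predictions, unlike the MAP point estimate
```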
Jun 11, 2021 10 tweets 5 min read
Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher.

arxiv.org/abs/2106.05945
With @samscub, @Pavel_Izmailov, @polkirichenko, Alex Alemi. 1/10 We decouple our understanding of good fidelity --- high student-teacher agreement --- from good student generalization. 2/10
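For reference, a hedged sketch of the standard ingredients (not the paper's exact code): the usual distillation objective, and the fidelity metric, i.e. top-1 agreement between student and teacher, which is distinct from student accuracy:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: KL divergence between temperature-softened teacher and
    # student distributions, plus the usual hard-label cross entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def agreement(student_logits, teacher_logits):
    # Fidelity: how often the student predicts the same class as the teacher,
    # regardless of whether either one is correct.
    return (student_logits.argmax(-1) == teacher_logits.argmax(-1)).float().mean()

# Smoke test with random logits (stand-ins for real student/teacher outputs).
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, labels).item(), agreement(s, t).item())
```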
Jun 1, 2021 8 tweets 2 min read
This is a real problem with the way machine learning is often taught: ML seems like a disjoint laundry list of methods and topics to memorize. But in actuality the material is deeply unified... 1/8 From a probabilistic perspective, whether we are doing supervised, semi-supervised, or unsupervised learning, forming our training objective involves starting with an observation model, turning it into a likelihood, introducing a prior, and then taking our log posterior. 2/8
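A minimal worked example of that recipe for regression (my own toy sketch): a Gaussian observation model gives the squared-error likelihood, a Gaussian prior gives L2 regularization, and the negative log posterior is the familiar "loss + weight decay":

```python
import numpy as np

def neg_log_posterior(w, X, y, noise_s=0.1, prior_a=1.0):
    resid = y - X @ w                                # observation model y = Xw + noise
    nll = 0.5 * np.sum(resid**2) / noise_s**2        # Gaussian likelihood -> squared error
    neg_log_prior = 0.5 * np.sum(w**2) / prior_a**2  # Gaussian prior -> L2 / weight decay
    return nll + neg_log_prior                       # objective = loss + regularizer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(neg_log_posterior(np.zeros(3), X, y))          # poor fit, no complexity penalty
print(neg_log_posterior(w_true, X, y))               # much lower objective
```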
Apr 30, 2021 10 tweets 5 min read
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.
arxiv.org/abs/2104.14421
With @Pavel_Izmailov, @sharadvikram, and Matthew D. Hoffman. 1/10 We show that Bayesian neural networks reassuringly provide good generalization, outperforming deep ensembles, standard training, and many approximate inference procedures, even with a single chain. 2/10
Dec 29, 2020 17 tweets 6 min read
There is a lot of often overlooked evidence that standard p(w) = N(0, a*I) priors combined with a NN f(x,w) induce a distribution over functions p(f(x)) with useful properties!... 1/15 The deep image prior shows this p(f(x)) captures low-level image statistics useful for image denoising, super-resolution, and inpainting. The rethinking generalization paper shows pre-processing data with a randomly initialized CNN can dramatically boost performance. 2/15
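A hedged sketch of what p(f(x)) means operationally: draw weights from N(0, a*I) for a (toy) network and evaluate the resulting functions. The work cited uses real architectures such as CNNs rather than this small MLP:

```python
import numpy as np

def sample_prior_function(x, widths=(1, 50, 50, 1), a=1.0, rng=None):
    """One draw from p(f(x)): sample w ~ N(0, a^2 I), then evaluate f(x, w)."""
    rng = rng or np.random.default_rng()
    h = x.reshape(-1, 1)
    layers = list(zip(widths[:-1], widths[1:]))
    for i, (d_in, d_out) in enumerate(layers):
        W = a * rng.normal(size=(d_in, d_out))
        b = a * rng.normal(size=d_out)
        h = h @ W + b
        if i < len(layers) - 1:          # nonlinearity on hidden layers only
            h = np.tanh(h)
    return h.ravel()

x = np.linspace(-3, 3, 200)
prior_draws = [sample_prior_function(x) for _ in range(5)]   # five functions from p(f(x))
```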
Dec 9, 2020 10 tweets 2 min read
In practice, standard "deep ensembles" of independently trained models provide a relatively compelling Bayesian model average. This point is often overlooked because we are used to viewing Bayesian methods as sampling from some (approximate) posterior... 1/10 ...to form a model average, via simple Monte Carlo. But if we instead directly consider what we ultimately want to compute, the integral corresponding to the marginal predictive distribution (the predictive distribution not conditioning on weights)... 2/10
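A hedged sketch of that view (the ensemble members below are hypothetical stand-ins for independently trained networks): the quantity we ultimately want is the marginal predictive p(y|x,D), and averaging the members' predictive distributions is a simple Monte Carlo style estimate of it:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_member(num_features=2, num_classes=3):
    # Hypothetical stand-in for an independently trained network: a random
    # linear classifier followed by a softmax.
    W = rng.normal(size=(num_features, num_classes))
    def predict(x):
        logits = x @ W
        e = np.exp(logits - logits.max())
        return e / e.sum()                     # p(y | x, w_m)
    return predict

members = [make_member() for _ in range(5)]

def ensemble_predictive(models, x):
    """Average p(y|x, w_m) over ensemble members, approximating p(y|x, D)."""
    return np.mean([m(x) for m in models], axis=0)

x = np.array([0.5, -1.0])
print(ensemble_predictive(members, x))
```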
Oct 27, 2020 5 tweets 4 min read
We can greatly simplify Hamiltonian and Lagrangian neural nets by working in Cartesian coordinates with explicit constraints, leading to dramatic performance improvements! Our #NeurIPS2020 paper: arxiv.org/abs/2010.13581
with @m_finzi, @KAlexanderWang. 1/5 Complex dynamics can be described more simply with higher levels of abstraction. For example, a trajectory can be found by solving a differential equation. The differential equation can in turn be derived from a simpler Hamiltonian or Lagrangian, which is easier to model. 2/5
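A hedged sketch of the idea in 2/5: a trajectory is obtained by integrating the differential equation that a (much simpler) Hamiltonian defines. Here H is written by hand for a 1D spring; in the paper H is learned by a neural network and the dynamics live in Cartesian coordinates with explicit constraints:

```python
import numpy as np

def hamiltonian(q, p, m=1.0, k=1.0):
    return p**2 / (2 * m) + 0.5 * k * q**2         # kinetic + potential energy

def vector_field(q, p, m=1.0, k=1.0):
    dH_dq, dH_dp = k * q, p / m
    return dH_dp, -dH_dq                           # Hamilton's equations: (dq/dt, dp/dt)

def integrate(q0, p0, dt=0.01, steps=1000):
    q, p = q0, p0
    traj = []
    for _ in range(steps):                         # leapfrog integration
        p = p + 0.5 * dt * vector_field(q, p)[1]
        q = q + dt * p                             # dq/dt = p/m with m = 1
        p = p + 0.5 * dt * vector_field(q, p)[1]
        traj.append((q, p))
    return np.array(traj)

traj = integrate(q0=1.0, p0=0.0)
print(hamiltonian(*traj[0]), hamiltonian(*traj[-1]))   # energy is (nearly) conserved
```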
May 26, 2020 16 tweets 4 min read
Effective dimension compares favourably to popular path-norm and PAC-Bayes flatness measures, including in double descent and width-depth trade-off settings! We have just posted this new result in section 7 of our paper on posterior contraction in BDL: arxiv.org/abs/2003.02139. 1/16 The plots are most interpretable for comparing models of similar train loss (e.g. above the green partition). N_eff(Hess) = effective dimension of the Hessian at convergence. 2/16
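For reference, the effective dimension of a matrix with eigenvalues λ_i is typically defined as N_eff = Σ_i λ_i / (λ_i + z); a small sketch with a hypothetical Hessian spectrum:

```python
import numpy as np

def effective_dimension(eigenvalues, z=1.0):
    # Directions with large eigenvalues count as "determined" by the data;
    # directions with eigenvalues << z contribute almost nothing.
    lam = np.asarray(eigenvalues)
    return float(np.sum(lam / (lam + z)))

# Hypothetical spectrum: a few large Hessian eigenvalues and many tiny ones,
# as is typical for trained neural networks.
spectrum = np.concatenate([np.array([100.0, 50.0, 10.0]), np.full(997, 1e-3)])
print(effective_dimension(spectrum, z=1.0))   # ~3.9, far below the 1000 parameters
```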
Feb 21, 2020 20 tweets 7 min read
Our new paper "Bayesian Deep Learning and a Probabilistic Perspective of Generalization": arxiv.org/abs/2002.08791. Includes (1) benefits of BMA; (2) BMA <-> Deep Ensembles; (3) new methods; (4) BNN priors; (5) generalization in DL; (6) tempering in BDL. With @Pavel_Izmailov. 1/19 Since neural nets can fit images with noisy labels, it has been suggested we should rethink generalization. But this behaviour is understandable from a probabilistic perspective: we want to support any possible solution, but also have good inductive biases. 2/19
Dec 27, 2019 18 tweets 4 min read
Bayesian methods are *especially* compelling for deep neural networks. The key distinguishing property of a Bayesian approach is marginalization instead of optimization, not the prior, or Bayes rule. This difference will be greatest for underspecified models like DNNs. 1/18 In particular, the predictive distribution we often want to find is p(y|x,D) = \int p(y|x,w) p(w|D) dw. 'y' is an output, 'x' an input, 'w' the weights, and D the data. This is not a controversial equation, it is simply the sum and product rules of probability. 2/18
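In practice the integral is carried out approximately; the standard route (simple Monte Carlo, stated here for illustration rather than as this thread's specific recipe) uses samples from an exact or approximate posterior:

```latex
% Simple Monte Carlo approximation of the Bayesian model average,
% with w_m drawn from an exact or approximate posterior p(w | D).
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
  \;\approx\; \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, w_m),
  \qquad w_m \sim p(w \mid \mathcal{D}).
```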