Andrew Gordon Wilson
Jun 11, 2021 · 10 tweets · 5 min read
Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher.

arxiv.org/abs/2106.05945
With @samscub, @Pavel_Izmailov, @polkirichenko, Alex Alemi. 1/10
We decouple good fidelity (high student-teacher agreement) from good student generalization. 2/10
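For concreteness, fidelity here can be read as something like top-1 agreement between student and teacher predictions on held-out data. A minimal sketch (function and tensor names are illustrative, not from the paper):

```python
import torch

def top1_agreement(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> float:
    """Fraction of inputs on which the student's and teacher's predicted classes coincide."""
    student_pred = student_logits.argmax(dim=-1)
    teacher_pred = teacher_logits.argmax(dim=-1)
    return (student_pred == teacher_pred).float().mean().item()
```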
The conventional narrative is that knowledge distillation "distills knowledge" from a big teacher to a small student through the information in soft labels. However, in actuality, the student is often not much more like the teacher than an independently trained model! 3/10
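As a refresher on the setup, here is a minimal sketch of a standard distillation loss in the spirit of Hinton et al.: a temperature-scaled KL term toward the teacher plus cross-entropy on the hard labels. The temperature T and weight alpha below are illustrative defaults, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """alpha * KL(teacher_soft || student_soft) + (1 - alpha) * cross-entropy on hard labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # T**2 keeps the soft-label gradients on a comparable scale across temperatures
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```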
Does the student not have the capacity to match the teacher? In self-distillation, the student often outperforms the teacher, which is only possible by virtue of failing at the distillation procedure. Moreover, increasing student capacity has little effect on fidelity. 4/10
Does the student not have enough data to match the teacher? Over an extensive collection of augmentation procedures, there is still a big fidelity gap, though some approaches help with generalization. Moreover, what's best for fidelity is often not best for generalization. 5/10
Is it an optimization problem? We find adding more distillation data substantially decreases train agreement. Despite having the lowest train agreement, combined augmentations lead to the best test agreement. 6/10
Can we make optimization easier? We replace BatchNorm with LayerNorm to ensure the student can *exactly* match the teacher, and use a simple data augmentation with the best train agreement. Many more training epochs and different optimizers only lead to minor changes in agreement. 7/10
Is there _anything_ we can do to produce a high fidelity student? In self-distillation the student can in principle match the teacher. We initialize the student with a combination of teacher and random weights. Starting close enough, we can finally recover the teacher. 8/10
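A minimal sketch of that kind of initialization, interpolating between the teacher's weights and a fresh random initialization for a student of the same architecture; the mixing coefficient alpha and the function name are placeholders:

```python
import torch

@torch.no_grad()
def interpolate_init(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float = 0.75):
    """Move a freshly (randomly) initialized student, same architecture as the
    teacher, to alpha * teacher_weights + (1 - alpha) * random_weights, in place."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_s.mul_(1.0 - alpha).add_(alpha * p_t)
    return student
```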
In general deep learning, we are saved by not actually needing to do good optimization: while our training loss is multimodal, properties such as the flatness of good solutions, the inductive biases of the network, and biases of the optimizer enable good generalization. 9/10
In knowledge distillation, however, good fidelity is directly aligned with solving what turns out to be an exceptionally difficult optimization problem. See the paper for many more results! 10/10

More from @andrewgwils

Oct 13, 2023
LLMs aren't just next-word predictors; they are also compelling zero-shot time series forecasters! Our new NeurIPS paper:

w/ @gruver_nate, @m_finzi, @ShikaiQiu
1/7 arxiv.org/abs/2310.07820
Naively using LLMs like GPT-3 for time series extrapolation can fail out of the box because of suboptimal tokenization and preprocessing. We show that if we tokenize numbers to individual digits, LLMs really shine!
2/7
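A rough sketch of the kind of preprocessing this refers to: rendering each number digit by digit (with separators) so a GPT-style tokenizer cannot merge digits into arbitrary multi-digit tokens. The exact formatting below (spaces between digits, commas between time steps, fixed decimal precision) is illustrative rather than the paper's exact scheme:

```python
def encode_series(values, precision=2):
    """Render a numeric series with one token per digit,
    e.g. [12.3, 4.56] -> "1 2 3 0 , 4 5 6" at 2 decimal places."""
    encoded_steps = []
    for v in values:
        digits = f"{abs(v):.{precision}f}".replace(".", "")
        sign = "- " if v < 0 else ""
        encoded_steps.append(sign + " ".join(digits))
    return " , ".join(encoded_steps)

print(encode_series([12.3, 4.56]))  # 1 2 3 0 , 4 5 6
```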
We also show that language models are surprisingly natural probabilistic models of continuous data, acting like hierarchical softmax distributions over numbers when tokenized into individual digits. This allows them to fit challenging distributions common in time series.
3/7
Jul 6, 2023
Last year at ICML, we presented marginal likelihood pathologies in model selection and hyperparameter learning. We now have a 60-page JMLR extension featuring: 1) should we be comforted by connections with PAC-Bayes? 2) approximations; 3) architecture search.

1/16 arxiv.org/abs/2202.11678
To recap, the marginal likelihood answers the question "how likely is my prior to generate the training data?", which is fundamentally different from "will my trained model provide good generalization?", leading to many discrepancies.
2/16
In short, the log marginal likelihood (LML) can underfit, overfit, and heavily penalize diffuse priors that provide good generalization. The decomposition of the LML into a sum of log p(D_i|D<i) suggests a partial remedy, the conditional LML (CLML), removing the first terms. 3/16
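Written out (notation illustrative), the decomposition and the conditional variant are:

```latex
\log p(\mathcal{D} \mid \mathcal{M}) = \sum_{i=1}^{n} \log p(\mathcal{D}_i \mid \mathcal{D}_{<i}, \mathcal{M}),
\qquad
\mathrm{CLML} = \sum_{i=m+1}^{n} \log p(\mathcal{D}_i \mid \mathcal{D}_{<i}, \mathcal{M}),
```

where the first m terms, which score predictions made essentially from the prior, are dropped.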
Feb 24, 2022
The marginal likelihood (evidence) provides an elegant approach to hypothesis testing and hyperparameter learning, but it has fascinating limits as a generalization proxy, with resolutions.

arxiv.org/abs/2202.11678

w/ @LotfiSanae, @Pavel_Izmailov, @g_benton_, @micahgoldblum 1/23
The search for scientific truth is elusive. How do we select between theories which are entirely consistent with any data we observe? The marginal likelihood p(D|M) -- the probability we would generate our observations from our prior model -- provides a compelling approach. 2/23
MacKay's book, Ch. 28, makes a nice case: a simple model can't generate many datasets, but since p(D|M) is a normalized probability density, it gives high probability to the data it can generate. For a given dataset, the most constrained model wins, encoding "Occam's razor". 3/23
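Spelling out the normalization argument in standard notation (not specific to this thread): the marginal likelihood averages the likelihood over the prior, and it sums to one over all possible datasets, so a constrained model that can only generate a few datasets must place more mass on each of them.

```latex
p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid w, \mathcal{M})\, p(w \mid \mathcal{M})\, dw,
\qquad
\sum_{\mathcal{D}} p(\mathcal{D} \mid \mathcal{M}) = 1 .
```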
Jun 23, 2021
Despite its popularity in the covariate shift setting, Bayesian model averaging can surprisingly hurt OOD generalization! arxiv.org/abs/2106.11905 1/5
Suppose for instance there are dead pixels in an image. The weights attached to these pixels don’t affect the predictions, and so MAP (regularized optimization) drives them to zero. A BMA instead samples these weights from the prior... 2/5
...For in-distribution test data this behaviour doesn’t hurt generalization. But now suppose for example some corruption is added to the image. Now the BMA is activating connections that should be dead, hurting OOD generalization! 3/5
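A toy sketch of this mechanism with a linear model and one "dead" input feature; the model, prior scale, and corruption below are all illustrative rather than from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: feature 0 is informative, feature 1 is "dead" (always zero).
X_train = np.stack([rng.normal(size=200), np.zeros(200)], axis=1)
y_train = 2.0 * X_train[:, 0] + 0.1 * rng.normal(size=200)

# MAP with L2 regularization (ridge): the dead weight is driven exactly to zero.
lam = 1.0
w_map = np.linalg.solve(X_train.T @ X_train + lam * np.eye(2), X_train.T @ y_train)

# Crude stand-in for a BMA: the data say nothing about the dead weight,
# so posterior samples for it look like draws from the N(0, 1) prior.
w_samples = np.stack([w_map + np.array([0.0, rng.normal()]) for _ in range(100)])

# OOD test point: corruption activates the dead pixel.
x_ood = np.array([1.0, 3.0])
print("MAP prediction:", x_ood @ w_map)                     # close to 2.0
print("BMA prediction spread:", (w_samples @ x_ood).std())  # large, driven by the dead weight
```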
Jun 1, 2021
This is a real problem with the way machine learning is often taught: ML seems like a disjoint laundry list of methods and topics to memorize. But in actuality the material is deeply unified... 1/8
From a probabilistic perspective, whether we are doing supervised, semi-supervised, or unsupervised learning, forming our training objective involves starting with an observation model, turning it into a likelihood, introducing a prior, and then taking our log posterior. 2/8
Our negative log posterior factorizes as -log p(w|D) = -log p(D|w) - log p(w) + c, where 'w' are parameters we want to estimate, and 'D' is the data. For regression with Gaussian noise, our negative log likelihood is squared error. Laplace noise? We get absolute error. 3/8
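Spelling that out for regression with y_i = f(x_i, w) + eps_i (a standard derivation, with sigma and b the noise scales):

```latex
-\log p(w \mid \mathcal{D}) = -\log p(\mathcal{D} \mid w) - \log p(w) + c

\text{Gaussian noise } \epsilon_i \sim \mathcal{N}(0, \sigma^2):\quad
-\log p(\mathcal{D} \mid w) = \frac{1}{2\sigma^2} \sum_i \bigl(y_i - f(x_i, w)\bigr)^2 + \text{const}

\text{Laplace noise } \epsilon_i \sim \mathrm{Laplace}(0, b):\quad
-\log p(\mathcal{D} \mid w) = \frac{1}{b} \sum_i \bigl|y_i - f(x_i, w)\bigr| + \text{const}
```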
Apr 30, 2021
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.
arxiv.org/abs/2104.14421
With @Pavel_Izmailov, @sharadvikram, and Matthew D. Hoffman. 1/10
We show that Bayesian neural networks reassuringly provide good generalization, outperforming deep ensembles, standard training, and many approximate inference procedures, even with a single chain. 2/10
However, we find that BNNs are surprisingly poor at OOD generalization, even worse than SGD, despite the popularity of approximate inference in this setting, and the relatively good performance of BNNs for OOD detection. 3/10