Lucas Beyer
Researcher (Google DeepMind/Brain in Zürich, ex-RWTH Aachen), Gamer, Hacker, Belgian. Mostly gave up trying mastodon as lb@sigmoid.social

Jun 10, 2021, 11 tweets

So you think you know distillation; it's easy, right?

We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva.

Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)

🧵👇 arxiv.org/abs/2106.05237

This is not a fancy novel method. It's plain old distillation.

But we investigate it thoroughly, for model compression, through the lens of *function matching*.

We highlight two crucial principles that are often missed: consistency and patience. Only the two together give good results!

0. Intuition: we want the student to replicate _the whole function_ represented by the teacher, everywhere we expect data in input space.

This is a much stronger view than the commonly used "the teacher generates better/more informative labels for the data".
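
To make the function-matching view concrete, here is a minimal sketch of the objective in PyTorch (illustrative only, not the paper's actual code; `distill_loss` and `temperature` are names chosen for this sketch): the student is trained to match the teacher's full predictive distribution, not just its argmax label.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student predictive distributions.

    Minimizing this over many input views pushes the student to replicate
    the teacher's *function*, not merely its hard labels.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # "batchmean" gives the mean KL per example; the t**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```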

1. Consistency: to achieve this, teacher and student need to see the same view (crop) of the image. In particular, this means no pre-computed teacher logits! We can generate many more views via mixup.

Other approaches may look good early on, but eventually fall behind the consistent setup.
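
Here is what consistent teaching plus mixup could look like in code, under the same illustrative PyTorch assumptions (`augment`, `consistent_distill_step`, and `mixup_alpha` are names for this sketch; `distill_loss` is the KL sketch above): teacher and student both receive exactly the same randomly cropped, mixed-up batch, which is why teacher outputs cannot be cached offline.

```python
import torch
import torchvision.transforms as T

# Shared augmentation pipeline: random crop + flip, sampled once per step.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
])

def consistent_distill_step(images, teacher, student, mixup_alpha=0.2):
    # One shared random view per image: the SAME tensor goes to both
    # networks, so teacher logits cannot be pre-computed on fixed images.
    views = torch.stack([augment(img) for img in images])

    # Optional mixup: blending pairs of views creates even more input points
    # at which to match the teacher's function.
    lam = torch.distributions.Beta(mixup_alpha, mixup_alpha).sample()
    views = lam * views + (1.0 - lam) * views.flip(0)

    with torch.no_grad():
        teacher_logits = teacher(views)   # same view as the student sees
    student_logits = student(views)
    return distill_loss(student_logits, teacher_logits)
```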

2. Patience: the function-matching task is HARD! We need to train *a lot* longer than usual, and we were not able to reach saturation yet. Overfitting does not happen: when function matching, an "overfit" student is exactly what we want! (Note: with pre-computed teacher logits, we do overfit.)

2b. Excessively long training may mean the optimizer struggles. We try more advanced optimization via Shampoo, and get 4x faster convergence.

We believe this setting is a great test-bed for optimizer research: No concern of overfitting, and reducing training error means generalizing better!
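
As an equally illustrative sketch of the patience part: a very long schedule and no early stopping, with an off-the-shelf optimizer standing in for Shampoo (epoch count and learning rate here are made up for the sketch, not the paper's values; `consistent_distill_step` is the sketch above).

```python
import torch

def train_patiently(student, teacher, loader, epochs=10_000, lr=3e-4):
    # AdamW is a widely available stand-in; the paper reports ~4x faster
    # convergence with Shampoo. No early stopping: since we match functions,
    # lower training loss directly means a better student.
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for images in loader:
            loss = consistent_distill_step(images, teacher, student)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
    return student
```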

3. By distilling a couple of large BiT R152x2 models into a ResNet-50, we get a ResNet-50 on ImageNet that reaches 82.8% at 224px resolution, and 80.5% at 160px! 😎

No "tricks" just plain distillation, patiently matching functions.

4. Importantly, this simple strategy works on many datasets of various sizes, down to only 1020 training images, where everything else we tried overfit horribly.

Be patient, be consistent, that's it. Eventually, you'll reach or outperform your teacher!

2c. We can't stress patience enough. Multiple strategies, for example initializing the student from a pre-trained model, look promising at first, but eventually plateau and are outperformed by patient, consistent function matching.

5. We have a lot more content: MobileNet students, distilling on "random other" data, very thorough baselines, a teacher ensemble, and... BiT download statistics!
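
For the ensemble teacher, one simple (purely illustrative; the paper's exact recipe may differ) way to build a single distillation target out of several teachers is to average their probabilities and match the student against that average:

```python
import torch
import torch.nn.functional as F

def ensemble_teacher_probs(views, teachers):
    # Average the teachers' softmax outputs in probability space; the
    # resulting distribution serves as the target the student matches.
    with torch.no_grad():
        probs = [F.softmax(t(views), dim=-1) for t in teachers]
    return torch.stack(probs).mean(dim=0)
```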

PS: we are working on releasing a bunch of the models, including the best ones... but we're also on vacation. Watch github.com/google-researc… and stay tuned; we're aiming for next week!
