This is not a fancy novel method. It's plain old distillation.
But we investigate it thoroughly for model compression, through the lens of *function matching*.
We highlight two crucial principles that are often missed: Consistency and Patience. Only the two together give good results!
0. Intuition: we want the student to replicate _the whole function_ represented by the teacher, everywhere we expect data in input space.
This is a much stronger view than the commonly held "the teacher generates better/more informative labels for the data". See pic above.
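To make the function-matching view concrete, here is a minimal numpy sketch of the kind of loss it implies: the student matches the teacher's full predicted distribution on the *same* input, with no ground-truth labels involved. The function names and the temperature parameter are illustrative, not the paper's code.

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax, stabilized by subtracting the max.
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def function_matching_loss(teacher_logits, student_logits, tau=1.0):
    # KL(teacher || student), computed on the SAME input view.
    # Ground-truth labels are never used: the student only tries to
    # reproduce the teacher's output distribution everywhere.
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

The loss is zero exactly when the student reproduces the teacher's distribution, which is why "overfitting" this objective is harmless.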
1. Consistency: to achieve this, the teacher and student need to see the same view (crop) of the image. Among other things, this means no pre-computed teacher logits! We can generate many more views via mixup.
Other approaches may look good early on, but eventually fall behind consistency.
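A hedged sketch of what "consistent views" means in practice: one random crop (optionally mixed up with a second image) is sampled, and that single array is fed to both teacher and student. Function names, crop size, and the mixup alpha are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    # Sample ONE crop; this same crop is later fed to both networks.
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def mixup_pair(img_a, img_b, alpha=1.0):
    # Mixup blends two images to create extra input views;
    # the blended image is shown to teacher and student alike.
    lam = rng.beta(alpha, alpha)
    return lam * img_a + (1.0 - lam) * img_b

def make_consistent_view(img_a, img_b, crop=160):
    view = mixup_pair(random_crop(img_a, crop), random_crop(img_b, crop))
    return view  # feed this one array to BOTH teacher and student
```

The key point is what this rules out: pre-computing teacher logits once per image would break consistency, because the student would then see augmented views the teacher never saw.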
2. Patience: the function-matching task is HARD! We need to train *a lot* longer than is typical, and we were not able to reach saturation yet. Overfitting does not happen: when function matching, an "overfit" student is a great one! (Note: with pre-computed teacher logits, we do overfit.)
2b. Excessively long training may mean the optimizer is struggling. We try more advanced optimization via Shampoo and get 4x faster convergence.
We believe this setting is a great test bed for optimizer research: there is no concern of overfitting, and reducing the training error directly means generalizing better!
3. By distilling a couple of large BiT-R152x2 models into a ResNet-50, we get a ResNet-50 on ImageNet that reaches 82.8% at 224px resolution, and 80.5% at 160px! 😎
No "tricks", just plain distillation, patiently matching functions.
4. Importantly, this simple strategy works on many datasets of various sizes, down to just 1020 training images, where everything else we tried overfit horribly.
Be patient, be consistent; that's it. Eventually, you'll match or even outperform your teacher!
2c. We can't stress patience enough. Multiple strategies, for example initializing the student with a pre-trained model (shown here), look promising at first, but eventually plateau and are outperformed by patient, consistent function matching.
5. We have a lot more content: MobileNet students, distilling on "random other" data (shown below), very thorough baselines, a teacher ensemble, and... BiT download statistics!
PS: we are working on releasing a bunch of the models, including the best ones, ... but we're also on vacation. Watch github.com/google-researc… and stay tuned, we're aiming for next week!
1. The scaling laws: it seems that in image classification, too, Transformers follow a power law (i.e., a straight line in log-log), although it saturates at both the upper and lower ends. This holds across datasets, linear eval, fine-tuning, ...
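Since a power law is a straight line in log-log space, fitting its exponent reduces to ordinary least squares on the logs. A small numpy illustration with hypothetical (compute, error) points, not the thread's actual data:

```python
import numpy as np

# Hypothetical points lying exactly on a power law: error = a * compute^b.
compute = np.array([1e2, 1e3, 1e4, 1e5])
error = 0.5 * compute ** -0.25

# log(error) = log(a) + b * log(compute), so a degree-1 fit on the
# logs recovers the exponent b (slope) and scale a (exp of intercept).
b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
a = np.exp(log_a)
```

Saturation at the upper and lower ends shows up as the real data bending away from this straight line in log-log coordinates.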
2. Larger ViTs are more sample-efficient: L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!