Do we still need SGD/Adam to train neural networks? Based on our #NeurIPS2021 paper, we are one step closer to replacing hand-designed optimizers with a single meta-model. Our meta-model can predict parameters for almost any neural network in just one forward pass. (1/n)
For example, our meta-model can predict all ~25M parameters of a ResNet-50, and that ResNet-50 achieves ~60% accuracy on CIFAR-10 without any training. During its own training, the meta-model never observed any architecture similar to ResNet-50. (2/n)
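For readers of the unrolled thread, here is a hedged sketch of what one-shot parameter prediction looks like in PyTorch. The names `load_pretrained_ghn` and `ghn.predict` are illustrative placeholders, not the exact API of the released code; see the repo linked at the end of the thread for the real interface.

```python
import torch
import torchvision

# Untrained target architecture: ResNet-50 with a 10-class head for CIFAR-10.
net = torchvision.models.resnet50(num_classes=10)

# Hypothetical helpers: load the trained meta-model and run one forward pass over
# the computational graph of `net` to fill in all ~25M of its parameters.
ghn = load_pretrained_ghn('cifar10')
with torch.no_grad():
    net = ghn.predict(net)

# `net` can now classify CIFAR-10 images (~60% accuracy) without any gradient-based training.
```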
We can also predict all parameters for ResNet-101, ResNet-152, Wide ResNets, Vision Transformers, you name it. We use the same meta-model for all of them, and it works on ImageNet too. (3/n)
Our meta-model predicts all parameters for a given network in less than 1 second on average, even on a CPU! (4/n)
But there is no free lunch: depending on the architecture, the predicted parameters may not be very accurate (sometimes no better than random). Generally, the further an architecture is from the training distribution (see the green box in the figure), the worse the prediction. (5/n)
But even if the classification accuracy of the network with predicted parameters turns out to be poor, don’t be disappointed. You can still use it as a starting point instead of random initialization and benefit in transfer learning, especially in low-shot tasks. (6/n)
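A minimal sketch of that use case, assuming `net` already carries the predicted parameters (as in the sketch above) and `few_shot_loader` is a placeholder for a small labeled downstream dataset: the predicted weights simply replace random initialization before a short round of ordinary fine-tuning.

```python
import torch
import torch.nn.functional as F

# Fine-tune from the predicted parameters instead of a random init (low-shot transfer).
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
net.train()
for epoch in range(5):                      # a few epochs on the small target dataset
    for images, labels in few_shot_loader:
        opt.zero_grad()
        loss = F.cross_entropy(net(images), labels)
        loss.backward()
        opt.step()
```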
If you are a fan of graph neural networks, you may already guess what’s behind the meta-model we trained. It builds on the amazing earlier work on Graph HyperNetworks by Chris Zhang, Mengye Ren and Raquel Urtasun (arxiv.org/abs/1810.05749). (7/n)
We developed and trained a new model, GHN-2, with better generalization abilities. In short, it is essential to properly normalize the predicted parameters, improve long-range interactions in the graph, and improve convergence by updating the GHN parameters on multiple architectures at once. (8/n)
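The last ingredient, updating the GHN on several architectures per step, can be sketched roughly as follows. Here `ghn`, `sample_architectures`, `predict_params` and `train_loader` are hypothetical placeholders (a standard CIFAR-10 loader for the last one); the actual training loop in the released code is more involved.

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(ghn.parameters(), lr=1e-3)

for images, labels in train_loader:                      # ordinary image batches
    opt.zero_grad()
    archs = sample_architectures(n=8)                    # hypothetical: draw nets from DeepNets-1M
    for net in archs:
        net = predict_params(ghn, net)                   # GHN forward pass fills the parameters
        loss = F.cross_entropy(net(images), labels) / len(archs)
        loss.backward()                                  # gradients flow back into the GHN weights
    opt.step()                                           # one update averaged over all architectures
```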
To train GHN-2, we introduce DeepNets-1M, a dataset of neural architectures with training, validation and testing splits. In addition, it includes out-of-distribution testing splits with wider, deeper, denser and normalization-free networks. (9/n)
Our DeepNets-1M can also be a nice testbed for benchmarking different graph neural networks (GNNs). With our PyTorch code it should be straightforward to plug in any GNN in place of our Gated GNN. (10/n)
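Very roughly, a drop-in replacement for the Gated GNN only needs to map per-node features and the edges of a computational graph to updated node embeddings. The interface below is an illustrative sketch, not the actual ppuda code.

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One round of message passing over a computational graph (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.GRUCell(dim, dim)   # gated node update, in the spirit of a Gated GNN

    def forward(self, node_feats, edges):
        # node_feats: (num_nodes, dim), one embedding per operation in the graph
        # edges: (2, num_edges) LongTensor of (source, target) node indices
        msgs = torch.zeros_like(node_feats)
        msgs.index_add_(0, edges[1], node_feats[edges[0]])   # sum messages from predecessors
        return self.update(msgs, node_feats)                 # new node embeddings
```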
Besides solving our parameter prediction task and serving as a network initializer, GHN-2 can be used for neural architecture search. We searched for the most accurate, most robust (to Gaussian noise), most efficient and easiest-to-train networks. (11/n)
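One way to read that: because prediction is nearly free, candidate architectures can be ranked by the validation accuracy of their predicted parameters before any of them is trained. A hedged sketch, where `sample_architectures`, `ghn.predict` and `val_loader` are placeholders:

```python
import torch

def val_accuracy(net, val_loader):
    net.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            correct += (net(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total

candidates = sample_architectures(n=100)                       # hypothetical candidate pool
scores = [val_accuracy(ghn.predict(net), val_loader) for net in candidates]
best = candidates[max(range(len(scores)), key=scores.__getitem__)]  # train only this one from scratch
```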
The #NeurIPS2021 paper “Parameter Prediction for Unseen Deep Architectures” and the code, together with the dataset, pretrained GHNs and more, are available here: arxiv.org/abs/2110.13100 and github.com/facebookresear…
Work with @michal_drozdzal, Graham Taylor and @adri_romsor (12/n)
