Do we still need SGD/Adam to train neural networks? In our #NeurIPS2021 paper, we take one step closer to replacing hand-designed optimizers with a single meta-model: one that can predict parameters for almost any neural network in just one forward pass. (1/n)
For example, our meta-model can predict all ~25M parameters of a ResNet-50, and that ResNet-50 then achieves ~60% accuracy on CIFAR-10 without any training. During its own training, the meta-model never observed any network close to ResNet-50. (2/n)
We can also predict all parameters of ResNet-101, ResNet-152, Wide-ResNets, Vision Transformers, you name it. We use the same meta-model for all of them, and it works on ImageNet too. (3/n)
Our meta-model predicts all parameters for a given network in less than 1 second on average, even on a CPU! (4/n)
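Roughly, usage looks like this. A minimal sketch: the `GHN2` class, its import path, and the `ghn(model)` call are illustrative stand-ins, see our repo for the exact API.

```python
import time
import torchvision.models as models

# Illustrative sketch: `GHN2` and its import path are stand-ins for the
# released meta-model, not necessarily the exact names in our repo.
from ppuda.ghn.nn import GHN2

ghn = GHN2('cifar10')                      # meta-model trained on DeepNets-1M
model = models.resnet50(num_classes=10)    # an unseen architecture, ~25M params

start = time.time()
model = ghn(model)  # one forward pass predicts ALL parameters of the ResNet-50
print(f'parameters predicted in {time.time() - start:.2f}s')  # ~1s, even on CPU
# `model` can now be evaluated on CIFAR-10 directly, without any training
```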
But there's no free lunch: depending on the architecture, the predicted parameters will not be very accurate (sometimes no better than random). Generally, the further an architecture is from the training distribution (see the green box in the figure), the worse the prediction. (5/n)
But even if the classification accuracy of a network with predicted parameters turns out to be bad, don't be disappointed. You can still use it as a starting point instead of a random initialization and benefit in transfer learning, especially on low-shot tasks. (6/n)
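As a sketch of that workflow: `ghn` is the meta-model from the earlier snippet, and the "low-shot" data here is a synthetic stand-in. Fine-tuning from predicted parameters is just a regular training loop:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Sketch: start from predicted parameters instead of a random init, then
# fine-tune on a small labeled set. `ghn` is the meta-model from the earlier
# snippet; the 100-sample "low-shot" data below is a synthetic stand-in.
model = ghn(models.resnet50(num_classes=10))   # predicted initialization
images = torch.randn(100, 3, 32, 32)           # stand-in low-shot dataset
labels = torch.randint(0, 10, (100,))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for epoch in range(5):                         # a few epochs often suffice
    for i in range(0, len(images), 32):        # mini-batches of 32
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images[i:i + 32]), labels[i:i + 32])
        loss.backward()
        optimizer.step()
```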
If you are a fan of graph neural networks, you may already guess what's behind the meta-model we trained. It's a model based on the amazing prior work on Graph HyperNetworks by Chris Zhang, Mengye Ren and Raquel Urtasun (arxiv.org/abs/1810.05749). (7/n)
We developed and trained a new model, GHN-2, with better generalization abilities. In short, it's essential to properly normalize the predicted parameters, improve long-range interactions in the graph, and improve convergence by updating the GHN's parameters on multiple architectures per step (a "meta-batch"). (8/n)
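To give a feel for meta-batching, here's a toy, self-contained sketch: a tiny hypernetwork over MLPs of varying width, far simpler than our graph-based GHN-2, but the same idea of averaging the loss over several architectures per update.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy illustration of meta-batching: a tiny "hypernetwork" predicts the
# weights of one-hidden-layer MLPs of varying hidden width and is updated
# on several architectures per step. Conceptual sketch only, much simpler
# than the real graph-based GHN-2.
IN, OUT, WIDTHS = 16, 4, [8, 16, 32]
MAX_W = max(WIDTHS)

class TinyHyperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(len(WIDTHS), 32)  # one embedding per width
        self.w1 = nn.Linear(32, MAX_W * IN)       # decodes layer-1 weights
        self.w2 = nn.Linear(32, OUT * MAX_W)      # decodes layer-2 weights

    def forward(self, width_idx, x):
        h = self.emb(torch.tensor(width_idx))
        w = WIDTHS[width_idx]
        # slice decoded weights down to the sampled architecture's width
        W1 = self.w1(h).view(MAX_W, IN)[:w]
        W2 = self.w2(h).view(OUT, MAX_W)[:, :w]
        return F.relu(x @ W1.t()) @ W2.t()        # run the PREDICTED MLP

hyper = TinyHyperNet()
opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)
x = torch.randn(64, IN)                           # synthetic data
y = torch.randint(0, OUT, (64,))

for step in range(100):
    # meta-batching: average the loss over several sampled architectures
    loss = sum(F.cross_entropy(hyper(i, x), y) for i in range(len(WIDTHS)))
    loss = loss / len(WIDTHS)
    opt.zero_grad()
    loss.backward()                               # only the hypernet updates
    opt.step()
```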
To train our GHN-2, we introduced DeepNets-1M, a dataset of neural architectures with training, validation, and testing splits. In addition, we include out-of-distribution testing splits with wider, deeper, denser, and normalization-free networks. (9/n)
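Iterating over the splits is meant to be simple. Illustratively (the import path, loader call, and split names below are assumptions, check the repo for the exact API):

```python
# Illustrative sketch: the import path, loader call, and split names below
# are assumptions; see our repo for the exact API.
from ppuda.deepnets1m.loader import DeepNets1M

for split in ['train', 'val', 'test', 'wide', 'deep', 'dense', 'bnfree']:
    for graph in DeepNets1M.loader(split=split):
        ...  # each item is the computational graph of one architecture
```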
Our DeepNets-1M can also be a nice testbed for benchmarking different graph neural networks (GNNs). With our PyTorch code, it should be straightforward to plug in any GNN in place of our Gated GNN. (10/n)
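For instance, any GNN with roughly this interface should work: it takes the node features and edges of an architecture's computational graph and returns node embeddings, which a decoder then turns into each operation's parameters. A minimal message-passing sketch (not our exact Gated GNN):

```python
import torch
import torch.nn as nn

# Minimal message-passing sketch of the interface we assume (not our exact
# Gated GNN): nodes are the operations of a computational graph, edges are
# the forward-pass connections; the output node embeddings go to a parameter
# decoder that emits each operation's weights.
class SimpleMPNN(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, x, edges, rounds=4):
        # x: (num_nodes, dim) node features; edges: (2, num_edges) index pairs
        src, dst = edges
        for _ in range(rounds):
            m = torch.zeros_like(x)
            m.index_add_(0, dst, self.msg(x)[src])  # aggregate incoming messages
            x = self.upd(m, x)                      # GRU-style node update
        return x                                    # feed to a parameter decoder
```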
Besides solving our parameter prediction task and serving as a network initializer, our GHN-2 can be used for neural architecture search. We searched for the most accurate, most robust (to Gaussian noise), most efficient, and easiest-to-train networks. (11/n)
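The search itself can be as simple as: predict parameters for each candidate in one forward pass, evaluate on held-out data, and rank, with no per-candidate training. A sketch (`ghn`, `candidates`, and `val_loader` are illustrative stand-ins):

```python
import torch

# Sketch of NAS via parameter prediction: rank candidates by the accuracy of
# their PREDICTED parameters, with no per-candidate training. `ghn`,
# `candidates` (a list of nn.Module instances), and `val_loader` are
# illustrative stand-ins.
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total

scores = [(accuracy(ghn(net), val_loader), net) for net in candidates]
best_acc, best_net = max(scores, key=lambda s: s[0])  # winning architecture
```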