NN loss landscapes are full of permutation symmetries, i.e., you can swap any two units in a hidden layer without changing the network's function. What does this mean for SGD? Is this practically useful?
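(For the curious, here's a toy NumPy sketch of mine, not anything from the paper, showing what that symmetry means for a one-hidden-layer ReLU MLP: permute the hidden units consistently and the function doesn't change.)

```python
import numpy as np

# Toy illustration of permutation symmetry in a 1-hidden-layer ReLU MLP.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

perm = rng.permutation(d_hidden)   # any reordering of the hidden units
x = rng.normal(size=d_in)

# Permute rows of W1/b1 and columns of W2 consistently -> identical output.
assert np.allclose(mlp(x, W1, b1, W2), mlp(x, W1[perm], b1[perm], W2[:, perm]))
```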
For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
We show that, once you account for these permutation symmetries, NN loss landscapes effectively contain only a single basin(!), provided sufficient width. Even better, we develop practical algos for navigating into that shared basin...
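To give a flavor of the alignment idea, here's a rough sketch of my own for the single-hidden-layer case; the actual weight-matching algorithm in the paper handles arbitrary depth with a coordinate-descent loop over layers. In this toy case, choosing the permutation reduces to one linear assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(a, b):
    """Find the permutation of B's hidden units that best aligns B with A.

    a, b are (W1, b1, W2) tuples for one-hidden-layer MLPs, as in the sketch above.
    """
    W1a, b1a, W2a = a
    W1b, b1b, W2b = b
    # sim[i, j]: how well B's hidden unit j matches A's hidden unit i,
    # measured through incoming weights, biases, and outgoing weights.
    sim = W1a @ W1b.T + np.outer(b1a, b1b) + W2a.T @ W2b
    _, perm = linear_sum_assignment(sim, maximize=True)
    return perm  # apply to B as W1b[perm], b1b[perm], W2b[:, perm]
```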
Say you train Model A.
Independently, your friend trains Model B, possibly on different data.
With Git Re-Basin, you can merge models A+B in weight space at _no cost to the loss_
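Concretely, "merge" here means: permute B's units to line up with A's, then average the weights. A minimal sketch, reusing the toy `match_hidden_units` from above:

```python
def merge(a, b):
    """Average two one-hidden-layer MLPs in weight space after aligning B to A."""
    W1b, b1b, W2b = b
    perm = match_hidden_units(a, b)                  # from the sketch above
    b_aligned = (W1b[perm], b1b[perm], W2b[:, perm])
    return tuple(0.5 * (wa + wb) for wa, wb in zip(a, b_aligned))
```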
Git Re-Basin applies to any NN arch & we provide the first-ever demonstration of zero-barrier linear mode connectivity between two independently trained (no pre-training!) ResNets.
Put simply: a ResNet loss landscape contains only a single basin & we have the algo to prove it
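("Zero-barrier" = if you linearly interpolate between the aligned weights, the loss along the path never rises above the endpoints' losses. A sketch of how one might check that empirically, where `loss_fn` is a hypothetical function evaluating test loss for a parameter tuple:)

```python
import numpy as np

def loss_barrier(a, b_aligned, loss_fn, n=25):
    """Max increase of loss along the linear path between two aligned models."""
    lams = np.linspace(0.0, 1.0, n)
    losses = np.array([
        loss_fn(tuple((1 - lam) * wa + lam * wb for wa, wb in zip(a, b_aligned)))
        for lam in lams
    ])
    baseline = (1 - lams) * losses[0] + lams * losses[-1]
    return float(np.max(losses - baseline))  # ~0  =>  same basin
```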
Phenomenon #1: "merge-ability" is an emergent property of SGD training -> merging at init doesn't work, but a phase transition occurs during training after which it becomes possible
Phenomenon #2: Model width is intimately related to merge-ability: the wider, the better. Not too burdensome a constraint, since we're all training in the wide/overparameterized regime anyway. Important nonetheless...
Also, not all architectures are equally mergeable: VGGs seem to be harder than ResNets 🤷‍♂️ We hypothesize that merge-ability is an indicator of compatible data/architecture fit.
Finally, my fav result: it's possible to train models on disjoint and biased datasets, then merge them together in weight space.
E.g., you have some data in the US and some in the EU that can't be mixed due to GDPR etc. Train separate models, merge the weights -> they generalize to the combined dataset!
So there ya go: it's possible to mix trained models like mixing potions, no pre-training or fine-tuning necessary.
That said, there are still loads of open questions! I'm very curious to see where work on linear mode connectivity and model patching goes in the future
Also plenty of exciting possible applications to federated learning, distributed training, deep learning optimization, and so forth
Ok, that's enough for one thread... Check out algos, counterexamples, proofs, and more in the paper.
Prediction: @Microsoft will launch an AI assistant product in the next 5 years, built on ChatGPT. It will blow Google Assistant, Amazon Alexa, etc. out of the water
Think of something like those assistants, but with ChatGPT and working with any app running on Windows...
MSFT already has the necessary puzzle pieces in place:
1. Exclusive access to ChatGPT/related tech, thanks to their close partnership with @OpenAI
2. Enough cloud infra and capital to support running ML models for millions of users (@Azure)
3. A rich app and developer ecosystem that they control top to bottom (.NET, the Windows dev ecosystem)
4. Hardware chops (from Surface, etc.) matching or exceeding Google Nest, Alexa, and the rest