AI Pub
AI papers and AI research explained, for technical people. Get hired by the best AI companies: https://t.co/MySVjUGOQ3

Sep 28, 2022, 19 tweets

// Git Re-Basin, Explained (Part I) //

Two weeks ago, researchers discovered a way to "merge" ML models, trained on different datasets, at *no* cost to loss!

They also found that NN loss landscapes effectively contain a single basin.

Why, and how?

Read below:

1/19

The Git Re-Basin paper has two parts:

Part I is about symmetries of neural networks, and how to "align" the weights of two NNs with these symmetries.

Part II shows how to "merge" two models once the weights are aligned, and the limits and implications of merging.

2/19

The starting observation for Git Re-Basin is that neural nets have an *enormous* number of redundant symmetries.

Consider a neural net with a hidden layer consisting of two neurons, A and B.

3/19

If you "swap A with B",

I.e., swap the weights going in and out of A with those going in and out of B,

You get a different neural network - but one that computes the exact same function!

This network with two neurons in the hidden layer has 2! redundant symmetries.
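
A quick numerical check of the swap - a toy sketch with made-up weights, not code from the paper:

import numpy as np

# Toy MLP with one hidden layer of 2 neurons: y = W2 @ relu(W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)    # weights going *into* the hidden layer
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)    # weights going *out of* the hidden layer

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# "Swap A with B": permute the hidden units (rows of W1 and b1, columns of W2)
P = np.array([[0, 1], [1, 0]])                          # permutation matrix swapping the two units
W1p, b1p, W2p = P @ W1, P @ b1, W2 @ P.T

x = rng.normal(size=3)
print(mlp(x, W1, b1, W2, b2))     # same output...
print(mlp(x, W1p, b1p, W2p, b2))  # ...from a "different" set of weights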

4/19

More generally, an n-neuron hidden layer will exhibit n! symmetries.

So by permuting the weights, a given NN has an astronomical number of equivalent descriptions.

Even a shallow multilayer perceptron has far more of these symmetries than there are atoms in the universe!
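
To get a sense of scale - a back-of-the-envelope sketch, where the three hidden layers of 512 units each are just an illustrative assumption:

import math

# Hypothetical MLP with three hidden layers of 512 units each
widths = [512, 512, 512]
total = math.prod(math.factorial(n) for n in widths)

# ~10^3500 equivalent weight settings, vs. roughly 10^80 atoms in the observable universe
print(f"about 10^{len(str(total)) - 1} permutation-equivalent copies of the same network")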

5/19

Why does this matter?

If you train a neural net twice with different random seeds, you'll converge to two different sets of weights, W1 and W2.

If you look at W1 and W2 as lists of numbers, they'll look very different.

6/19

But what if they're "the same" weights, just permuted? What if they describe "the same" neural net?

And if they were "the same" weights, how could you tell?

That's Part I of Git Re-Basin!

7/19

In the paper, the authors introduce three methods to bring the weights of two NNs of the same architecture "into alignment" by permuting weights.

These are:
1) Activation matching
2) Weight matching
3) Straight-through estimator

8/19

They find 2), "weight matching", to be accurate enough for their purposes, and note it runs faster than the other methods by orders of magnitude - only a couple seconds on modern hardware.

So I'll go over that one below - read the paper for the others!

9/19

Take two ML models of the same architecture with different weights, W_A and W_B.

We want to permute the weights of B until W_B is as close as possible to W_A in weight space.

Expanding out the squared distance, this is equivalent to maximizing a sum of inner-product terms (equivalently, cosine similarities, since permuting weights doesn't change their norms):
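
Concretely - a sketch of the paper's weight-matching objective, written in this thread's notation with biases omitted:

\arg\min_{P_1,\dots,P_{L-1}} \sum_{i=1}^{L} \left\| W_A^{(i)} - P_i\, W_B^{(i)}\, P_{i-1}^{\top} \right\|_F^2
  \;=\; \arg\max_{P_1,\dots,P_{L-1}} \sum_{i=1}^{L} \left\langle W_A^{(i)},\; P_i\, W_B^{(i)}\, P_{i-1}^{\top} \right\rangle_F,
  \qquad \text{with } P_0 = P_L = I.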

10/19

(The P^T terms show up since, for every permutation you do on *one* side of a hidden layer, you have to undo it on the *other* side).

Unfortunately, finding permutation matrices P_i that maximize the L-term sum above is an NP-hard problem.

11/19

But what if we proceed in a greedy fashion, permuting one layer at a time?

In that case, all but two terms in the sum are constant with respect to that layer's permutation - and the problem reduces to a "linear assignment problem", for which practical, polynomial-time algorithms exist.
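
For a single layer, that step looks roughly like this - a minimal sketch with made-up shapes, using scipy's linear_sum_assignment as an off-the-shelf LAP solver:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Align the 128 hidden units of layer i in model B to those in model A,
# holding the neighbouring permutations fixed (identity here; shapes are illustrative).
rng = np.random.default_rng(0)
W_A_i,  W_B_i  = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))   # layer i:   64 -> 128
W_A_i1, W_B_i1 = rng.normal(size=(32, 128)), rng.normal(size=(32, 128))   # layer i+1: 128 -> 32

# Only two terms of the sum depend on P_i, and together they reduce to <P_i, cost>:
#   <W_A_i, P_i @ W_B_i>  +  <W_A_i1, W_B_i1 @ P_i^T>
cost = W_A_i @ W_B_i.T + W_A_i1.T @ W_B_i1             # (128, 128) similarity matrix

_, perm = linear_sum_assignment(cost, maximize=True)   # the linear assignment problem

W_B_i_aligned  = W_B_i[perm]        # permute layer i's outgoing units...
W_B_i1_aligned = W_B_i1[:, perm]    # ...and undo the permutation on layer i+1's inputs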

12/19

So that's method 2, weight matching!

Greedily advance from the first layer to the last, permuting each layer's hidden units to solve the linear assignment problem specified by a sum of two matrix products.

The algorithm is orders of magnitude faster than the others; it runs in seconds on modern hardware.
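
Putting the sweep together for a plain MLP - a simplified sketch; the paper's actual algorithm revisits layers (in random order) until no permutation changes, and also handles biases, normalization layers, etc.:

import numpy as np
from scipy.optimize import linear_sum_assignment

def weight_match(Ws_A, Ws_B, n_sweeps=10):
    """Toy weight matching: Ws_A, Ws_B are lists of weight matrices [W_1, ..., W_L]
    for two MLPs of the same architecture (biases ignored for brevity).
    Returns B's weights with hidden units permuted to line up with A."""
    Ws_B = [W.copy() for W in Ws_B]
    L = len(Ws_B)
    for _ in range(n_sweeps):                   # sweep until nothing changes
        changed = False
        for i in range(L - 1):                  # one permutation per hidden layer
            cost = Ws_A[i] @ Ws_B[i].T + Ws_A[i + 1].T @ Ws_B[i + 1]
            _, perm = linear_sum_assignment(cost, maximize=True)
            if not np.array_equal(perm, np.arange(len(perm))):
                changed = True
            Ws_B[i] = Ws_B[i][perm]             # permute layer i's outputs...
            Ws_B[i + 1] = Ws_B[i + 1][:, perm]  # ...and undo it on layer i+1's inputs
        if not changed:
            break
    return Ws_B

After the loop, W_B sits about as close to W_A in weight space as permuting its hidden units can get it - which is what the merging step in Part II builds on.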

13/19

What do you get by running this process?

You get two ML models whose weights are "aligned".

Recall that two models can be functionally equivalent, but have very different weights due to symmetries in weight space.

Git Re-Basin undoes these symmetries to "align" models.

14/19

The real fun kicks in after the models are aligned in weight space, and you can perform operations on them.

That's "merging" the models, the main point of the Git Re-Basin paper.

Will cover that in a separate thread in two days!

15/19

To recap Part I:

1) Wide NNs have an insane number of permutation symmetries
2) Therefore ML models can converge to different but functionally equivalent solutions in weight space
3) The authors find a fast, greedy algorithm to "align" two ML models in weight space by permuting the hidden units within each layer

16/19

A bit about AI Pub:

Last week we launched a talent network to get engineers hired at the best AI companies. 40 members now!

If you're a software engineer, ML engineer, or ML scientist with 2+ YOE, join here: aipub.pallet.com/talent/welcome…

How we select companies, below:

17/19

We also publish regular "explainer" and "paper-walkthrough" threads like the one you just read.

Here's one on scaling laws and DeepMind's famous Chinchilla paper from a couple weeks ago.

Until next time! 👋

18/19

Last of all: read the Git Re-Basin paper here!

Paper: arxiv.org/abs/2209.04836
Code: github.com/samuela/git-re…
Twitter thread by authors (below):

19/19
