AI Pub · Sep 28, 2022 · 19 tweets
// Git Re-Basin, Explained (Part I) //

Two weeks ago, researchers discovered a way to "merge" ML models, trained on different datasets, at *no* cost to loss!

They also found that NN loss landscapes effectively contain a single basin.

Why, and how?

Read below:

1/19
The Git Re-Basin paper has two parts:

Part I is about symmetries of neural networks, and how to "align" the weights of two NNs with these symmetries.

Part II shows how to "merge" two models once the weights are aligned, and the limits and implications of merging.

2/19
The starting observation for Git Re-Basin is that neural nets have an *enormous* number of redundant symmetries.

Consider a neural net with a hidden layer consisting of two neurons, A and B.

3/19
If you "swap A with B",

I.e., swap the weights going in and out of A with those going in and out of B,

You get a different neural network - but one that computes the exact same function!

This network with two neurons in the hidden layer has 2! redundant symmetries.

4/19
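
To make the symmetry concrete, here's a minimal NumPy sketch (my own illustration, not code from the paper) of a one-hidden-layer MLP: permuting the hidden units, i.e. the rows of the first weight matrix and the columns of the second, leaves the network's output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP: x -> relu(W1 @ x + b1) -> W2 @ h + b2, with 4 hidden neurons.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

# "Swap" (permute) the hidden units: reorder rows of W1/b1 and columns of W2.
perm = rng.permutation(4)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2)))  # True
```
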
More generally, an n-neuron hidden layer will exhibit n! symmetries.

So, through weight permutations alone, a given NN has an astronomical number of equivalent descriptions.

Even a shallow multilayer perceptron has far more of these symmetries than there are atoms in the universe!

5/19
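
To put a rough number on it: a single 256-unit hidden layer already admits 256! ≈ 10^507 equivalent orderings of its neurons (a quick Stirling-approximation estimate, not a figure from the paper), versus roughly 10^80 atoms in the observable universe.
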
Why does this matter?

If you train a neural net twice with different random seeds, you'll converge to two different sets of weights, W1 and W2.

If you look at W1 and W2 as lists of numbers, they'll look very different.

6/19
But what if they're "the same" weights, just permuted? What if they describe "the same" neural net?

And if they were "the same" weights, how could you tell?

That's Part I of Git Re-Basin!

7/19
In the paper, the authors introduce three methods to bring the weights of two NNs of the same architecture "into alignment" by permuting weights.

These are:
1) Activation matching
2) Weight matching
3) Straight-through estimator

8/19
They find 2), "weight matching", to be accurate enough for their purposes, and note it runs faster than the other methods by orders of magnitude - only a couple seconds on modern hardware.

So I'll go over that one below - read the paper for the others!

9/19
Take two ML models of the same architecture with different weights, W_A and W_B.

We want to permute the weights of B until W_B is as close as possible to W_A in weight space.

Expanding out the distance, this is equivalent to maximizing a sum of cosine-similarity terms:

10/19
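
The objective itself was attached as an image; from the paper, it has roughly this shape (my transcription, so treat the notation as a sketch), where the P_l are permutation matrices and L is the number of layers:

$$\max_{\pi = \{P_1, \dots, P_{L-1}\}} \; \sum_{l=1}^{L} \left\langle W_A^{(l)},\; P_l\, W_B^{(l)} P_{l-1}^{\top} \right\rangle_F, \qquad P_0 = P_L = I$$

That is: maximize the layer-wise inner products between A's weights and B's permuted weights, with the inputs and outputs left unpermuted.
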
(The P^T terms show up since, for every permutation you do on *one* side of a hidden layer, you have to undo it on the *other* side).

Unfortunately, finding permutation matrices P_i that maximize the L-term sum above is an NP-hard problem.

11/19
But what if we proceed in a greedy fashion, permuting one layer at a time?

In that case, all but two terms in the sum are constant - and we can transform it into a "linear assignment problem" for which practical, polynomial-time algorithms exist.

12/19
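
As a rough illustration (my own sketch, with a made-up helper name match_hidden_layer, not the authors' code), here's what one such linear-assignment step could look like for a single hidden layer using SciPy: the score for pairing A's unit i with B's unit j sums contributions from the weights entering and leaving that layer. In the paper, this per-layer step is repeated over the layers until no permutation changes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_layer(WA_in, WA_out, WB_in, WB_out):
    """Find a permutation of B's hidden units that best aligns them with A's.

    WA_in / WB_in:   weights feeding *into* the hidden layer, shape (n_hidden, n_prev)
    WA_out / WB_out: weights leaving the hidden layer, shape (n_next, n_hidden)
    """
    # score[i, j] = similarity between A's unit i and B's unit j,
    # summing the incoming-weight and outgoing-weight matrix products.
    score = WA_in @ WB_in.T + WA_out.T @ WB_out
    _, col = linear_sum_assignment(score, maximize=True)
    return col  # col[i] = index of B's unit assigned to A's unit i

# Usage sketch: permute B's weights so its hidden units line up with A's.
rng = np.random.default_rng(0)
WA_in, WA_out = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
WB_in, WB_out = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
perm = match_hidden_layer(WA_in, WA_out, WB_in, WB_out)
WB_in_aligned, WB_out_aligned = WB_in[perm], WB_out[:, perm]
```

linear_sum_assignment solves each assignment exactly in polynomial time, which is why the greedy approach is so cheap.
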
So that's method 2, weight matching!

Greedily advance from the first layer to the last, permuting weights to solve the linear assignment problem specified by a sum of two matrix products.

The algorithm is orders of magnitude faster than the others; it runs in seconds on modern hardware.

13/19
What do you get by running this process?

You get two ML models whose weights are "aligned".

Recall that two models can be functionally equivalent, but have very different weights due to symmetries in weight space.

Git Re-Basin undoes these symmetries to "align" models.

14/19
The real fun kicks in after the models are aligned in weight space, and you can perform operations on them.

That's "merging" the models, the main point of the Git Re-Basin paper.

Will cover that in a separate thread in two days!

15/19
To recap Part I:

1) Wide NNs have an insane number of symmetries
2) Therefore ML models can converge to different, but functionally equivalent solutions in weight space
3) The authors find a fast, greedy algorithm to "align" two ML models in weight space by permuting units layer by layer

16/19
A bit about AI Pub:

Last week we launched a talent network to get engineers hired at the best AI companies. 40 members now!

If you're a software engineer, ML engineer, or ML scientist with 2+ YOE, join here: aipub.pallet.com/talent/welcome…

How we select companies, below:

17/19
We also publish regular "explainer" and "paper-walkthrough" threads like the one you just read.

Here's one on scaling laws and DeepMind's famous Chinchilla paper from a couple weeks ago.

Until next time! 👋

18/19
Last of all: read the Git Re-Basin paper here!

Paper: arxiv.org/abs/2209.04836
Code: github.com/samuela/git-re…
Twitter thread by authors (below):

19/19
