With the sequence of noised images x_1, x_2, ..., x_T:
The neural net learns a function f(x,t) that denoises x "a little bit", producing what x would look like at time step t-1.
3/15
To turn pure noise into an HD image, just apply f several times!
The output of a diffusion model really is just
f(f(...f(f(N, T), T-1)..., 2), 1)
where N is pure noise, and T is the number of diffusion steps.
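In code, that nested composition is just a loop. A minimal sketch, assuming some trained denoiser f - names and shapes are illustrative, not the real SD API:

```python
import torch

def sample(f, shape, T):
    """Turn pure noise into an image by applying the denoiser f repeatedly."""
    x = torch.randn(shape)        # N: pure Gaussian noise, i.e. x_T
    for t in range(T, 0, -1):     # the nested f(f(...f(f(N, T), T-1)..., 2), 1)
        x = f(x, t)               # one "little bit" of denoising: x_t -> x_{t-1}
    return x                      # the finished image, x_0

# e.g. sample(f=my_denoiser, shape=(1, 3, 512, 512), T=1000),
# where my_denoiser is whatever network you trained.
```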
The neural net f is typically implemented as a U-net.
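For intuition, here's a toy U-Net-shaped denoiser in PyTorch: downsample, process, upsample, with a skip connection carrying detail across. A sketch of the shape only - the real network is far larger and also consumes a timestep embedding (and, as we'll see, a context):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: encoder halves the resolution, decoder doubles it back,
    and a skip connection carries fine detail across."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.SiLU())
        self.mid  = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU())
        self.up   = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.out  = nn.Conv2d(ch + 3, 3, 3, padding=1)    # skip: concat the input back in

    def forward(self, x, t=None):                         # the toy ignores the timestep t
        h = self.down(x)
        h = self.mid(h)
        h = self.up(h)
        return self.out(torch.cat([h, x], dim=1))         # predict the (slightly) denoised image
```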
4/15
The key idea behind Stable Diffusion:
Training and running a diffusion model on large 512 x 512 images is _incredibly_ slow and expensive.
Instead, let's do the computation on _embeddings_ of images, rather than on images themselves.
5/15
So, Stable Diffusion works in two steps.
Step 1: Use an encoder to compress an image "x" into a lower-dimensional, latent-space representation "z(x)"
Step 2: run diffusion and denoising on z(x), rather than x.
Diagram below!
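Schematically, generation then looks like this. (The denoiser and decoder are placeholders, not the real SD components; Step 1's encoder is what produces z(x) at training time, while at generation time you start from latent noise and only need the decoder at the end.)

```python
import torch

def generate(decoder, f, T, latent_shape):
    """Stable-Diffusion-style generation, schematically:
    all the denoising happens in latent space; pixels appear only at the end."""
    z = torch.randn(latent_shape)     # start from noise *in latent space*
    for t in range(T, 0, -1):
        z = f(z, t)                   # Step 2: denoise the latent z, not the image x
    return decoder(z)                 # map the clean latent back to a 512x512 image
```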
6/15
The latent space representation z(x) has much smaller dimension than the image x.
This makes the _latent_ diffusion model much faster and more expressive than an ordinary diffusion model.
See dimensions from the SD paper:
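Rough numbers for SD v1: a 512 x 512 x 3 image maps to a ~64 x 64 x 4 latent, so each denoising step touches ~48x fewer values:

```python
image_values  = 512 * 512 * 3         # 512x512 RGB image
latent_values = 64 * 64 * 4           # SD v1 latent: 64x64 spatial, 4 channels
print(image_values / latent_values)   # 48.0 -> ~48x fewer values per denoising step
```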
7/15
But where does the text prompt come in?
I lied! SD does NOT learn a function f(x,t) to denoise x a "little bit" back in time.
It actually learns a function f(x, t, y), with y the "context" to guide the denoising of x.
Below, y is the image label "arctic fox".
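In pseudo-code, the only change is one extra argument. (y stands for whatever vector encodes the context; this is an illustrative sketch, not the real SD interface.)

```python
import torch

def sample_conditional(f, shape, T, y):
    """Same denoising loop as before, but every step also sees the context y,
    e.g. an embedding of the label "arctic fox"."""
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        x = f(x, t, y)       # y steers *what* the noise gets denoised into
    return x
```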
8/15
When using Stable Diffusion to make AI art, the "context" y is the text prompt you enter.
The "context" y, alongside the time step t, can be injected into the latent space representation z(x) either by:
1) Simple concatenation 2) Cross-attention
Stable Diffusion uses both.
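Here's roughly what the two mechanisms look like in PyTorch. (Shapes and module names are illustrative, not SD's actual code.)

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 4, 64, 64
z = torch.randn(B, C, H, W)                  # latent-space representation z(x)

# 1) Simple concatenation: for spatially-aligned context (e.g. a mask or
#    segmentation map at latent resolution), just stack extra channels.
y_spatial = torch.randn(B, 1, H, W)          # e.g. a binary mask, downsampled to 64x64
z_in = torch.cat([z, y_spatial], dim=1)      # (B, C+1, H, W) fed into the U-Net

# 2) Cross-attention: for text, queries come from the latent,
#    keys/values from the token embeddings of the prompt.
tokens = torch.randn(B, 77, 768)             # e.g. 77 text-token embeddings
to_q = nn.Linear(C, 64, bias=False)
to_k = nn.Linear(768, 64, bias=False)
to_v = nn.Linear(768, 64, bias=False)
q = to_q(z.flatten(2).transpose(1, 2))       # (B, H*W, 64): one query per latent position
k, v = to_k(tokens), to_v(tokens)            # (B, 77, 64)
attn = torch.softmax(q @ k.transpose(1, 2) / 64 ** 0.5, dim=-1)   # (B, H*W, 77)
z_text = attn @ v                            # (B, H*W, 64): each position "reads" the prompt
```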
10/15
The cool part not talked about on Twitter: the context mechanism is incredibly flexible.
Instead of y = an image label,
Let y = a masked image, or y = a scene segmentation.
SD, trained on this different data, can now do image inpainting and semantic image synthesis!
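For instance, an inpainting-style context can be built from the masked image plus the mask itself. (A hedged sketch, not SD's actual inpainting pipeline.)

```python
import torch

def inpainting_context(image, mask):
    """Build the "context" y for inpainting: the known pixels plus a binary
    mask marking the hole the model should fill in."""
    masked_image = image * (1 - mask)              # zero out the region to repaint
    return torch.cat([masked_image, mask], dim=1)  # stack along the channel axis

# Train/denoise with the exact same f(z, t, y) loop as before --
# only the *meaning* of y changed, so the same architecture learns a new task.
```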
11/15
(The above inpainting gif isn't from Stable Diffusion, FYI. Just an illustration of inpainting.)
Photos from the SD paper illustrating image inpainting and image synthesis, by changing the "context" representation y:
12/15
That's a wrap on Stable Diffusion! If you read the thread carefully, you understand:
1) The full SD architecture below
2) How SD uses latent-space representations
3) How the text prompt is used as "context"
4) How changing the "context" repurposes SD to other tasks.
13/15
If this thread helped you learn about Stable Diffusion, likes, retweets, and follows are appreciated!
In addition to threads like this, I publish a "Best of AI Twitter" thread every week - last week's below.
I help ~25 AI startups recruit top-notch engineers, via the AI Pub Talent Network:
I'm now also helping some of them with their hiring processes.
ML and software engineers: suppose you're invited to interview. Why do you *not* start the hiring process with a company?
1/2
Some reasons that come to mind:
- Not ready / not the right time to leave current role
- Hiring process is long / a PITA
- Cash or equity comp not transparent
- Comp not high enough
- Product, company, or team isn't compelling
Any others?
2/2
Three others that come to mind:
- Don’t want to relocate
- Company isn’t prestigious enough
- Don’t think they’ll pass the interview or get hired (e.g. "I’m not applying for a job at OpenAI b/c it’d be a waste of time")
3/2
Harvey is an OpenAI-backed GPT-4 startup building AI knowledge workers.
They've signed deals with the largest law firms on earth, and are the fastest-growing LLM startup by revenue I know of.
Everything you need to know about Harvey:
1/10
Harvey's first product is a GPT-4 powered AI knowledge worker.
Harvey can:
- Generate long-form legal documents, with niche knowledge of the law
- Answer complex legal questions, leveraging millions of documents
- Create firm-specific models
2/10
In the last two months, Harvey rolled out multi-million dollar contracts with the largest law firms in the world.
With early access to next-gen text models from OpenAI (😉), Harvey can:
- Answer complex legal questions, leveraging millions of documents
- Generate unique work product, with knowledge of niche law
- Learn from lawyer feedback
- Create firm-specific models