AI Pub · Aug 21, 2022 · 16 tweets
// Stable Diffusion, Explained //

You've seen the Stable Diffusion AI art all over Twitter.

But how does Stable Diffusion _work_?

A thread explaining diffusion models, latent space representations, and context injection:

1/15
First, a one-tweet summary of diffusion models (DMs).

Diffusion is the process of repeatedly adding small amounts of random noise to an image. (Left-to-right)

Diffusion models reverse this process, turning noise back into images, bit-by-bit. (Right-to-left)

Photo credit: @AssemblyAI

2/15
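
To make the forward process concrete, here's a minimal PyTorch sketch of one noising step (the standard DDPM-style update; beta is an assumed small noise variance, not SD's actual schedule):

import torch

def diffuse_one_step(x, beta=0.01):
    # One forward-diffusion step: shrink the signal slightly and add
    # Gaussian noise, so the total variance stays bounded.
    noise = torch.randn_like(x)
    return (1 - beta) ** 0.5 * x + beta ** 0.5 * noise

x = torch.rand(3, 64, 64)      # toy "image"
for _ in range(1000):          # many small steps -> pure noise
    x = diffuse_one_step(x)
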
How do DMs turn noise into images?

By training a neural network to do so gradually.

With the sequence of noised images = x_1, x_2, ... x_T,

The neural net learns a function f(x,t) that denoises x "a little bit", producing what x would look like at time step t-1.

3/15
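
As a toy sketch of what f(x, t) looks like in code (a hypothetical two-layer conv net, not SD's actual architecture; the time step is injected as an extra channel for simplicity):

import torch
import torch.nn as nn

class Denoiser(nn.Module):
    # Stand-in for f(x, t): maps a noisy image to a slightly less noisy one.
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # Broadcast t as an extra input channel so the net knows how
        # much noise to expect at this step.
        t_map = torch.full_like(x[:, :1], float(t))
        return self.net(torch.cat([x, t_map], dim=1))

f = Denoiser()
x_t = torch.randn(1, 3, 64, 64)
x_prev = f(x_t, t=10)   # estimate of x one noise step earlier
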
To turn pure noise into an HD image, just apply f several times!

The output of a diffusion model really is just
f(f(...f(f(N, T), T-1)..., 2), 1),
where N is pure noise, and T is the number of diffusion steps.

The neural net f is typically implemented as a U-Net.

4/15
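
Spelled out as a loop (reusing the toy Denoiser f from the sketch above; real samplers have f predict the noise and apply a weighted update, but the nesting is the same idea):

import torch   # reuses f = Denoiser() from the previous sketch

T = 1000
x = torch.randn(1, 3, 64, 64)   # N: pure noise
for t in range(T, 0, -1):       # computes f(...f(f(N, T), T-1)..., 1)
    x = f(x, t)
# x is now the model's generated image
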
The key idea behind Stable Diffusion:

Training and running a diffusion model directly on large 512 x 512 images is _incredibly_ slow and expensive.

Instead, let's do the computation on _embeddings_ of images, rather than on images themselves.

5/15
So, Stable Diffusion works in two steps.

Step 1: Use an encoder to compress an image "x" into a lower-dimensional latent-space representation "z(x)".

Step 2: Run diffusion and denoising on z(x), rather than on x.

Diagram below!

6/15
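
As a toy end-to-end sketch (the encoder/decoder stand in for SD's learned VAE, and denoise_loop for the latent diffusion loop above; all three are placeholders, not SD's real components):

import torch
import torch.nn.functional as F

encoder = lambda x: F.avg_pool2d(x, 8)                # 512x512 -> 64x64 "latent"
decoder = lambda z: F.interpolate(z, scale_factor=8)  # latent -> 512x512
denoise_loop = lambda z: z                            # placeholder for the f-loop

x = torch.rand(1, 3, 512, 512)
z = encoder(x)          # Step 1: compress x to latent z(x)
z = denoise_loop(z)     # Step 2: diffusion/denoising happens here, on z
x_hat = decoder(z)      # decode the latent back to a full-size image
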
The latent space representation z(x) has much smaller dimension than the image x.

This makes the _latent_ diffusion model much faster and more expressive than an ordinary diffusion model.

See dimensions from the SD paper:

7/15
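
The commonly cited numbers for SD v1: a 512 x 512 RGB image has 512·512·3 values, while its latent is 4 channels at 64 x 64, roughly a 48x reduction:

image_dims  = 512 * 512 * 3    # 786,432 values per image
latent_dims = 64 * 64 * 4      # 16,384 values per latent
print(image_dims / latent_dims)   # 48.0 -> the net works on ~48x less data
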
But where does the text prompt come in?

I lied! SD does NOT learn a function f(x,t) to denoise x a "little bit" back in time.

It actually learns a function f(x, t, y), with y the "context" to guide the denoising of x.

Below, y is the image label "arctic fox".

8/15
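
In code, the only change is one extra argument. A stub, with shapes chosen to resemble a text-style context (illustrative, not SD's real interface):

import torch

def f(x, t, y):
    # x: noisy latent, t: time step,
    # y: context embedding (e.g. an embedding of "arctic fox").
    # A real model mixes y into the network; this stub only shows the signature.
    return x  # placeholder

x = torch.randn(1, 4, 64, 64)   # noisy latent
y = torch.randn(1, 77, 768)     # context embedding (assumed shape)
x = f(x, t=1000, y=y)
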
When using Stable Diffusion to make AI art, the "context" y is the text prompt you enter.

That's how the text prompt works.

(Image credit: @ari_seff's video.)

9/15
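
Concretely, SD v1 encodes the prompt with OpenAI's CLIP text encoder. A sketch using the Hugging Face transformers library (assuming it's installed; the checkpoint name is the commonly used CLIP ViT-L/14):

from transformers import CLIPTokenizer, CLIPTextModel

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "an arctic fox in the snow, digital art"
tokens = tok(prompt, padding="max_length", max_length=77, return_tensors="pt")
y = enc(**tokens).last_hidden_state   # shape (1, 77, 768): the context y
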
But how does SD process context?

The "context" y, alongside the time step t, can be injected into the latent space representation z(x) either by:

1) Simple concatenation
2) Cross-attention

Stable Diffusion uses both.

10/15
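
A minimal sketch of the cross-attention injection: the latent's features produce queries, the context y produces keys and values (widths are illustrative, not SD's exact dimensions):

import torch
import torch.nn as nn

d = 320                       # illustrative feature width
to_q = nn.Linear(d, d)        # queries from the image latents
to_k = nn.Linear(768, d)      # keys from the context y
to_v = nn.Linear(768, d)      # values from the context y

z = torch.randn(1, 64 * 64, d)   # flattened latent feature map
y = torch.randn(1, 77, 768)      # context (e.g. the text embedding)

q, k, v = to_q(z), to_k(y), to_v(y)
attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
z = z + attn @ v                 # latents updated with prompt information
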
The cool part not talked about on Twitter: the context mechanism is incredibly flexible.

Instead of y = an image label,

Let y = a masked image, or y = a scene segmentation.

SD, trained on this different data, can now do image inpainting and semantic image synthesis!

11/15
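
A sketch of what the context swap might look like for inpainting, with y built from the masked image and its mask instead of a text embedding (purely illustrative; the shapes and construction are assumptions, not SD's exact recipe):

import torch

image = torch.rand(1, 3, 512, 512)
mask = torch.zeros(1, 1, 512, 512)
mask[..., 200:300, 200:300] = 1.0                   # region to repaint
y = torch.cat([image * (1 - mask), mask], dim=1)    # masked image + mask
# Train the same f(x, t, y) on such pairs, and the identical diffusion
# machinery now learns to fill in the masked region.
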
(The above inpainting gif isn't from Stable Diffusion, FYI. Just an illustration of inpainting.)

Photos from the SD paper illustrating image inpainting and semantic image synthesis, achieved by changing the "context" representation y:

12/15
That's a wrap on Stable Diffusion! If you read the thread carefully, you understand:

1) The full SD architecture below
2) How SD uses latent space representations
3) How the text prompt is used as "context"
4) How changing the "context" repurposes SD to other tasks

13/15
If this thread helped you learn about Stable Diffusion, likes, retweets, and follows are appreciated!

In addition to threads like this, I publish a "Best of AI Twitter" thread every week - last week's below.



14/15
PS: for more info, check out the Stable Diffusion paper: arxiv.org/abs/2112.10752

15/15
PPS: this thread is pretty technical!

Check out these two videos if you want to understand Stable Diffusion at a higher level, in a less technical format:

16/15


