With the sequence of noised images x_1, x_2, ..., x_T:
The neural net learns a function f(x,t) that denoises x "a little bit", producing what x would look like at time step t-1.
3/15
To turn pure noise into an HD image, just apply f several times!
The output of a diffusion model really is just
f(f(...f(f(N, T), T-1)..., 2), 1)
where N is pure noise, and T is the number of diffusion steps.
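In code, that nested composition is just a loop. A minimal sketch, assuming some trained denoiser f - names and shapes are illustrative, not the real SD API:

```python
import torch

def sample(f, shape, T):
    """Turn pure noise into an image by applying the denoiser f repeatedly."""
    x = torch.randn(shape)        # N: pure Gaussian noise, i.e. x_T
    for t in range(T, 0, -1):     # the nested f(f(...f(f(N, T), T-1)..., 2), 1)
        x = f(x, t)               # one "little bit" of denoising: x_t -> x_{t-1}
    return x                      # the finished image, x_0

# e.g. sample(f=my_denoiser, shape=(1, 3, 512, 512), T=1000),
# where my_denoiser is whatever network you trained.
```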
The neural net f is typically implemented as a U-net.
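For intuition, here's a toy U-Net-shaped denoiser in PyTorch: downsample, process, upsample, with a skip connection carrying detail across. A sketch of the shape only - the real network is far larger and also consumes a timestep embedding (and, as we'll see, a context):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: encoder halves the resolution, decoder doubles it back,
    and a skip connection carries fine detail across."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.SiLU())
        self.mid  = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU())
        self.up   = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.out  = nn.Conv2d(ch + 3, 3, 3, padding=1)    # skip: concat the input back in

    def forward(self, x, t=None):                         # the toy ignores the timestep t
        h = self.down(x)
        h = self.mid(h)
        h = self.up(h)
        return self.out(torch.cat([h, x], dim=1))         # predict the (slightly) denoised image
```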
4/15
The key idea behind Stable Diffusion:
Training and running a diffusion model on large 512 x 512 images is _incredibly_ slow and expensive.
Instead, let's do the computation on _embeddings_ of images, rather than on images themselves.
5/15
So, Stable Diffusion works in two steps.
Step 1: Use an encoder to compress an image "x" into a lower-dimensional, latent-space representation "z(x)"
Step 2: run diffusion and denoising on z(x), rather than x.
Diagram below!
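Schematically, generation then looks like this. (The denoiser and decoder are placeholders, not the real SD components; Step 1's encoder is what produces z(x) at training time, while at generation time you start from latent noise and only need the decoder at the end.)

```python
import torch

def generate(decoder, f, T, latent_shape):
    """Stable-Diffusion-style generation, schematically:
    all the denoising happens in latent space; pixels appear only at the end."""
    z = torch.randn(latent_shape)     # start from noise *in latent space*
    for t in range(T, 0, -1):
        z = f(z, t)                   # Step 2: denoise the latent z, not the image x
    return decoder(z)                 # map the clean latent back to a 512x512 image
```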
6/15
The latent space representation z(x) has much smaller dimension than the image x.
This makes the _latent_ diffusion model much faster and more expressive than an ordinary diffusion model.
See dimensions from the SD paper:
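Rough numbers for SD v1: a 512 x 512 x 3 image maps to a ~64 x 64 x 4 latent, so each denoising step touches ~48x fewer values:

```python
image_values  = 512 * 512 * 3         # 512x512 RGB image
latent_values = 64 * 64 * 4           # SD v1 latent: 64x64 spatial, 4 channels
print(image_values / latent_values)   # 48.0 -> ~48x fewer values per denoising step
```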
7/15
But where does the text prompt come in?
I lied! SD does NOT learn a function f(x,t) to denoise x a "little bit" back in time.
It actually learns a function f(x, t, y), with y the "context" to guide the denoising of x.
Below, y is the image label "arctic fox".
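In pseudo-code, the only change is one extra argument. (y stands for whatever vector encodes the context; this is an illustrative sketch, not the real SD interface.)

```python
import torch

def sample_conditional(f, shape, T, y):
    """Same denoising loop as before, but every step also sees the context y,
    e.g. an embedding of the label "arctic fox"."""
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        x = f(x, t, y)       # y steers *what* the noise gets denoised into
    return x
```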
8/15
When using Stable Diffusion to make AI art, the "context" y is the text prompt you enter.
The "context" y, alongside the time step t, can be injected into the latent space representation z(x) either by:
1) Simple concatenation 2) Cross-attention
Stable Diffusion uses both.
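Here's roughly what the two mechanisms look like in PyTorch. (Shapes and module names are illustrative, not SD's actual code.)

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 4, 64, 64
z = torch.randn(B, C, H, W)                  # latent-space representation z(x)

# 1) Simple concatenation: for spatially-aligned context (e.g. a mask or
#    segmentation map at latent resolution), just stack extra channels.
y_spatial = torch.randn(B, 1, H, W)          # e.g. a binary mask, downsampled to 64x64
z_in = torch.cat([z, y_spatial], dim=1)      # (B, C+1, H, W) fed into the U-Net

# 2) Cross-attention: for text, queries come from the latent,
#    keys/values from the token embeddings of the prompt.
tokens = torch.randn(B, 77, 768)             # e.g. 77 text-token embeddings
to_q = nn.Linear(C, 64, bias=False)
to_k = nn.Linear(768, 64, bias=False)
to_v = nn.Linear(768, 64, bias=False)
q = to_q(z.flatten(2).transpose(1, 2))       # (B, H*W, 64): one query per latent position
k, v = to_k(tokens), to_v(tokens)            # (B, 77, 64)
attn = torch.softmax(q @ k.transpose(1, 2) / 64 ** 0.5, dim=-1)   # (B, H*W, 77)
z_text = attn @ v                            # (B, H*W, 64): each position "reads" the prompt
```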
10/15
The cool part not talked about on Twitter: the context mechanism is incredibly flexible.
Instead of y = an image label,
Let y = a masked image, or y = a scene segmentation.
SD, trained on this different data, can now do image inpainting and semantic image synthesis!
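For instance, an inpainting-style context can be built from the masked image plus the mask itself. (A hedged sketch, not SD's actual inpainting pipeline.)

```python
import torch

def inpainting_context(image, mask):
    """Build the "context" y for inpainting: the known pixels plus a binary
    mask marking the hole the model should fill in."""
    masked_image = image * (1 - mask)              # zero out the region to repaint
    return torch.cat([masked_image, mask], dim=1)  # stack along the channel axis

# Train/denoise with the exact same f(z, t, y) loop as before --
# only the *meaning* of y changed, so the same architecture learns a new task.
```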
11/15
(The above inpainting gif isn't from Stable Diffusion, FYI. Just an illustration of inpainting.)
Photos from the SD paper illustrating image inpainting and image synthesis, by changing the "context" representation y:
12/15
That's a wrap on Stable Diffusion! If you read the thread carefully, you understand:
1) The full SD architecture below
2) How SD uses latent-space representations
3) How the text prompt is used as "context"
4) How changing the "context" repurposes SD to other tasks.
13/15
If this thread helped you learn about Stable Diffusion, likes, retweets, and follows are appreciated!
In addition to threads like this, I publish a "Best of AI Twitter" thread every week - last week's below.
I help ~25 AI startups recruit top-notch engineers, via the AI Pub Talent Network:
I'm now also helping some of them with their hiring processes.
ML and software engineers: suppose you're invited to interview. Why do you *not* start the hiring process with a company?
1/2
Some reasons that come to mind:
- Not ready / not the right time to leave current role
- Hiring process is long / a PITA
- Cash or equity comp not transparent
- Comp not high enough
- Product, company, or team isn't compelling
Any others?
2/2
Three others that come to mind:
- Don’t want to relocate
- Company isn’t prestigious enough
- Don’t think they’ll pass the interview or get hired (e.g. "I’m not applying for a job at OpenAI b/c it’d be a waste of time")
3/2
Harvey is an OpenAI-backed GPT-4 startup building AI knowledge workers.
They've signed deals with the largest law firms on earth, and are the fastest-growing LLM startup by revenue I know of.
Everything you need to know about Harvey:
1/10
Harvey's first product is a GPT-4 powered AI knowledge worker.
Harvey can:
- Generate long-form legal documents, with niche knowledge of the law
- Answer complex legal questions, leveraging millions of documents
- Create firm-specific models
2/10
In the last two months, Harvey rolled out multi-million dollar contracts with the largest law firms in the world.
With early access to next-gen text models from OpenAI (😉), Harvey can:
- Answer complex legal questions, leveraging millions of documents
- Generate unique work product, with knowledge of niche law
- Learn from lawyer feedback
- Create firm-specific models