Diffusion models like #DALLE and #StableDiffusion are state of the art for image generation, yet our understanding of them is in its infancy. This thread introduces the basics of how diffusion models work, how we understand them, and why I think this understanding is broken.🧵
Diffusion models are powerful image generators, but they are built on two simple components: a function that degrades images by adding Gaussian noise, and an image restoration network that removes this noise.
We create training data for the restoration network by adding Gaussian noise to clean images. The model accepts a noisy image as input and spits out a cleaned image. We train by minimizing a loss that measures the L1 difference between the original image and the denoised output.
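To make this concrete, here's a rough PyTorch sketch of one training step. The `denoiser` network, `optimizer`, and the `noise_std` level are placeholders I'm assuming, not the exact setup from the thread:

```python
# Minimal sketch: build a noisy/clean training pair and take one
# denoiser training step with an L1 loss.
import torch
import torch.nn.functional as F

def training_step(denoiser, optimizer, clean_images, noise_std=0.5):
    # Degrade clean images with Gaussian noise of a chosen severity.
    noisy_images = clean_images + noise_std * torch.randn_like(clean_images)

    # The network sees only the noisy input and predicts the clean image.
    restored = denoiser(noisy_images)

    # L1 loss between the original image and the denoised output.
    loss = F.l1_loss(restored, clean_images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```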
These denoising nets are quite powerful. In fact, they are so powerful that we can hand them an array of pure noise and they will restore it to an image. Every time we hand it a different noise array, we get back a different image. And there we have it - an image generator!
Err….well…sort of. You may have noticed that this generator doesn't work so well. The image looks really blurry and has no details. This behavior is expected though because the L1 loss function is bad for severe denoising. Here's why...
When a model is trained with severe noise, it can’t tell exactly where edges should be in an image. If it puts an edge in the wrong place, it will incur a large loss. For this reason, it minimizes the loss by smoothing over ambiguous object boundaries and removing fine details.
Of course the severity of this over-smoothing depends on how noisy the training data is. A model trained on mild-noise images like this one can accurately tell where object edges are located. It learns to minimize the loss by restoring sharp edges rather than blurring them out.
So how can we generate good images? First, use a severe noise model to convert pure noise to a blurry image. Then feed this blurry image to a mild-noise model that outputs sharp images. The mild-noise model expects noisy inputs though, so we add noise to the blurry image first.
Here's the process in detail: The denoiser converts pure noise to a blurry image. We then add some noise back to this image, and feed it to a model trained with lower noise levels, which creates a less blurry image. Add some noise back, and denoise again...and again.
We repeat this process using progressively lower noise levels until the noise is zero. We now have a refined output image with sharp edges and features. This iteration process escapes the limitations of the Lp-norm loss on which our models were trained.
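Putting the last three steps together, here's a hedged sketch of that generation loop. I'm assuming a single `denoiser(x, sigma)` conditioned on the noise level (it could equally be a family of models, one per level), and a hand-picked decreasing `noise_levels` schedule:

```python
import torch

def generate(denoiser, noise_levels, image_shape):
    """Sketch of the sampling loop: denoise, re-noise at a lower level, repeat.

    `noise_levels` is a decreasing schedule, e.g. [1.0, 0.8, ..., 0.1].
    """
    # Start from pure noise at the most severe level.
    x = noise_levels[0] * torch.randn(image_shape)

    for sigma, next_sigma in zip(noise_levels, noise_levels[1:]):
        x = denoiser(x, sigma)                    # blurry estimate of a clean image
        x = x + next_sigma * torch.randn_like(x)  # add back a smaller amount of noise

    return denoiser(x, noise_levels[-1])          # final, mild-noise denoising pass
```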
What about those fancy models that make images from text descriptions, like DALLE and GLIDE and Stable Diffusion? These use similar denoising models, but with two inputs. At train time, a clean image is degraded and handed to the denoising model for training, just like usual.
At the same time, a caption describing the image is pushed through a language model and converted to embedded features, which are then provided as an additional input to the denoiser. Training and generation proceed just like before, but with text inputs providing hints.
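Structurally, the text just becomes a second input to the denoiser. A minimal sketch, with hypothetical module names (`image_backbone`, `text_encoder`) standing in for whatever architecture is actually used:

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Sketch of a denoiser with two inputs: a noisy image and text features."""

    def __init__(self, image_backbone, text_encoder):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a U-Net that accepts conditioning
        self.text_encoder = text_encoder      # language model producing embeddings

    def forward(self, noisy_image, caption_tokens):
        text_features = self.text_encoder(caption_tokens)        # caption -> embedded features
        return self.image_backbone(noisy_image, text_features)   # text hints guide denoising
```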
Theoreticians understand diffusion as a method for using noise to explore an image distribution. The denoising step can be interpreted as taking a noisy image and moving it closer to the natural image manifold by gradient ascent on the (log of the) image density function.
When these denoising steps are alternated with steps that add noise, we get a classical process called Langevin Diffusion in which iterates bounce around the image distribution. When this process runs for long enough, the iterates behave like samples from the true distribution.
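Here's a minimal sketch of one Langevin update, assuming a `score_fn` that estimates the gradient of the log image density (in diffusion models, that gradient is what the denoiser implicitly provides):

```python
import torch

def langevin_step(x, score_fn, step_size):
    """One Langevin update: a small move up the log density plus fresh noise."""
    noise = torch.randn_like(x)
    return x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
```

Iterating this step bounces the sample around the distribution; shrinking the step size over time recovers the hot-to-cold schedule described below.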
So why is this understanding broken? Existing theories of diffusion rely strongly on properties of Gaussian noise. They also require a source of randomness in the image generator that slowly sweeps from a “hot” noisy phase to a “cold” deterministic phase.
However, my lab has recently observed that generative models can be built from any image degradation, not just noise. Here's an example in which images are degraded using heavy synthetic snow (from ImageNet-C). By iteratively removing and adding snow, we can restore the image.
Snow and animorphosis (above) are fun curiosities, but in practice we might want diffusion processes for inverting real-world image degradations, like blur, pixelation, desaturation, etc. By swapping out Gaussian noise for arbitrary transforms, we get diffusions that invert almost anything.
These generalized diffusions work great, and yet they violate every existing theory of diffusion, all of which rely strongly on the use of Gaussian noise. Some of these are even “cold” diffusions that require no source of randomness at all. arxiv.org/abs/2208.09392
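The "remove, then re-apply a milder degradation" loop generalizes directly. Here's a hedged sketch with placeholder names (`degrade`, `restorer`, `severities`); it mirrors the description above, not the paper's exact sampler:

```python
def generalized_generate(restorer, degrade, severities, x_start):
    """Sketch of a generalized diffusion with an arbitrary degradation.

    `degrade(x, s)` applies the transform (snow, blur, pixelation, ...) at
    severity s, and `restorer(x, s)` is trained to undo it.
    """
    x = x_start  # a fully degraded input (e.g. heavy snow, or pure noise)
    for s, next_s in zip(severities, severities[1:]):
        x_hat = restorer(x, s)      # estimate the clean image
        x = degrade(x_hat, next_s)  # re-apply a milder version of the degradation
    return restorer(x, severities[-1])
```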
Appendix: If you want to learn more, here’s a reading list that covers diffusion topics.
Why have diffusion models displaced GANs so quickly? Consider the tale of the (very strange) first DALLE model. In 2021, diffusions were almost unheard of, yet the creators of DALLE had already rejected the GAN approach. Here’s why. 🧵
DALLE is an image model, but it was built like a language model. The model trained on image-caption pairs. Captions were encoded as 256 tokens. Images were broken into a 32x32 grid of patches, which were each encoded as a token. All tokens were merged into a single sequence.
A transformer-based "language" model was trained on these sequences, ignoring the fact that some tokens represent text and some represent patches. The model reads in a partial sequence of tokens, and predicts the next token in the sequence.
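In code, the idea is just "concatenate caption tokens and image tokens, then predict the next token." A rough sketch, glossing over the patch tokenizer and using greedy decoding (the real model samples); the `transformer` interface here is hypothetical:

```python
import torch

def build_sequence(caption_token_ids, image_patch_token_ids):
    """Concatenate 256 caption tokens with 32*32 = 1024 image tokens."""
    assert len(caption_token_ids) == 256
    assert len(image_patch_token_ids) == 32 * 32
    return torch.tensor(caption_token_ids + image_patch_token_ids)

def generate_image_tokens(transformer, caption_token_ids, n_image_tokens=1024):
    """At generation time, feed the caption and emit image tokens one at a time."""
    seq = list(caption_token_ids)
    for _ in range(n_image_tokens):
        logits = transformer(torch.tensor(seq).unsqueeze(0))  # predict next token
        next_token = int(torch.argmax(logits[0, -1]))         # greedy pick (sketch only)
        seq.append(next_token)
    return seq[len(caption_token_ids):]                       # the image tokens
```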
SSIM has become a common loss function in computer vision. It is used to train monocular depth models for self-driving cars, invert GANs, and fit NeRF models to training images. The explosion of SSIM-based models raises a fundamental question: what the hell is SSIM? 🧵
SSIM measures the similarity between two images. Humans are insensitive to the absolute brightness/color of pixels, but very sensitive to the location of edges and textures. SSIM mimics human perception by focusing primarily on edge and textural similarities.
Here’s an example. The contrast adjustment between these two images of #IngridDaubechies makes them 20% different when measured using the 2-norm. But in the SSIM metric they are 98.5% similar (1.5% different).
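You can reproduce this kind of comparison with scikit-image. A sketch, assuming a grayscale float image in [0, 1] and an arbitrary 0.8 contrast factor (so the exact percentages above won't be matched):

```python
import numpy as np
from skimage.metrics import structural_similarity

def compare(image):
    """Contrast-adjust an image and compare relative 2-norm difference vs. SSIM."""
    adjusted = np.clip(0.8 * (image - 0.5) + 0.5, 0.0, 1.0)

    # The 2-norm penalizes every per-pixel brightness shift.
    l2_diff = np.linalg.norm(image - adjusted) / np.linalg.norm(image)

    # SSIM focuses on local structure (edges, textures), so it barely changes.
    ssim = structural_similarity(image, adjusted, data_range=1.0)
    return l2_diff, ssim
```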
Just how much have language models grown in the last 4 years? Let's have a look. In 2018, the puny BERT “large” model premiered with a measly 354M parameters. It can be trained on a single 8xA100 node in 5 days. That costs $2K on AWS - almost free by LLM standards! 🧵
Then came Facebook’s equally tiny RoBERTa model. Built on BERT-large, but with mods for faster mixed-precision training, it completed 40 epochs on its beefed up training set using 1000 GPUs for a week. You could train this on the cloud for $350K. NBD.
GPT-3 has a modest but respectable 175B parameters. It was trained with roughly 1500 GPUs for 2 months. On AWS, you could train this for a cool $3M.
"Plug-In" inversion directly produces images from ViTs and CNNs at the pixel level, with no GAN prior. We then see what networks really care about, not just what the GANs want us to see. Here's a few examples. First, I'll pull you in with these tugboats...
My student @aminjuun has been working like a dog on this project. This dog, specifically.
Lately, it feels like there's been a volcano of research on vision transformers. Here's what the ViTs think of that...
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
Frank Rosenblatt's hardware implementation of perceptrons solved very simple OCR problems. After it was proved that shallow perceptrons could not solve certain logic problems, the community soured on this approach, causing the winter of '69.
This caused a turning away from vision problems and towards text systems (e.g. ELIZA) and planning (e.g. A* search). In 1973, James Lighthill wrote a report for the British government claiming that progress on language systems and robotics had stalled, causing the second winter.
If you want to understand why TensorFlow is the way it is, you have to go back to the ancient times. In 2012, Google created a system called DistBelief that laid out their vision for how large-scale training would work. It was the basis for TF. 🧵 research.google/pubs/pub40565/
In DistBelief, both models and datasets were split across nodes. Worker nodes update only a subset of parameters at a time, and communicate parameters asynchronously to a "parameter server". A "coordinator" orchestrates the independent model, data, and parameter nodes.
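Here's a toy, single-process sketch of that communication pattern (not DistBelief's actual API; model sharding and the coordinator are omitted, and `grad_fn`/`data_shard` are placeholders):

```python
import numpy as np

class ParameterServer:
    """One parameter-server shard: holds a slice of the model and applies
    whatever gradients workers push, whenever they happen to arrive."""

    def __init__(self, params, lr=0.01):
        self.params = params  # NumPy array holding this shard's parameters
        self.lr = lr

    def pull(self):
        return self.params.copy()

    def push(self, grads):
        # Updates arrive asynchronously; staleness is simply tolerated.
        self.params -= self.lr * grads

def worker_step(server, data_shard, grad_fn):
    """One worker iteration: pull params, compute gradients on the worker's
    own data shard, push the gradients back."""
    params = server.pull()
    grads = grad_fn(params, data_shard)
    server.push(grads)

# Toy usage: one shard of parameters, one worker pushing a dummy gradient.
server = ParameterServer(np.zeros(4))
worker_step(server, data_shard=None, grad_fn=lambda p, d: np.ones_like(p))
```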
Here's a description of the *simplest* operation mode of this system, taken directly from Jeff's paper.