*Score-based diffusion models*

An emerging approach to generative modelling that is attracting more and more attention.

If you are interested, I collected some introductory material and thoughts in a small thread. 👇

Feel free to weigh in with additional material!

/n
An amazing property of diffusion models is their simplicity.

You define a probabilistic chain that gradually "noises" the input image until only white noise remains.

Generation is then done by learning to reverse this chain. In many cases, the two directions have a similar functional form.
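A minimal PyTorch-style sketch of the forward (noising) chain with the usual Gaussian transitions; the linear beta schedule and all names here are illustrative choices, not taken from any specific paper:

```python
import torch

# Illustrative linear noise schedule (an assumption for this sketch).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # per-step noise variances
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative "signal kept" up to step t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0), i.e. x0 after t noising steps, in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```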

/n
The starting point for diffusion models is probably "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by @jaschasd Weiss @niru_m @SuryaGanguli

Classic paper, definitely worth reading: arxiv.org/abs/1503.03585

/n
A cornerstone in diffusion models is the introduction of "denoising" versions by @hojonathanho @ajayj_ @pabbeel

They showed how to make diffusion models perform close to the state of the art with a suitable reformulation of the training objective.

/n
It turns out that the improved version is also simpler than the original one!

Roughly, it works by adding noise to an image and learning to denoise it (in practice, predicting the added noise).

In this way, training is connected to denoising autoencoders, and sampling remains remarkably simple.
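A hedged sketch of that training step, reusing `q_sample` and `T` from the snippet above; `eps_model(x_t, t)` is a placeholder for any noise-prediction network:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0):
    """Simplified DDPM-style objective: noise x0, then regress the added noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)          # forward chain from the previous sketch
    return F.mse_loss(eps_model(x_t, t), noise)
```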

/n
Denoising diffusion turns out to be similar to "score-based" models, pioneered by @YSongStanford and @StefanoErmon

@YSongStanford has written an outstanding blog post on these ideas, so I'll just skim some of the most interesting connections: yang-song.github.io/blog/2021/scor…

/n
Score-based models work by learning an estimator for the score function of the distribution (i.e., the gradient of the log-density).

Langevin dynamics then lets you sample from p(x) with access only to an estimator of the score function.

Reference paper here: arxiv.org/abs/1907.05600
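A minimal sketch of (unadjusted) Langevin dynamics in PyTorch; the step size, number of steps, and the `score_fn` interface are assumptions for illustration:

```python
import torch

def langevin_sample(score_fn, x, step=1e-4, n_steps=1000):
    """Langevin dynamics: follow the estimated score (gradient of log p)
    plus Gaussian noise; under standard conditions this samples from p(x)."""
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + (step ** 0.5) * torch.randn_like(x)
    return x
```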

/n
Naive score-based models struggle in practice, because sampling starts in low-density regions where the score is poorly approximated.

The solution is noise-conditional score models, which perturb the original data at multiple noise scales and generate samples using "annealed" Langevin dynamics.

arxiv.org/abs/2006.09011
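An illustrative sketch of annealed Langevin sampling with a noise-conditional score `score_fn(x, sigma)`; the per-level step-size rule mimics the NCSN recipe, but all constants here are placeholders:

```python
import torch

def annealed_langevin_sample(score_fn, x, sigmas, steps_per_level=100, base_step=2e-5):
    """Annealed Langevin dynamics: run Langevin updates at each noise level,
    moving from the largest sigma to the smallest (sigmas sorted descending)."""
    for sigma in sigmas:
        step = base_step * (sigma / sigmas[-1]) ** 2   # scale step with noise level
        for _ in range(steps_per_level):
            x = x + 0.5 * step * score_fn(x, sigma) + (step ** 0.5) * torch.randn_like(x)
    return x
```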

/n
Noise-conditional score-based models and denoising diffusion models are almost equivalent; they form essentially a single family of models.

A few additional improvements bring performance competitive with (and often surpassing) BigGAN on complex datasets, as shown by @prafdhar @unixpickle

arxiv.org/abs/2105.05233

/n
Interestingly, when the discrete noise levels are replaced by a continuous-time noise process (a stochastic differential equation), score-based models connect to neural SDEs and continuous normalizing flows!

This was shown in a #ICLR2021 paper by @YSongStanford @jaschasd @dpkingma Kumar @StefanoErmon @poolio
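A hedged Euler–Maruyama sketch of sampling from the corresponding reverse-time SDE; `f`, `g`, and `score_fn` are placeholders for the drift, diffusion, and learned score of whichever forward SDE (VE, VP, ...) is chosen:

```python
import torch

def reverse_sde_sample(score_fn, x, f, g, T_max=1.0, n_steps=1000):
    """Integrate the reverse-time SDE
        dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar
    from t = T_max down to t = 0 with a simple Euler-Maruyama scheme."""
    dt = -T_max / n_steps
    for i in range(n_steps):
        t = T_max + i * dt
        drift = f(x, t) - g(t) ** 2 * score_fn(x, t)
        x = x + drift * dt + g(t) * abs(dt) ** 0.5 * torch.randn_like(x)
    return x
```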

/n
The field is exploding, too many interesting papers to cite!

For example, a recent one by @YSongStanford @conormdurkan @driainmurray @StefanoErmon shows that, with a specific weighting, the score-matching objective upper-bounds the negative log-likelihood, connecting score-based training to maximum likelihood.

arxiv.org/pdf/2101.09258…
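Roughly, the bound has the following form with the "likelihood weighting" $\lambda(t) = g(t)^2$ (a paraphrase; see the paper for the precise statement and conditions):

$$
D_{\mathrm{KL}}\big(p_0 \,\|\, p_\theta\big) \;\le\; \frac{1}{2}\int_0^T \mathbb{E}_{p_t(\mathbf{x})}\Big[\, g(t)^2 \,\big\| \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t) \big\|_2^2 \Big]\, dt \;+\; D_{\mathrm{KL}}\big(p_T \,\|\, \pi\big),
$$

so minimizing the weighted score-matching loss also minimizes an upper bound on the negative log-likelihood (up to a constant).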

/n
Another personal favorite: multinomial diffusion and argmax flows extend diffusion models and normalizing flows to discrete data distributions!

by @emiel_hoogeboom @nielsen_didrik @priyankjaini Forré @wellingmax

arxiv.org/abs/2102.05379

/n
I could go on with my new love, but I'll stop. 🙃

Another nice blog post on score-based models: ajolicoeur.wordpress.com/the-new-conten…

Introductory video by @StefanoErmon:

Lots of code in the blog post by @YSongStanford!

Or you can play w/ github.com/lucidrains/den…

