Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

Jan 23, 2024 • 13 tweets • 4 min read • Read on X

Scrolly

Happy to share a new paper I worked on!:

"Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers"

abs:
website:

A quick thread about the paper! ↓ (1/11) arxiv.org/abs/2401.11605
crowsonkb.github.io/hourglass-diff…

Before I continue, I want to mention this work was led by @RiversHaveWings, @StefanABaumann, @Birchlabs. @DanielZKaplan, @EnricoShippole were also valuable contributors. (2/11)

High-resolution image synthesis w/ diffusion is difficult without using multi-stage models (ex: latent diffusion). It's even more difficult for diffusion transformers due to O(n^2) scaling. So we want an easily scalable transformer arch for high-res image synthesis. (3/11)

That's exactly what we present with Hourglass Diffusion Transformer (HDiT)!

Our hierarchical transformer arch has an O(n) complexity, enabling it scale well to higher resolutions (4/11)

HDiT relies on merging/downsampling and splitting/upsampling operations implemented with Pixel Shuffling & Unshuffling to enable hierarchical processing of the images at different scales.
Skip connections are implemented using learnable linear interp. (5/11)

Our Transformer blocks utilize recent best practices and tricks for Transformers, like RoPE, cosine similarity self-attention, RMSNorm, GeGLU, etc. These Transformer modifications have previously been minimally explored in the context of diffusion. (6/11)

Finally, what enables the O(n) scaling is the use of local self-attention at the higher resolution blocks in HDiT. While Shifted Window attention (Swin) is a very common form of local attention, we instead find that Neighborhood attention performs better (figure from that paper)

Onto the results!

Our comprehensive ablation study demonstrates that our HDiT arch with transformer tricks (GeGLU, RoPE, etc.) and Neighborhood Attention outperforms DiT whilst incurring fewer FLOPs. (8/11)

We train an 85M param HDiT on FFHQ 1024x1024 and obtain a new SOTA for diffusion models...

The FID doesn't beat StyleGANs but note that FID is often biased towards GAN samples.

Qualitatively, the generations look quite good! (9/11)

We also train a 557M param model on ImageNet 256x256 that performs better than DiT-XL/2 and comparable to other SOTA models. (10/11)

Overall, we believe there is significant promise in this architecture for high-resolution image synthesis.

You can give it a try here!:

(11/11)github.com/crowsonkb/k-di…

https://twitter.com/Birchlabs/status/1749623819113800115

Tweet from co-author @Birchlabs who all of you should follow!

https://twitter.com/Birchlabs/status/1749623819113800115

https://twitter.com/RiversHaveWings/status/1749623266749358492

Tweet from @RiversHaveWings, who originally came up with HDiT (follow her too!):

https://twitter.com/RiversHaveWings/status/1749623266749358492

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @iScienceLuvr

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

Mar 27

"A Manga Guide to DeepSeek-V3 Technical Report"

from now on this is how I will post all papers 🤣

Read 5 tweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

Feb 26

Diffusion language models are SO FAST!!

A new startup, Inception Labs, has released Mercury Coder, "the first commercial-scale diffusion large language model"

It's 5-10x faster than current gen LLMs, providing high-quality responses at low costs.

And you can try it now!

The performance is similar to small frontier models while achieving a throughput of ~1000 tokens/sec... on H100s! Reaching this level of throughput for autoregressive LLMs typically requires specialized chips.

It's currently tied for second place on Copilot Arena!

Read 5 tweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

Feb 17

Have you heard of Cleo?

Cleo was an account on Math Stack Exchange that was infamous for dropping the answer to the most difficult integrals with no explanation...

often mere minutes after the question was asked!!

For years, no one knew who Cleo was, UNTIL NOW!

People noticed that the same few people were interacting with Cleo (asking the questions Cleo answered, commenting, etc.), a couple of them only active at the same time as Cleo as well.

People were wondering maybe someone is controlling all these accounts as alts

One of the accounts, Laila Podlesny, had an email address associated with it, and by trying to fake log into the Gmail and obtaining the backup recovery email, someone figured out that Vladimir Reshetnikov was in control of Laila Podlesny.

Based on other ineractions from Vladimir on Math.SE, it seemed likely he controlled Cleo, Laila, and couple other accounts as well.

Read 5 tweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

May 13, 2024

The livestream demo is not the only cool part about GPT-4o

Remember, GPT-4o is an end-to-end trained multimodal model!

No one is reading the GPT-4o blog post which highlights so many other cool features

SEE MORE FEATURES GPT-4o HAS ↓

https://twitter.com/iScienceLuvr/status/1790082827016364071

First of all, GPT-4o is a much better language model. It's SOTA on a variety of LLM benchmarks:

https://twitter.com/iScienceLuvr/status/1790082827016364071

https://twitter.com/sama/status/1790066003113607626

And also good at chat arena evals

https://twitter.com/sama/status/1790066003113607626

Read 11 tweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

May 8, 2024

AlphaFold3 is out!

This a diffusion model pipeline that goes beyond what AlphaFold2 did: predicting the structures of protein-molecule complexes containing DNA, RNA, ions, etc.

Blog post:
Paper:

A quick thread about the method↓blog.google/technology/ai/…
nature.com/articles/s4158…

AlphaFold2 was impactful but had one major limitation: it could only predict structures of proteins by itself.

In reality, proteins have various modifications, bind to other molecules, form complexes w/ DNA, RNA, etc.

Structure of these complexes can't be predicted by AF2

AF3 is similar to AF2, utilizing Template, MSA & Pairformer (similar to Evoformer from AF2) modules

However, amino acid + DNA/RNA/ion/ligand/post-translational modifications can be passed in unlike AF2

Also, the structure is directly generated with a diffusion model (3/11)

Read 12 tweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

Apr 30, 2024

Google announces Med-Gemini, a family of Gemini models fine-tuned for medical tasks! 🔬

Achieves SOTA on 10 of the 14 benchmarks, spanning text, multimodal & long-context applications.

Surpasses GPT-4 on all benchmarks!

This paper is super exciting, let's dive in ↓

The team developed a variety of model variants. First let's talk about the models they developed for language tasks.

The finetuning dataset is quite similar to Med-PaLM2, except with one major difference:

self-training with search

(2/14)

The goal is to improve clinical reasoning and ability to use search results.

Synthetic chain-of-thought w/ and w/o search results in context are generated, incorrect preds are filtered out, the model is trained on those CoT, and then the synthetic CoT is regenerated

(3/14)

Read 15 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Tanishq Mathew Abraham, Ph.D.

Try unrolling a thread yourself!

More from @iScienceLuvr

Tanishq Mathew Abraham, Ph.D.

Tanishq Mathew Abraham, Ph.D.

Tanishq Mathew Abraham, Ph.D.

Tanishq Mathew Abraham, Ph.D.

Tanishq Mathew Abraham, Ph.D.

Tanishq Mathew Abraham, Ph.D.

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!