InstructPix2Pix: Edit an image with text guidance in a single forward pass. Why rely on inversion or other test-time tricks? Just create a dataset using inversion techniques and train a new model.
Editing should be fast when you want to edit an image in real time. Models like Textual Inversion or Prompt-to-Prompt optimize during inference, which makes them slow.
In this paper, the authors cleverly use such techniques to generate the training data and then fine-tune Stable Diffusion to perform edits in a single forward pass. They use two pretrained models to generate the data: GPT-3 (Davinci) and the SD model.
What's the need for GPT, you might wonder! It is hard to manually write a large number of edit instructions, so the authors first create a small dataset (~700 examples) and fine-tune GPT-3 on it, which in turn generates large-scale (>400k) "plausible" edits to the captions.
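Concretely, the fine-tuned GPT-3 maps an input caption to an edit instruction plus an edited caption. A representative triplet (paraphrasing the paper's example) looks like this:

```python
example = {
    "input_caption":    "photograph of a girl riding a horse",
    "edit_instruction": "have her ride a dragon",
    "edited_caption":   "photograph of a girl riding a dragon",
}
```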
Then the authors take the caption pairs and run them through the Prompt-to-Prompt model to generate the original and edited images. Check out this thread to see how Prompt-to-Prompt works.
Now we have a dataset of {instruction, image, edited image}. The authors fine-tune the SD model, adding extra input channels to its first convolutional layer so the model can condition on the original image. The loss objective is shown below; c_I is the original image and c_T is the edit instruction.
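From the paper, this is the standard latent-diffusion objective, now conditioned on both the input image and the instruction:

$$ L = \mathbb{E}_{\mathcal{E}(x),\, \mathcal{E}(c_I),\, c_T,\, \epsilon \sim \mathcal{N}(0,1),\, t} \left[ \left\| \epsilon - \epsilon_\theta\big(z_t, t, \mathcal{E}(c_I), c_T\big) \right\|_2^2 \right] $$

where \mathcal{E} is the latent encoder and z_t is the noised latent of the edited image x at timestep t.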
Another important point: SD uses classifier-free guidance, and we want to control how much "guidance" goes into a generation. In this model there is text as well as image guidance, so the updated score estimate looks like the following.
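From the paper, with null conditioning ∅ and two guidance scales s_I and s_T:

$$ \tilde{\epsilon}_\theta(z_t, c_I, c_T) = \epsilon_\theta(z_t, \varnothing, \varnothing) + s_I \big( \epsilon_\theta(z_t, c_I, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing) \big) + s_T \big( \epsilon_\theta(z_t, c_I, c_T) - \epsilon_\theta(z_t, c_I, \varnothing) \big) $$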
s_I controls similarity with the input image, while s_T controls consistency with the edit instruction. We can see the effect of both these hyperparameters in the figure below.
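A minimal sketch of that two-way guidance in code, assuming a hypothetical eps_model(z_t, t, image_cond, text_cond) noise predictor (note it costs three forward passes per sampling step):

```python
def guided_eps(eps_model, z_t, t, image_cond, text_cond, s_I=1.5, s_T=7.5):
    # None stands in for the null conditioning (∅)
    e_uncond = eps_model(z_t, t, None, None)             # ε(z_t, ∅, ∅)
    e_image  = eps_model(z_t, t, image_cond, None)       # ε(z_t, c_I, ∅)
    e_full   = eps_model(z_t, t, image_cond, text_cond)  # ε(z_t, c_I, c_T)
    return e_uncond + s_I * (e_image - e_uncond) + s_T * (e_full - e_image)
```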
I love this example. Here they show different variants of the Abbey Road album cover.
Failure cases: The model is only as good as its training data, and it inherits the issues of GPT-3 and Prompt-to-Prompt editing, especially around spatial reasoning (e.g., moving an object from left to right).
🎬 Finally got time to go through the "Video Models Are Zero-Shot Learners and Reasoners" paper. The impressive results aside, I want to thank the GDM team for compiling / sharing a wide range of visual tasks, likely to become a key benchmark in the coming years!
This paper also highlights how thorough Google is in terms of evaluations (tbh it's been evident over the years - the Flamingo and Genie papers also eval on an insane number of tasks!). If their eval set is so task-rich, imagine how diverse their training set might've been for training these models. :)
The authors curated 62 tasks, broadly classified into 4 categories: Perception, Modeling, Manipulation and Reasoning. Veo3 isn't the best model out there on any single task, but it's a good generalist that performs reasonably well on most of them without task-specific training (akin to most generalist LLMs circa 2023).
A 🧵 -
All 62 tasks and the respective success rates of Veo3 can be seen in this figure. Some of the tasks are very new to me, like the Dalmatian illusion, conjunctive search, etc. For Rorschach blots - is there even a ground truth? 🤔
On classic vision tasks, Veo3 outperforms or is on par with Nano Banana (which is a bit surprising!) and is significantly better than Veo2.
1/ Let's start with the definition of "replication" in our study. We consider something a copy if it is perceptually very similar to all or a majority of training image patches. In the example below, we consider all the yellow-highlighted matches as potential copies.
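A minimal sketch of how such patch-level matching could be scored, assuming a hypothetical embed(patches) perceptual feature extractor (the study itself evaluates learned copy-detection descriptors for this):

```python
import torch.nn.functional as F

def max_patch_similarity(gen_patches, train_patches, embed):
    """Score a generated image's patches against training patches by
    cosine similarity in the feature space of `embed`."""
    g = F.normalize(embed(gen_patches), dim=-1)    # (num_gen, d)
    t = F.normalize(embed(train_patches), dim=-1)  # (num_train, d)
    sims = g @ t.T                                 # pairwise cosine similarities
    # each generated patch matched to its closest training patch
    return sims.max(dim=1).values                  # (num_gen,)
```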
Retrieval-Augmented #Diffusion (RDM) models: Smaller diffusion models can produce high-quality generations by accessing an external memory that guides the generation. Inspired by DeepMind's RETRO.
If the model can always rely on this external memory, it just has to learn the important parts of the image generation process, such as the composition of scenes, rather than, for example, remembering what different dogs look like.
Setting: X is the training set and D is a *disjoint* image set used for retrieval. θ denotes the parameters of the diffusion model, ξ is the retrieval function which takes in an image and selects "k" images from D, and φ is a pretrained image encoder.
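A minimal sketch of what ξ could look like, assuming the φ embeddings of D are precomputed (the paper uses a CLIP image encoder for φ):

```python
import torch.nn.functional as F

def xi(query_emb, db_embs, db_images, k=4):
    """ξ: return the k nearest neighbors of a query image in D,
    by cosine similarity in φ's embedding space.
    db_embs: precomputed φ(D) embeddings, shape (N, d)."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    db = F.normalize(db_embs, dim=-1)       # (N, d)
    topk = (db @ q).topk(k).indices         # indices of the k most similar
    return [db_images[i] for i in topk]
```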
StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.
T2I models like SD produce aesthetically pleasing generations for a given prompt; however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt, and some objects we want in the picture are missing.
Sometimes the model also gets adjectives mixed up. For example, in the figure below, the prompt is "red car and white sheep", yet the model produced a red sheep too!
The authors address this compositionality issue in this paper.
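A rough sketch of the core idea (not the authors' code): encode each noun phrase from the parse separately, splice each one back into the full prompt encoding at its span, and let cross-attention use these extra value sequences. Here encode_text stands in for the CLIP text encoder, and the (start, end) token spans are assumed to come from the constituency parser:

```python
import torch

def structured_encodings(encode_text, prompt, noun_phrases):
    """noun_phrases: list of (phrase_text, (start, end)) token spans."""
    full = encode_text(prompt)              # (seq_len, d) full-prompt encoding
    variants = [full]
    for phrase, (start, end) in noun_phrases:
        emb = encode_text(phrase)           # encode the phrase in isolation
        v = full.clone()
        v[start:end] = emb[: end - start]   # splice it back at its span
        variants.append(v)
    # the sampler then averages cross-attention outputs over these value sequences
    return torch.stack(variants)
```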
DreamBooth: Assign a rare sequence of tokens as the subject's identifier and fine-tune the diffusion model on a small set of images of the subject. A 🧵
The authors use the Imagen model in this paper, which uses the T5-XXL language model to encode the text guidance, generates a small 64x64 image first, and then uses super-resolution models to blow it up to 1024x1024.
The authors observed that fine-tuning all the modules (including the SR modules) results in the best performance.
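A tiny sketch of the prompt construction, with "sks" as a hypothetical rare identifier (a popular choice in community implementations, not mandated by the paper):

```python
def dreambooth_prompts(identifier="sks", subject_class="dog"):
    # prompt used on the subject's images during fine-tuning
    instance_prompt = f"a photo of {identifier} {subject_class}"
    # plain class prompt, used by the paper's prior-preservation loss
    class_prompt = f"a photo of {subject_class}"
    return instance_prompt, class_prompt
```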