Gowthami
Multimodal research | Past - UMD, MetaAI, Amazon, IIT Madras | Rants, Memes my own.
Sep 26 • 11 tweets • 6 min read
šŸŽ¬ Finally got time to go through the "Video Models Are Zero-Shot Learners and Reasoners" paper. The impressive results aside, I want to thank the GDM team for compiling and sharing such a wide range of visual tasks - this is likely to become a key benchmark in the coming years!
This paper also highlights how thorough Google is about evaluations (tbh it's been evident over the years - the Flamingo and Genie papers also evaluate on an insane number of tasks!). If their eval set is this task-rich, imagine how diverse their training set must have been for these models. :)

The authors curated 62 tasks, broadly classified into 4 categories: Perception, Modeling, Manipulation and Reasoning. Veo3 isn't the best model out there for any single one of these tasks, but it's a good generalist that performs reasonably well on most of them without any task-specific training! (akin to most generalist LLMs circa 2023)
A 🧵 - All 62 tasks and Veo3's success rate on each can be seen in this figure. Some of the tasks are new to me, like the Dalmatian illusion, conjunctive search, etc. For Rorschach blots - is there even a ground truth? šŸ¤”
Jun 5, 2023 • 31 tweets • 8 min read
šŸ“ƒšŸšØ Does your diffusion model copy from the training data? How to find such behavior? Why does it happen? Can we somehow mitigate it?

A summary of recent work on understanding training-data replication in T2I #diffusion models. A long 🧶

#machinelearning #aigeneration Paper links
paper 1 - arxiv.org/abs/2212.03860
paper 2 - arxiv.org/abs/2305.20086
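To make the detection question concrete, here's a minimal sketch (my own simplification, not the papers' pipeline) of how you might flag possible replication: embed a generated image and a set of training images, then look at the nearest-neighbor similarity. The papers use dedicated copy-detection features (e.g., SSCD); CLIP and the file names below are just illustrative stand-ins.

```python
# Sketch: flag possible training-data replication via embedding similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

gen = embed(["generated.png"])                  # hypothetical generated sample
train = embed(["train_0.png", "train_1.png"])   # hypothetical (subsampled) training images

# Cosine similarity of the generation to its closest training image;
# values near 1.0 suggest near-duplication worth inspecting manually.
max_sim = (gen @ train.T).max().item()
print(f"max similarity to training set: {max_sim:.3f}")
```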
Jan 11, 2023 • 18 tweets • 6 min read
Retrieval Augmented #Diffusion (RDM) models: Smaller diffusion models can produce high-quality generations by accessing an external memory that guides the generation. Inspired by DeepMind's RETRO.

A 🧶

Paper: arxiv.org/abs/2204.11824

Day 10 #30daysofDiffusion #MachineLearning

If the model can always rely on this external memory, it only has to learn the important parts of the image generation process, such as the composition of scenes, rather than, for example, remembering what different dogs look like.
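Here's a rough sketch of the retrieval step as I understand it (not the authors' code): embed the prompt, pull the k nearest entries from the external memory, and hand their embeddings to the denoiser as extra conditioning. The tensors below are random stand-ins for a real precomputed CLIP database.

```python
# Sketch: retrieve k neighbors from an external memory to condition generation on.
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_emb, database_embs, k=4):
    """Return indices of the k most similar entries in the external memory."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(database_embs, dim=-1).T
    return sims.topk(k, dim=-1).indices

database_embs = torch.randn(10_000, 512)   # stand-in for precomputed image embeddings
query_emb = torch.randn(1, 512)            # stand-in for the prompt embedding

neighbor_idx = retrieve_neighbors(query_emb, database_embs, k=4)
neighbors = database_embs[neighbor_idx[0]]  # (4, 512)

# These retrieved embeddings are then fed to the (smaller) denoiser via cross-attention,
# so the model itself mainly learns composition rather than object appearance.
```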
Jan 10, 2023 • 13 tweets • 5 min read
StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.

A 🧵

Paper: arxiv.org/abs/2212.05032

Day 9 #30daysofDiffusion #MachineLearning

T2I models like SD produce aesthetically pleasing generations for a given prompt; however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt, and objects we want in the picture go missing.
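A rough sketch of the core idea as I read it (not the official implementation): encode the full prompt and each noun-phrase constituent separately, then let the cross-attention layers see a mix of those encodings instead of only the full-prompt encoding, so every object keeps its own attributes. The noun phrases would normally come from a constituency parser; they're hard-coded here, and the averaging at the end is a simplification of how the paper swaps encodings into the value projections.

```python
# Sketch: encode each noun-phrase constituent separately and mix the encodings.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(text):
    tokens = tokenizer(text, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state  # (1, 77, 768)

prompt = "a red car and a white sheep"
noun_phrases = ["a red car", "a white sheep"]  # would come from a constituency parse

full = encode(prompt)
segments = [encode(np_) for np_ in noun_phrases]

# Simplified mixing of the full-prompt and per-constituent encodings; the paper
# instead substitutes these into the value matrices of the cross-attention layers.
conditioning = torch.stack([full, *segments]).mean(dim=0)
```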
Jan 9, 2023 • 11 tweets • 4 min read
InstructPix2Pix: Edit an image with text guidance in a single forward pass. Why use inversion or other tricks at inference time? Just create a dataset using those inversion techniques and train a new model on it.

A 🧶

Paper: arxiv.org/abs/2211.09800

Day 8 #30daysofDiffusion #Diffusion #MachineLearning

Editing should be fast if you want to edit an image in real time. Methods like Textual Inversion or Prompt-to-Prompt optimize during inference, which makes them slow.
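A minimal usage sketch with the diffusers library (assuming the public timbrooks/instruct-pix2pix checkpoint is what you want to run): a single forward pass per edit, with no per-image inversion or optimization at inference time. The file names are hypothetical.

```python
# Sketch: instruction-based image editing in one forward pass via diffusers.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")  # hypothetical input image
edited = pipe(
    "make it look like a winter scene",   # the edit instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,             # how closely to stick to the input image
).images[0]
edited.save("edited.png")
```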
Jan 2, 2023 • 9 tweets • 4 min read
DreamBooth: Assign a rare sequence of tokens as the subject's identifier and fine-tune the diffusion model on a small set of images of the "subject". A 🧵

Paper: arxiv.org/abs/2208.12242

Day 1 #30daysofDiffusion #Diffusion #MachineLearning

The authors use the Imagen model in this paper, which uses the T5-XXL language model to encode the text guidance, first generates a small 64x64 image, and then uses super-resolution models to blow it up to 1024x1024.
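A simplified sketch of the fine-tuning idea (not the paper's full recipe, which also adds a prior-preservation loss, and the paper fine-tunes Imagen rather than SD): pair every subject photo with a prompt containing a rare identifier token and keep training the denoiser on just those few pairs. The components are assumed to be Stable-Diffusion-style modules from diffusers, purely for illustration.

```python
# Sketch: one DreamBooth-style fine-tuning step on the small subject set.
import torch
import torch.nn.functional as F

identifier = "sks"  # a rare token commonly used as the subject identifier
instance_prompt = f"a photo of {identifier} dog"

def dreambooth_step(unet, vae, text_encoder, scheduler, tokenizer, images):
    """Single training step; unet/vae/text_encoder/scheduler assumed from diffusers."""
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)

    ids = tokenizer([instance_prompt] * latents.shape[0], padding="max_length",
                    max_length=77, truncation=True, return_tensors="pt").input_ids
    cond = text_encoder(ids.to(latents.device))[0]

    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred, noise)  # standard epsilon-prediction objective
```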