How to get URL link on X (Twitter) App
Paper links
If the model can rely on this external memory always, it just has to learn important details about the image generation process such as the composition of scenes rather than, for example, remembering how different dogs look like.
T2I models like SD produce great aesthetically pleasing generations for a given prompt, however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt and some objects we want in the picture are missing.
It should be fast when you want to edit an image in real-time. Models like textual inversion or prompt-to-prompt optimize during inference which makes them slow.
The authors use the Imagen model in this paper which uses T5-XXL language model to encode the text guidance to generate small 64x64 image first and then use a super-resolution model to blow it up to 1024x1024.