NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
abs: arxiv.org/abs/2111.12417
Presents a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks.
NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks.
We selected the subset of datasets (the top 11 rows in Figure 2, below) from The Pile that we found to be of the highest relative quality. Then, following an approach similar to the one used to generate Pile-CC, we downloaded and filtered two recent Common Crawl (CC) snapshots.
We based the evaluation setting on the open-source project lm-evaluation-harness and made task-specific changes as appropriate to align the settings more closely with prior work. We evaluated MT-NLG in zero-, one-, and few-shot settings without searching for the optimal number of shots.
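As a rough illustration of what zero-, one-, and few-shot settings mean in practice, here is a minimal sketch of prompt construction. It is not the lm-evaluation-harness API or the MT-NLG setup; the function name `build_prompt`, the `Q:`/`A:` template, and the toy examples are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str, num_shots: int) -> str:
    """Build a k-shot prompt (hypothetical helper, illustrative template).

    The first `num_shots` solved examples are prepended to the query.
    num_shots=0 is the zero-shot setting, 1 is one-shot, >1 is few-shot.
    """
    if num_shots < 0 or num_shots > len(examples):
        raise ValueError("num_shots must be between 0 and len(examples)")
    # Each solved example is rendered as a question followed by its answer.
    shots = [f"Q: {question}\nA: {answer}" for question, answer in examples[:num_shots]]
    # The query is rendered with an empty answer slot for the model to complete.
    return "\n\n".join(shots + [f"Q: {query}\nA:"])


# Toy data, made up for this sketch.
toy_examples = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8")]

zero_shot = build_prompt(toy_examples, "1 + 6 = ?", num_shots=0)
one_shot = build_prompt(toy_examples, "1 + 6 = ?", num_shots=1)
few_shot = build_prompt(toy_examples, "1 + 6 = ?", num_shots=2)
```

The number of shots is a fixed argument here, which mirrors the idea of picking a shot count up front rather than tuning it per task.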
VQGAN + CLIP "matte painting of a city built on top of a giant turtle walking slowly towards the viewer with clear blue skies and a lush green landscape | trending on artstation" + 3D photo inpainting