Edit a generated image by painting a mask at any location of the image and specifying any text description. Or generate a full image just based on textual input.
2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”
3/
🛠️Two nets: (1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x) which embeds images and C_t(t) which embeds text. (2) a generative network G(z) that is trained to ...
4/ ...to synthesize realistic images given a random z; this network enforces realism.
We generate a realistic image G(z) that matches descriptive text t by optimizing:
z∗ = arg min_z L_sem(z) = arg min_z C(G(z), t)
5/
To focus on changes in a local area, we direct the matching network C to attend to only the region of the user’s brushstroke instead of the whole image. To do this we extract the latent representation w=f(z) of the image and ...
6/ ... and mask it using the user's input and optimize only the masked region of the representation. To match the input textual description: we embed the output image and the text using networks C_i(x) and C_t(t) and maximize the similarity between these embeddings ...
7/ ...by backpropagating the gradients to the masked latent representation w.
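A toy sketch of the masked update described above, in plain Python. Everything here is an assumption made for illustration: the generator and image encoder are identity maps rather than StyleGAN2 and CLIP, the similarity is a negative squared distance rather than CLIP's score, and the names `similarity`, `similarity_grad` and `edit_masked_latent` are hypothetical. It only shows the mechanics: gradients reach w, but only the masked entries move.

```python
def similarity(image_emb, text_emb):
    # Toy stand-in for the similarity of C_i(x) and C_t(t):
    # negative squared distance, so higher means a better match.
    return -sum((a - b) ** 2 for a, b in zip(image_emb, text_emb))


def similarity_grad(w, text_emb):
    # Analytic gradient of the toy similarity w.r.t. w. This is valid only
    # because the toy generator and image encoder are both the identity map.
    return [-2.0 * (a - b) for a, b in zip(w, text_emb)]


def edit_masked_latent(w, mask, text_emb, steps=200, lr=0.1):
    # Gradient ascent on the similarity, applied only where mask == 1.
    w = list(w)
    for _ in range(steps):
        grad = similarity_grad(w, text_emb)
        w = [wi + lr * m * g for wi, m, g in zip(w, mask, grad)]
    return w


w0 = [0.0, 0.0, 0.0, 0.0]            # latent representation w = f(z)
mask = [1, 1, 0, 0]                  # user's brushstroke covers the first half
text_emb = [1.0, -1.0, 1.0, -1.0]    # toy embedding of the text description
w_edit = edit_masked_latent(w0, mask, text_emb)
```

In this toy run the masked entries of `w_edit` move toward the text embedding while the unmasked entries stay at their initial values, which is the locality the mask is meant to provide.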
Here is the loss ablation study: masking the output image vs. masking the latent representation for backprop.
8/
Full image generation:
"Paint by Word" ⚔️vs DALL-E
The proposed method has a simpler architecture than DALL-E, and it does not explicitly train the generator to take a textual description as input. The textual information comes only from the semantic loss.
9/
For G, the authors train a 256-pixel StyleGAN2 on the CUB dataset. For C(x, t), they use an off-the-shelf CLIP model.
The network is trained only on birds and it utterly fails to draw any other type of subject. Because of this narrow focus, it is unsurprising ...
10/ that it might be better at drawing realistic bird images than the DALL-E model, which is trained on a far broader variety of unconstrained images. Nevertheless, this experiment demonstrates that it is possible to obtain state-of-the-art semantic consistency, ...
11/
at least within a narrow image domain, without explicitly training the generator to take information
about the textual concept as input.
More results with the generator G(z) trained on ImageNet or Places:
12/
☑️To conclude, this paper shows that even such a simple method can produce pretty amazing results.
🔥Just train your StyleGAN / BigGAN generator; then, to edit an image region, just optimize the masked latent code using pretrained CLIP as a loss. That's it!
Subscribe to my Telegram channel not to miss other novel paper reviews like this! 😉 t.me/gradientdude
Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
❓How?
Eliminate region-wise prediction and instead meta-learn object localization and classification at image level in a unified and complementary manner.
Specifically, the Meta-DETR first encodes both support and query images into category-specific
features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. ...
2/K
Authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
3/K
2/ It is the largest (afaik) publicly available GPT-3 replica. The primary goal of this project is to replicate a full-sized GPT-3 model and open source it to the public, for free.
The models were trained on an open-source dataset The Pile pile.eleuther.ai which ...
To learn about differences between the two -> thread 👇
1/ The main idea is to factorize the voxel color representation into two independent components: one that depends only on positions p=(x,y,z) of the voxel and one that depends only on the ray directions v.
2/ Essentially you predict K different (R,G,B) values for every voxel and K weighting scalars H_i(v) for each of them:
color(p, v) = RGB_1(p) * H_1(v) + RGB_2(p) * H_2(v) + ... + RGB_K(p) * H_K(v). This is inspired by the rendering equation.
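The weighted sum above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' code: the function name `view_dependent_color` and the example numbers are made up, and the two inputs stand in for the outputs of the position-dependent and direction-dependent parts.

```python
def view_dependent_color(rgb_components, view_weights):
    """Combine K per-voxel (R, G, B) values with K per-direction weights H_i(v)."""
    if len(rgb_components) != len(view_weights):
        raise ValueError("need one weight H_i(v) per (R, G, B) component")
    color = [0.0, 0.0, 0.0]
    for (r, g, b), h in zip(rgb_components, view_weights):
        color[0] += r * h
        color[1] += g * h
        color[2] += b * h
    return tuple(color)


# Hypothetical values with K = 2: the first list depends only on the voxel
# position p, the second only on the ray direction v.
rgb_k = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
h_k = [0.25, 0.75]
color = view_dependent_color(rgb_k, h_k)
```

Since `rgb_k` never looks at the direction and `h_k` never looks at the position, the two can be computed independently and only combined at the end.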
...
... they train a regressor network to predict
the latent code from an input image. To teach the regressor to predict the latent code for images w/ missing pixels they mask random patches during training.
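A minimal sketch of the random patch masking step, in plain Python. The thread does not give patch sizes, counts, or the fill value, so `num_patches`, `patch_size`, zero-filling and the function name `mask_random_patches` are all assumptions made for illustration.

```python
import random


def mask_random_patches(image, num_patches, patch_size, rng):
    """Zero out random square patches of a 2D image (list of rows).

    Returns (masked_image, mask), where mask is 1 for kept pixels and 0 for
    missing ones. Patches may overlap.
    """
    height, width = len(image), len(image[0])
    mask = [[1] * width for _ in range(height)]
    for _ in range(num_patches):
        top = rng.randrange(0, height - patch_size + 1)
        left = rng.randrange(0, width - patch_size + 1)
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                mask[y][x] = 0
    masked = [[px * m for px, m in zip(row, mrow)]
              for row, mrow in zip(image, mask)]
    return masked, mask


rng = random.Random(0)                        # fixed seed for repeatability
image = [[1.0] * 8 for _ in range(8)]         # toy 8x8 single-channel image
masked, mask = mask_random_patches(image, num_patches=2, patch_size=3, rng=rng)
```

During training, the regressor would see `masked` (and possibly `mask`) as input while still being asked to predict the latent code of the full image.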
Now, given an input collage, the regressor projects it into a reasonable...