Spectacular Image Stylization using CLIP and DALL-E

As a Style Transfer Dude, I can say that this is super cool. A statue of David by Michelangelo was used as the input image. It was then morphed towards the styles of different famous artists by steering the latent code towards the embedding of a textual description in CLIP space.
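
To make the idea of steering a latent code towards a text embedding concrete, here is a minimal toy sketch. It is not the code from the Colab: `toy_embed` is a hypothetical stand-in for "generate an image from the latent code, then embed it with CLIP", the weights and vectors are made up, and the gradient is estimated with finite differences instead of backpropagation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_embed(latent, weights):
    # Hypothetical stand-in for generator + image encoder:
    # a fixed linear map from latent space to the shared embedding space.
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def steer_latent(latent, text_embedding, weights, steps=200, lr=0.1, eps=1e-4):
    """Gradient ascent on the cosine similarity between the embedded
    latent code and the text embedding (finite-difference gradient)."""
    latent = list(latent)
    for _ in range(steps):
        base = cosine(toy_embed(latent, weights), text_embedding)
        grad = []
        for i in range(len(latent)):
            bumped = list(latent)
            bumped[i] += eps
            bumped_score = cosine(toy_embed(bumped, weights), text_embedding)
            grad.append((bumped_score - base) / eps)
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent

# Made-up toy values: a 2-D latent code and a 3-D embedding space.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embedding = [1.0, 2.0, 3.0]
start = [1.0, -0.5]

before = cosine(toy_embed(start, weights), text_embedding)
steered = steer_latent(start, text_embedding, weights)
after = cosine(toy_embed(steered, weights), text_embedding)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In a real setup the linear map would be replaced by a pretrained generator followed by the CLIP image encoder, and the target vector by the CLIP embedding of a prompt describing the desired style.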

I especially like the Picasso Cubism result, where it created a half-bull, half-human portrait, which is one of Picasso's typical subjects. The René Magritte stylization is my second favorite.

🤙Colab colab.research.google.com/drive/1oA1fZP7…
2/ Original YouTube video with more results
3/ I discussed similar techniques for image editing here t.me/gradientdude/2… and here t.me/gradientdude/1…

👇Join my Telegram channel to read about them.


Keep Current with Artsiom Sanakoyeu


More from @artsiom_s

8 Apr
ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement🔥

This paper proposes an improved way to project real images into the StyleGAN latent space (which is required for further image manipulations).

🌀 yuval-alaluf.github.io/restyle-encode…

Thread 👇
1/ Instead of directly predicting the latent code of a given real image in a single pass, the encoder is tasked with predicting a residual with respect to the current estimate. The initial estimate is set to the average latent code across the dataset. ...
2/ Inversion is done using multiple forward passes, iteratively feeding the encoder with the output of the previous step along with the original input.

Notably, during inference, ReStyle converges its inversion after a small number of steps (e.g., < 5), ...
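
The iterative residual scheme can be sketched with a toy example. This is an illustration under loud assumptions, not the ReStyle implementation: `toy_generator` and `toy_encoder` are hypothetical hand-written functions standing in for StyleGAN and the trained encoder, and the encoder is defined to recover half of the remaining error per pass so that the refinement is easy to follow.

```python
def toy_generator(w):
    # Hypothetical stand-in for StyleGAN: maps a latent code to an "image".
    return [2.0 * x + 1.0 for x in w]

def toy_encoder(target_image, current_image):
    # Hypothetical stand-in for the trained encoder. It sees the original
    # input and the current reconstruction and predicts a latent residual.
    # By construction it recovers only half of the remaining error per pass.
    return [0.5 * (t - c) / 2.0 for t, c in zip(target_image, current_image)]

def invert(target_image, avg_latent, steps=5):
    """Start from the average latent code and refine it with residuals."""
    w = list(avg_latent)
    history = []
    for _ in range(steps):
        current_image = toy_generator(w)
        residual = toy_encoder(target_image, current_image)
        w = [x + d for x, d in zip(w, residual)]
        history.append(list(w))
    return w, history

true_latent = [0.3, -1.2, 0.8]          # made-up ground truth
target_image = toy_generator(true_latent)
avg_latent = [0.0, 0.0, 0.0]            # made-up "dataset average"

estimate, history = invert(target_image, avg_latent, steps=5)
errors = [max(abs(a - b) for a, b in zip(w, true_latent)) for w in history]
print(errors)
```

Each pass shrinks the gap between the estimate and the true latent code, which mirrors the idea that a few refinement steps can be enough at inference time.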
4 Apr
Self-supervised Learning for Medical images

Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned.
This alignment makes it possible to automatically mine similar pairs of images for training.

The basic idea is to fix K random locations in the unlabeled medical images (K locations are the same for every image) and crop image patches across different images (which correspond to scans of different patients).
Now we create a surrogate classification task by assigning a unique pseudo-label to every location 1...K.
The authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, ...
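
The patch-mining step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, image sizes, and patch size are hypothetical, images are plain nested lists, labels run from 0 to K-1 instead of 1 to K, and the denoising-autoencoder part is not shown.

```python
import random

def sample_locations(height, width, patch, k, rng):
    """Pick K random top-left corners once; they are reused for every image."""
    return [
        (rng.randrange(height - patch + 1), rng.randrange(width - patch + 1))
        for _ in range(k)
    ]

def crop(image, top, left, patch):
    """Cut a patch x patch window out of an image stored as a list of rows."""
    return [row[left:left + patch] for row in image[top:top + patch]]

def build_surrogate_dataset(images, locations, patch):
    """Pair every cropped patch with the index of its location as pseudo-label."""
    dataset = []
    for image in images:
        for label, (top, left) in enumerate(locations):
            dataset.append((crop(image, top, left, patch), label))
    return dataset

# Made-up data: three 8x8 "scans" with distinct pixel values.
images = [
    [[i * 100 + r * 8 + c for c in range(8)] for r in range(8)]
    for i in range(3)
]
rng = random.Random(0)
locations = sample_locations(height=8, width=8, patch=3, k=4, rng=rng)
dataset = build_surrogate_dataset(images, locations, patch=3)
print(len(dataset), "patches with labels", sorted({label for _, label in dataset}))
```

Because the scans are aligned, patches with the same pseudo-label come from roughly the same anatomical region in different patients, which is what makes the location index a usable classification target.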
1 Apr
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery 🔥

Use the CLIP model to guide image editing in StyleGAN with text queries.

📝Paper arxiv.org/abs/2103.17249
⚙️ code github.com/orpatashnik/St…

Thread 👇
1. Take pretrained CLIP, pretrained StyleGAN, and a pretrained ArcFace network for face recognition.
2. Project an input image into a StyleGAN latent vector w_s.
...

3. Now, given a source latent code w_s ∈ W+ and a directive in natural language (a text prompt t), we iteratively minimize the sum of three losses by changing the latent code w:
a) Distance between the image generated by StyleGAN and the text query;
...
29 Mar
Swin Transformer: New SOTA backbone for Computer Vision🔥

👉 What?
A new vision Transformer architecture called Swin Transformer that can serve as a backbone in computer vision instead of CNNs.

📝 arxiv.org/abs/2103.14030
⚒ Code (soon) github.com/microsoft/Swin…

Thread 👇
There are two main problems with using Transformers for computer vision.
1. Existing Transformer-based models have tokens of a fixed scale. However, in contrast to word tokens, visual elements can vary in scale (e.g., objects of varying sizes in an image).
2. Regular self-attention requires a number of operations that is quadratic in the image size, limiting applications in computer vision where high resolution is necessary (e.g., instance segmentation).
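
A quick back-of-the-envelope calculation shows why the quadratic cost in point 2 hurts at high resolution. The 16x16 patch size below is an assumption chosen for illustration (it is not taken from the thread), and the count covers only token pairs in one global self-attention layer.

```python
def attention_pairs(height, width, patch=16):
    """Number of tokens and of token-to-token pairs for global self-attention
    when the image is split into non-overlapping patch x patch tokens."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

small_tokens, small_pairs = attention_pairs(224, 224)
large_tokens, large_pairs = attention_pairs(448, 448)

print(small_tokens, small_pairs)   # 196 tokens, 38416 pairs
print(large_tokens, large_pairs)   # 784 tokens, 614656 pairs
print(large_pairs // small_pairs)  # 16
```

Doubling the side length multiplies the number of tokens by 4 and the number of attention pairs by 16.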
23 Mar
🔥New DALL-E? Paint by Word 🔥

Edit a generated image by painting a mask at any location of the image and specifying any text description. Or generate a full image based only on textual input.

2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”
🛠️Two nets:
(1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x) which embeds images and C_t(t) which embeds text.
(2) a generative network G(z) that is trained to ...
23 Mar
Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning

Eliminate region-wise prediction and instead meta-learn object localization and classification at image level in a unified and complementary manner.


1/K ...👇
Specifically, Meta-DETR first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. ...
The authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
