More from @artsiom_s

Artsiom Sanakoyeu

@artsiom_s

6 Apr

Spectacular Image Stylization using CLIP and DALL-E

As a Style Transfer Dude, I can say that this is super cool. A statue of David by Michelangelo was used as an input image. Then it was morphed towards different styles of famous artists by steering the latent code towards...
👇

1/..towards the embeddings of a textual description in CLIP space
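
A minimal sketch of what "steering the latent code towards a text embedding" can look like, under heavy assumptions: the real pipeline would use the DALL-E decoder and the CLIP encoders, while here both are replaced by a hypothetical fixed linear map (`embed`), and the gradient is estimated with finite differences instead of autograd. The names `embed`, `steer`, `weights`, and all numbers are illustrative and not taken from the Colab.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(latent, weights):
    """Toy stand-in for 'decode the latent to an image, then encode the image'."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def steer(latent, weights, text_embedding, steps=200, lr=0.1, eps=1e-4):
    """Gradient ascent on cosine similarity; gradient via forward differences."""
    latent = list(latent)
    for _ in range(steps):
        base = cosine(embed(latent, weights), text_embedding)
        grad = []
        for i in range(len(latent)):
            bumped = latent[:i] + [latent[i] + eps] + latent[i + 1:]
            grad.append((cosine(embed(bumped, weights), text_embedding) - base) / eps)
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent

# Hypothetical 3-dimensional setup: all values are placeholders.
weights = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
text_embedding = [0.0, 1.0, 0.0]   # plays the role of the embedded text prompt
start = [1.0, 0.0, 0.0]            # plays the role of the input image's latent code

before = cosine(embed(start, weights), text_embedding)
steered = steer(start, weights, text_embedding)
after = cosine(embed(steered, weights), text_embedding)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The only point of the toy is the loop structure: the image is never edited directly, only the latent code is nudged so that its embedding moves closer to the text embedding.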

I especially like Picasso's Cubism, where it created a half-bull, half-human portrait, which is one of Picasso's typical subjects. The Rene Magritte stylization is my second favorite.

🤙Colab colab.research.google.com/drive/1oA1fZP7…

2/ Original YouTube video with more results


Artsiom Sanakoyeu

@artsiom_s

4 Apr

Self-supervised Learning for Medical images

Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned.
This alignment makes it possible to automatically mine similar pairs of images for training.

arxiv.org/abs/2102.10680

1/
The basic idea is to fix K random locations in the unlabeled medical images (the same K locations for every image) and crop image patches at those locations across different images (which correspond to scans of different patients).
...

2/
Now we create a surrogate classification task by assigning a unique pseudo-label 1...K to every location.
The authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, ...
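
The patch-mining step can be sketched as follows. This is only an illustration of "same K locations for every image, pseudo-label = location index"; the function names, image sizes, patch size, and K are assumptions, images are plain nested lists instead of real scans, and the perturbation and denoising-autoencoder parts are left out.

```python
import random

def sample_locations(height, width, patch, k, seed=0):
    """Pick K top-left corners once; the same corners are reused for every image."""
    rng = random.Random(seed)
    return [(rng.randrange(height - patch + 1), rng.randrange(width - patch + 1))
            for _ in range(k)]

def crop(image, top, left, patch):
    """Crop a patch x patch window from an image stored as a list of rows."""
    return [row[left:left + patch] for row in image[top:top + patch]]

def build_surrogate_dataset(images, locations, patch):
    """Return (patch, pseudo_label) pairs; the label is the location index 1..K."""
    dataset = []
    for image in images:
        for label, (top, left) in enumerate(locations, start=1):
            dataset.append((crop(image, top, left, patch), label))
    return dataset

# Toy "scans": three 8x8 images filled with placeholder intensities.
images = [[[(p + r * 8 + c) % 256 for c in range(8)] for r in range(8)]
          for p in range(3)]
locations = sample_locations(8, 8, patch=4, k=5)
dataset = build_surrogate_dataset(images, locations, patch=4)
print(len(dataset), "patches with labels", sorted({label for _, label in dataset}))
```

Because the scans are assumed to be aligned, patches that share a label come from roughly the same anatomical region of different patients, which is what makes the pseudo-labels meaningful.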


Artsiom Sanakoyeu

@artsiom_s

1 Apr

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery 🔥

Use the CLIP model to guide image editing in StyleGAN with text queries.

📝Paper arxiv.org/abs/2103.17249
⚙️ code github.com/orpatashnik/St…

Thread 👇

1/
🛠️How?
1. Take a pretrained CLIP, a pretrained StyleGAN, and a pretrained ArcFace network for face recognition.
2. Project an input image into the StyleGAN latent space to get a latent vector w_s.
...

2/

3. Now, given a source latent code w_s ∈ W+ and a directive in natural language (a text prompt t), we iteratively minimize the sum of three losses by changing the latent code w:
a) Distance between the image generated by StyleGAN and the text query;
...


Artsiom Sanakoyeu

@artsiom_s

29 Mar

Swin Transformer: New SOTA backbone for Computer Vision🔥

👉 What?
New vision Transformer architecture called Swin Transformer that can serve as a backbone in computer vision instead of CNNs.

📝 arxiv.org/abs/2103.14030
⚒ Code (soon) github.com/microsoft/Swin…

Thread 👇

2/
❓Why?
There are two main problems with the usage of Transformers for computer vision.
1. Existing Transformer-based models have tokens of a fixed scale. However, in contrast to word tokens, visual elements can differ in scale (e.g. objects of varying sizes in an image).

3/
2. Regular self-attention requires a number of operations that is quadratic in the image size, limiting applications in computer vision where high resolution is necessary (e.g., instance segmentation).
...
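
To make the quadratic cost concrete, here is a small arithmetic sketch. It assumes ViT-style global self-attention over non-overlapping square patches; the patch size of 16 and the image resolutions are illustrative choices, not numbers from the thread.

```python
def attention_pairs(image_side, patch_side):
    """Return (number of tokens, number of token pairs scored by global self-attention)."""
    tokens = (image_side // patch_side) ** 2
    return tokens, tokens * tokens

for side in (224, 448, 896):
    tokens, pairs = attention_pairs(side, patch_side=16)
    print(f"{side}x{side} image: {tokens} tokens, {pairs} attention pairs")
```

With these assumptions a 224x224 image gives 196 tokens and 38,416 pairs, while a 448x448 image gives 784 tokens and 614,656 pairs: 4x the pixels, 16x the attention pairs.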


Artsiom Sanakoyeu

@artsiom_s

23 Mar

🔥New DALL-E? Paint by Word 🔥

Edit a generated image by painting a mask at any location of the image and specifying any text description. Or generate a full image based only on textual input.

📝arxiv.org/abs/2103.10951
1/

2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”

3/
🛠️Two nets:
(1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x) which embeds images and C_t(t) which embeds text.
(2) a generative network G(z) that is trained to ...


Artsiom Sanakoyeu

@artsiom_s

23 Mar

Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning

❓How?
Eliminate region-wise prediction and instead meta-learn object localization and classification at image level in a unified and complementary manner.

🛠️arxiv.org/abs/2103.11731

1/K ...👇

Specifically, Meta-DETR first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. ...
2/K

The authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
3/K

