More from @artsiom_s

Artsiom Sanakoyeu

@artsiom_s

1 Apr

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery 🔥

Use the CLIP model to guide image editing in StyleGAN with text queries.

📝Paper arxiv.org/abs/2103.17249
⚙️ code github.com/orpatashnik/St…

Thread 👇

1/
🛠️How?
1. Take pretrained CLIP, pretrained StyleGAN, and pretrained ArcFace network for face recognition.
2. Project the input image into the StyleGAN latent space to get a latent vector w_s.
...

2/

3. Now, given a source latent code w_s ∈ W+ and a directive in natural language (a text prompt t), we iteratively minimize the sum of three losses by changing the latent code w:
a) The distance between the image generated by StyleGAN and the text query;
...
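
A toy sketch of the optimization loop in step 3, in plain Python. Everything here is a stand-in: the "image embedding" of the generated image is just the latent w itself, `text_emb` is a made-up vector, and gradients are numerical. Only term (a) is named in the preview above; terms (b) and (c) are assumptions on my side, modeled on the L2 latent penalty and the ArcFace identity loss described in the StyleCLIP paper, and the weights `lam_l2` and `lam_id` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def total_loss(w, w_s, text_emb, lam_l2=0.1, lam_id=0.1):
    # (a) toy stand-in for the CLIP distance between G(w) and the text query
    clip_term = cosine_distance(w, text_emb)
    # (b) assumed: squared L2 distance to the source latent w_s
    l2_term = sum((x - y) ** 2 for x, y in zip(w, w_s))
    # (c) assumed: toy identity loss on the first two coordinates,
    #     standing in for an ArcFace feature comparison
    id_term = cosine_distance(w[:2], w_s[:2])
    return clip_term + lam_l2 * l2_term + lam_id * id_term


def numerical_grad(f, w, eps=1e-5):
    """Central finite differences; a real setup would use autograd."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def edit_latent(w_s, text_emb, steps=200, lr=0.05):
    """Start from w_s and take plain gradient steps on the summed loss."""
    w = list(w_s)
    for _ in range(steps):
        g = numerical_grad(lambda v: total_loss(v, w_s, text_emb), w)
        w = [x - lr * gx for x, gx in zip(w, g)]
    return w


w_s = [1.0, 0.5, 0.0, 0.2]        # made-up source latent
text_emb = [0.0, 0.5, 1.0, 0.2]   # made-up text embedding
w_edit = edit_latent(w_s, text_emb)
```

The edited latent moves toward the text embedding while the two penalty terms keep it from drifting arbitrarily far from w_s.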

Read 9 tweets

Artsiom Sanakoyeu

@artsiom_s

29 Mar

Swin Transformer: New SOTA backbone for Computer Vision🔥

👉 What?
A new vision Transformer architecture, Swin Transformer, that can serve as a backbone in computer vision instead of CNNs.

📝 arxiv.org/abs/2103.14030
⚒ Code (soon) github.com/microsoft/Swin…

Thread 👇

2/
❓Why?
There are two main problems with the usage of Transformers for computer vision.
1. Existing Transformer-based models have tokens of a fixed scale. However, in contrast to word tokens, visual elements can vary in scale (e.g., objects of different sizes in an image).

3/
2. Regular self-attention requires a number of operations that is quadratic in the image size, limiting applications in computer vision where high resolution is necessary (e.g., instance segmentation).
...
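
A quick back-of-the-envelope sketch of the quadratic cost in point 2. It only counts token pairs for global self-attention; the 4x4 patch size and the function names are assumptions for illustration.

```python
def num_tokens(height, width, patch=4):
    """Number of patch tokens for an image split into patch x patch tiles."""
    return (height // patch) * (width // patch)


def global_attention_pairs(height, width, patch=4):
    """Every token attends to every token, so the pair count is N * N."""
    n = num_tokens(height, width, patch)
    return n * n


for side in (224, 448, 896):
    tokens = num_tokens(side, side)
    pairs = global_attention_pairs(side, side)
    print(f"{side}x{side}: {tokens} tokens, {pairs} attention pairs")
```

Doubling the image side gives 4x the tokens and 16x the attention pairs, which is why global attention gets expensive at high resolution.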

Read 10 tweets

Artsiom Sanakoyeu

@artsiom_s

23 Mar

🔥New DALL-E? Paint by Word 🔥

Edit a generated image by painting a mask at any location of the image and specifying any text description. Or generate a full image based only on textual input.

📝arxiv.org/abs/2103.10951
1/

2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”

3/
🛠️Two nets:
(1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x) which embeds images and C_t(t) which embeds text.
(2) a generative network G(z) that is trained to ...
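
A minimal sketch of the scoring network in (1). It assumes C(x, t) is the cosine similarity between the two embeddings, as in CLIP-style models; the preview above does not spell out the scoring function. `embed_image`, `embed_text`, and the toy vectors are hypothetical stand-ins for C_i and C_t.

```python
import math

# Hypothetical text "embeddings"; a real C_t(t) would be a trained text encoder.
TOY_TEXT_EMBEDDINGS = {
    "happy dog": [1.0, 0.0, 0.2],
    "rustic": [0.0, 1.0, 0.1],
}


def embed_image(x):
    """Stand-in for C_i(x): the toy image is already a feature vector."""
    return x


def embed_text(t):
    """Stand-in for C_t(t): look the description up in the toy table."""
    return TOY_TEXT_EMBEDDINGS[t]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def C(x, t):
    """Score the consistency of image x with description t (assumed cosine)."""
    return cosine_similarity(embed_image(x), embed_text(t))


toy_image = [0.9, 0.1, 0.3]
score_dog = C(toy_image, "happy dog")
score_rustic = C(toy_image, "rustic")
```

Here the toy image scores higher against "happy dog" than against "rustic", which is the kind of signal the generator can be steered with.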

Read 16 tweets

Artsiom Sanakoyeu

@artsiom_s

23 Mar

Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning

❓How?
Eliminate region-wise prediction and instead meta-learn object localization and classification at the image level in a unified and complementary manner.

🛠️arxiv.org/abs/2103.11731

1/K ...👇

Specifically, Meta-DETR first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. ...
2/K

The authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
3/K

Read 5 tweets

Artsiom Sanakoyeu

@artsiom_s

23 Mar

An open-source 2.7-billion-parameter GPT-3 replica was released

github.com/EleutherAI/gpt…

As you probably know, OpenAI has not released source code or pre-trained weights for its 175-billion-parameter language model GPT-3.

A thread 👇

1/ Instead, OpenAI decided to create a commercial product and exclusively license GPT-3 to Microsoft.

But open-source enthusiasts from eleuther.ai have released the weights of the 1.3B and 2.7B parameter models from their replication of GPT-3.

🛠️github.com/EleutherAI/gpt…

2/ It is the largest (afaik) publicly available GPT-3 replica. The primary goal of this project is to replicate a full-sized GPT-3 model and open-source it to the public, for free.
The models were trained on an open-source dataset, The Pile (pile.eleuther.ai), which ...

Read 16 tweets

Artsiom Sanakoyeu

@artsiom_s

21 Mar

⚔️ FastNeRF vs NeX ⚔️

Smart ideas rarely occur to just one person. FastNeRF has the same idea as NeX, but a slightly different implementation. Which one is faster?

NeX nex-mpi.github.io
FastNeRF arxiv.org/abs/2103.10380

To learn about differences between the two -> thread 👇

1/ The main idea is to factorize the voxel color representation into two independent components: one that depends only on the position p=(x,y,z) of the voxel and one that depends only on the ray direction v.

https://twitter.com/artsiom_s/status/1373464655935471616?s=20

2/ Essentially you predict K different (R,G,B) values for every voxel and K weighting scalars H_i(v) for each of them:
color(x,y,z) = RGB_1 * H_1(v) + RGB_2 * H_2(v) + ... + RGB_K * H_K(v). This is inspired by the rendering equation.
...
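
A minimal sketch of the factorized color formula from 2/, in plain Python. The two lookup tables and their contents are made up for illustration. Storing the position-dependent and direction-dependent parts separately is my reading of why the factorization helps with speed, not something stated in the preview above.

```python
def combine_color(rgb_components, weights):
    """color = RGB_1 * H_1 + ... + RGB_K * H_K for one voxel and one ray direction.

    rgb_components: K (r, g, b) tuples that depend only on the voxel position.
    weights: K scalars H_i(v) that depend only on the ray direction v.
    """
    return tuple(
        sum(w * rgb[channel] for rgb, w in zip(rgb_components, weights))
        for channel in range(3)
    )


# Hypothetical lookup tables with K = 2 components.
position_table = {
    (0, 0, 0): [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # voxel -> K RGB values
}
direction_table = {
    "front": [0.25, 0.75],  # direction bin -> K weights H_i(v)
    "side": [0.75, 0.25],
}

color_front = combine_color(position_table[(0, 0, 0)], direction_table["front"])
color_side = combine_color(position_table[(0, 0, 0)], direction_table["side"])
```

The same voxel gets a different color from each viewing direction, but the only per-ray work left is a K-term weighted sum.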

Read 11 tweets
