2/
❓Why?
There are two main problems with applying Transformers to computer vision. 1. Existing Transformer-based models use tokens of a fixed scale. However, unlike word tokens, visual elements can vary widely in scale (e.g., objects of different sizes in an image)
3/ 2. Regular self-attention requires a number of operations quadratic in the image size, limiting its applications in computer vision, where high resolution is often necessary (e.g., instance segmentation).
...
4/ 🥊 Main ideas of Swin Transformers: 1. Hierarchical feature maps: at each level of the hierarchy, self-attention is applied within local non-overlapping windows. The image region covered by each window grows progressively with network depth as patches are merged (inspired by CNNs). ...
5/
...This enables building architectures similar to feature pyramid networks (FPN) or U-Net for dense pixel-level tasks.
2. Window-based self-attention reduces the computational cost from quadratic to linear in the number of image patches.
6/ ⚙️ Overall Architecture consists of repeating the following blocks:
- Split the RGB image into non-overlapping patches (tokens).
- Apply a linear embedding layer (MLP) to project the raw patch features to an arbitrary dimension.
- Apply 2 consecutive Swin Transformer blocks with window self-attention: both ...
7/ ...blocks use the same window size, but the second block shifts its windows by half the window size (window_size/2), which allows information flow between the otherwise non-overlapping windows
- Downsampling layer: reduce the number of tokens 4x by merging neighboring patches in each 2x2 window, and double the feature depth (see the sketch below).
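To make this concrete, here is a minimal PyTorch sketch of the (shifted) window self-attention step. It assumes window_size divides H and W; the names and the plain nn.MultiheadAttention are my simplifications, not the official Swin code (which also adds relative position biases and masks the wrapped-around regions after the shift).

```python
import torch
import torch.nn as nn

def window_partition(x, window_size):
    # (B, H, W, C) -> (B * num_windows, window_size**2, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_reverse(windows, window_size, B, H, W):
    # Inverse of window_partition.
    C = windows.shape[-1]
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class WindowSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, shifted=False):
        B, H, W, C = x.shape
        s = self.window_size // 2
        if shifted:
            # Second block of the pair: shift by half the window size so the new
            # windows straddle the previous block's window borders.
            x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        w = window_partition(x, self.window_size)
        # Attention is computed only inside each window, so the cost grows
        # linearly with H * W instead of quadratically.
        out, _ = self.attn(w, w, w)
        x = window_reverse(out, self.window_size, B, H, W)
        if shifted:
            x = torch.roll(x, shifts=(s, s), dims=(1, 2))
        return x
```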
8/
🦾 Results
+ Outperforms the previous SOTA by a significant margin on COCO object detection and instance segmentation and on ADE20K semantic segmentation.
+ Comparable accuracy to the EfficientNet family on ImageNet-1K classification, while being faster.
👌Conclusion
While Transformers are super flexible, researchers are starting to inject into them inductive biases similar to those in CNNs, e.g., local connectivity and feature hierarchies. And this seems to help tremendously!
That's it!
👉 Join my Telegram channel "Gradient Dude" not to miss the latest posts like this! t.me/gradientdude
Edit a generated image by painting a mask at any location of the image and specifying any text description. Or generate a full image based on textual input alone.
2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”
3/
🛠️Two nets: (1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x), which embeds images, and C_t(t), which embeds text. (2) a generative network G(z) that is trained to ...
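A rough sketch of how such a two-subnetwork score C(x, t) could be computed, assuming CLIP-style cosine matching in a shared embedding space (the paper's exact architectures and matching function may differ):

```python
import torch.nn as nn
import torch.nn.functional as F

class SemanticSimilarity(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.C_i = image_encoder  # embeds images into a shared space
        self.C_t = text_encoder   # embeds text into the same space

    def forward(self, x, t):
        zi = F.normalize(self.C_i(x), dim=-1)
        zt = F.normalize(self.C_t(t), dim=-1)
        # Higher score = image x is more consistent with description t.
        return (zi * zt).sum(dim=-1)
```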
Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
❓How?
Eliminate region-wise prediction and instead meta-learn object localization and classification at image level in a unified and complementary manner.
Specifically, Meta-DETR first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for the specific categories. ...
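For intuition only, a toy PyTorch sketch of such an image-level flow: support features condition the query features, and a category-agnostic decoder predicts boxes plus a "matches the support class" score. All names, shapes, and heads here are my assumptions, not Meta-DETR's actual code.

```python
import torch
import torch.nn as nn

class ImageLevelMetaDetector(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        # Fuse query-image features with pooled support-class features.
        self.support_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.queries = nn.Embedding(num_queries, dim)  # category-agnostic object queries
        self.box_head = nn.Linear(dim, 4)    # (cx, cy, w, h)
        self.match_head = nn.Linear(dim, 1)  # does this box match the support class?

    def forward(self, query_feats, support_feats):
        # query_feats: (B, HW, dim) from a shared backbone
        # support_feats: (B, S, dim) pooled features of support images of one class
        fused, _ = self.support_attn(query_feats, support_feats, support_feats)
        q = self.queries.weight.unsqueeze(0).expand(query_feats.size(0), -1, -1)
        h = self.decoder(q, fused)  # image-level: no per-region proposals
        return self.box_head(h).sigmoid(), self.match_head(h)
```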
2/K
The authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
3/K
2/ It is the largest (afaik) publicly available GPT-3 replica. The primary goal of this project is to replicate a full-sized GPT-3 model and open source it to the public, for free.
The models were trained on The Pile (pile.eleuther.ai), an open-source dataset which ...
To learn about differences between the two -> thread 👇
1/ The main idea is to factorize the voxel color representation into two independent components: one that depends only on the position p=(x,y,z) of the voxel and one that depends only on the ray direction v.
Essentially you predict K different (R,G,B) values for every voxel...
2/ Essentially you predict K different (R,G,B) values for every voxel and K weighting scalars H_i(v) for each of them:
color(x,y,z) = RGB_1 * H_1(v) + RGB_2 * H_2(v) + ... + RGB_K * H_K(v). This is inspired by the rendering equation.
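A tiny sketch of that mixing step (function and variable names are mine):

```python
import torch

def factored_color(rgb_components, weights):
    # rgb_components: (..., K, 3) -- RGB_i(p), depends on position only
    # weights:        (..., K)    -- H_i(v), depends on view direction only
    return (weights.unsqueeze(-1) * rgb_components).sum(dim=-2)  # (..., 3)

rgb = torch.rand(1024, 8, 3)                     # K=8 components per sampled point
h = torch.softmax(torch.randn(1024, 8), dim=-1)  # view-dependent weights
color = factored_color(rgb, h)                   # (1024, 3)
```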
...
... they train a regressor network to predict the latent code from an input image. To teach the regressor to predict the latent code for images w/ missing pixels, they mask random patches during training.
Now, given an input collage, the regressor projects it into a reasonable...
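A hedged sketch of the random-patch masking described above (the patch size, masking probability, and names are assumptions):

```python
import torch

def mask_random_patches(img, patch=32, p=0.3):
    # img: (B, C, H, W); zero out each patch-aligned cell with probability p
    B, C, H, W = img.shape
    keep = (torch.rand(B, 1, H // patch, W // patch, device=img.device) > p).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return img * keep

# One illustrative training step, in the thread's notation:
#   z_pred = regressor(mask_random_patches(G(z)))  # infer the code...
#   loss = F.mse_loss(z_pred, z)                   # ...from a partial image
```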