Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned.
This gives an opportunity to utilize such an alignment to automatically mine similar pairs of images for training
1/
The basic idea is to fix K random locations in the unlabeled medical images (K locations are the same for every image) and crop image patches across different images (which correspond to scans of different patients).
...
2/
Now we create a surrogate classification task by assigning a unique pseudo-label to every location 1...K.
Authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, ...
3/ ... random noise, random cut-outs) and train a decoder to restore the original view.
However, sometimes the alignment between medical images is not perfect by default and images may depict different body parts.
To make sure that the images which we use to crop patches ...
4/ ... are aligned, we train an autoencoder on full images (before cropping) and select only similar images by comparing the distances between them in the learned autoencoder latent space.
Authors show that their method is significantly better than ...
5/ ... other self-supervised learning approaches on medical data and can even be combined with existing self-supervised methods like RotNet (predicting random rotations of the image).
6/ But unfortunately, the comparison is rather limited, and they didn't compare to Jigsaw Puzzle, SwaV, or recent contrastive self-supervised methods like MoCO, BYOL, and SimCLR.
1/
🛠️How? 1. Take pretrained CLIP, pretrained StyleGAN, and pretrained ArcFace network for face recognition. 2. Project an input image in StyleGAN latent vector w_s.
...
2/
3. Now, given a source latent code w_s∈ W+, and a directive in natural language, or a text prompt t, we iteratively minimize the sum of three losses by changing the latent code w:
a) Distance between generated by StyleGAN image and the text query;
...
2/
❓Why?
There are two main problems with the usage of Transformers for computer vision. 1. Existing Transformer-based models have tokens of a fixed scale. However, in contrast to the word tokens, visual elements can be different in scale (e.g. objects of varying sizes in img)
3/ 2. Regular self-attention requires quadratic of the image size number of operations, limiting applications in computer vision where high resolution is necessary (e.g., instance segmentation).
...
Edit a generated image by painting a mask atany location of the image and specifying any text description. Or generate a full image just based on textual input.
2/ Point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.”
3/
🛠️Two nets: (1) a semantic similarity network C(x, t) that scores the semantic consistency between an image x and a text description t. It consists of two subnetworks: C_i(x) which embeds images and C_t(t) which embeds text. (2) generative network G(z) that is trained to ...
Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
❓How?
Eliminate region-wise prediction and instead meta-learn object localization and classification at image level in a unified and complementary manner.
Specifically, the Meta-DETR first encodes both support and query images into category-specific
features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. ...
2/K
Authors propose a Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. ...
3/K
2/ It is the largest (afaik) publicly available GPT-3 replica. The primary goal of this project is to replicate a full-sized GPT-3 model and open source it to the public, for free.
The models were trained on an open-source dataset The Pile pile.eleuther.ai which ...
To learn about differences between the two -> thread 👇
1/ The main idea is to factorize the voxel color representation into two independent components: one that depends only on positions p=(x,y,z) of the voxel and one that depends only on the ray directions v.
Essentially you predict K different (R,G,B) values for ever voxel...
2/ Essentially you predict K different (R,G,B) values for ever voxel and K weighting scalars H_i(v) for each of them:
color(x,y,z) = RGB_1 * H_1 + RGB_2 * H_2 + ... + RGB_K * H_K. This is inspired by the rendering equation.
...