Today we will summarize Vision Transformer (ViT) from Google. Inspired by BERT, the authors apply a very similar Transformer architecture to image classification tasks.
Link: arxiv.org/abs/2010.11929
Code: github.com/google-researc…
#MachineLearning #DeepLearning
The authors have taken the BERT architecture and applied it to images with minimal changes. Since self-attention compute grows quadratically with sequence length, instead of treating each pixel as a word they propose to split the image into N patches and treat each patch as a token.
So first, each patch is flattened into a vector of length P²·C and linearly projected to dimension D. A learnable D-dimensional [class] embedding is prepended at position 0, and learnable positional embeddings are added to all of these token embeddings.
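Here's a minimal PyTorch sketch of this patch-embedding step (not the authors' code; the image size, patch size, and embedding dimension below are illustrative assumptions):

```python
import torch
import torch.nn as nn

B, C, H, W = 8, 3, 224, 224   # batch of images
P, D = 16, 768                # patch size and embedding dimension (assumed values)
N = (H // P) * (W // P)       # number of patches (196 here)

x = torch.randn(B, C, H, W)

# Split the image into N non-overlapping P x P patches and flatten each to length P*P*C
patches = x.unfold(2, P, P).unfold(3, P, P)                     # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# Linear projection of each flattened patch to dimension D
proj = nn.Linear(C * P * P, D)
tokens = proj(patches)                                          # (B, N, D)

# Prepend a learnable [class] token and add learnable positional embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1) + pos_embed  # (B, N+1, D)
```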
The resulting (N+1) × D sequence is passed through the Transformer encoder. From the encoder output we keep only the embedding at the [class] position (the first row) and pass it through an MLP head to get the classification output.
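Continuing the sketch above, the token sequence goes through an encoder and the [class] output through a small head. This is only an approximation of the paper's block (ViT uses pre-LayerNorm and GELU, while nn.TransformerEncoderLayer defaults differ), and the depth, heads, and class count are assumed for illustration:

```python
# Transformer encoder (illustrative hyperparameters, not the paper's exact config)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=D, nhead=12, dim_feedforward=4 * D, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

# Classification head applied to the [class] token's output
num_classes = 1000
mlp_head = nn.Sequential(nn.LayerNorm(D), nn.Linear(D, num_classes))

z = encoder(tokens)           # (B, N+1, D)
logits = mlp_head(z[:, 0])    # use only the first position -> (B, num_classes)
```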
⚡️Main result⚡️ ViT-H/14 (the Vision Transformer Huge model with 14×14 patches, pretrained on JFT-300M, with 632M parameters!) performs best on many vision datasets compared to ResNet152.
The authors empirically show that even the lower layers of the Transformer attend to pixels that are far apart (i.e., a wider receptive field than CNNs at comparable depth).
Attention maps computed with Attention Rollout appear to highlight the image regions relevant to the class.
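For reference, here is a minimal sketch of the Attention Rollout computation (Abnar & Zuidema, 2020), assuming `attentions` is a list of per-layer attention tensors of shape (B, num_heads, N+1, N+1) collected from the encoder; that variable and its shape are assumptions, not part of the paper's released code:

```python
import torch

def attention_rollout(attentions):
    B, _, T, _ = attentions[0].shape
    rollout = torch.eye(T).expand(B, T, T)
    for attn in attentions:
        a = attn.mean(dim=1)                  # average over heads -> (B, T, T)
        a = 0.5 * a + 0.5 * torch.eye(T)      # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)   # renormalize rows
        rollout = a @ rollout                 # accumulate attention flow across layers
    # attention of the [class] token to each image patch
    return rollout[:, 0, 1:]
```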
Read the full summary here - medium.com/ml-summaries/v…
This summary is contributed by @gowthami_s.
We missed attributing the GIF earlier; this is its source and a PyTorch implementation of the code: github.com/lucidrains/vit….
Also if you think this thread is helpful to you, please retweet and follow us! 🙂