Today we will summarize the Vision Transformer (ViT) from Google. Inspired by BERT, the authors apply essentially the same Transformer architecture to image classification tasks.

Link: arxiv.org/abs/2010.11929
Code: github.com/google-researc…

#MachineLearning #DeepLearning
The authors take the BERT architecture and apply it to images with minimal changes. Since self-attention compute grows quadratically with sequence length, instead of treating each pixel as a word they propose splitting the image into N patches and treating each patch as a token. For example, a 224×224 image split into 16×16 patches gives N = (224/16)² = 196 tokens, versus 50,176 tokens if every pixel were one.
So first take each patch, flatten it into a vector of length P²·C (patch size P, C channels), and project it linearly to dimension D. Then prepend a learnable D-dimensional [class] embedding at position 0, and add (learnt) positional embeddings to all of these token embeddings.
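A minimal PyTorch sketch of this flatten-and-project step (the class name `PatchEmbed` and the default sizes are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each one.
    A hypothetical minimal sketch; shapes follow the thread's notation:
    P = patch size, C = channels, D = embedding dimension."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2      # N
        # Each flattened patch has length P*P*C; project it to D.
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)
        self.patch_size = patch_size

    def forward(self, x):                       # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        # Cut into non-overlapping P x P tiles, then flatten each tile.
        x = x.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        return self.proj(x)                     # (B, N, D)
```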
The embeddings now form an (N+1) × D matrix, which we pass through the Transformer encoder. From the encoder output we take only the first row (the [class] token at position 0) and pass it through an MLP head to get the classification output.
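Putting it together, a rough sketch of the whole forward pass, reusing `PatchEmbed` from above. PyTorch's built-in `nn.TransformerEncoder` is used here as a stand-in for the paper's encoder (it differs in details such as norm placement), and the layer counts/dims are placeholders:

```python
class SimpleViT(nn.Module):
    """Illustrative ViT-style classifier; not the paper's exact configs."""
    def __init__(self, num_classes=1000, dim=768, depth=12, heads=12,
                 img_size=224, patch_size=16):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, 3, dim)
        n = self.patch_embed.num_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learnt [class] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))  # learnt positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                    # MLP head

    def forward(self, x):                        # x: (B, C, H, W)
        tokens = self.patch_embed(x)             # (B, N, D)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # (B, N+1, D)
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])           # classify from position 0
```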
⚡️Main result⚡️ ViT-H/14 (the Vision Transformer Huge model with 14×14 patches, pretrained on JFT-300M, with 632M parameters!) performs best on many vision benchmarks, outperforming the large ResNet-based baseline (BiT-L, a ResNet152x4).
The authors empirically show that even the lower layers of the Transformer attend to pixels far apart in the image, i.e., an effectively wider receptive field than the early layers of a CNN.
Attention maps computed with Attention Rollout seem to highlight the image regions relevant to the class.
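For reference, Attention Rollout recursively multiplies the attention matrices across layers, adding the identity to account for the residual connections. A short sketch, assuming you have already extracted per-layer attention maps averaged over heads:

```python
def attention_rollout(attns):
    """attns: list of (N+1, N+1) attention matrices, one per layer,
    already averaged over heads. Returns the rolled-out attention."""
    rollout = torch.eye(attns[0].shape[0])
    for a in attns:
        a = a + torch.eye(a.shape[0])         # identity for the residual path
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = a @ rollout                 # compose with earlier layers
    return rollout  # row 0: [class]-token attention over the patches
```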
Read the full summary here - medium.com/ml-summaries/v…

This summary was contributed by @gowthami_s.
We missed attributing the gif: it comes from this PyTorch implementation of ViT - github.com/lucidrains/vit….

Also, if you found this thread helpful, please retweet and follow us! 🙂


