In the self-attention mechanism, we update the features of a given point with respect to the features of all the other points. The attention proposed in this paper is also known as scaled dot-product attention.
Let's say our data point is a single sentence. We embed each word into some d-dimensional space, compute how similar each word is to every other word, and weigh its representation accordingly. The similarity matrix is just a scaled dot product!
In practice, this is how we execute it. For each feature, we calculate 3 vectors: key, query and value. For a given feature, we take the dot product of its query with the key vectors of all the features and scale it (by 1/√d_k, where d_k is the key dimension) to get its row of the similarity matrix.
Then we take the softmax over each row of the similarity matrix. The output is nothing but the similarity-matrix-weighted sum of the value vectors! That's it! Very simple, right?!
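The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: the function name, the random projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of feature vectors x."""
    q = x @ w_q                      # queries, shape (n_tokens, d_k)
    k = x @ w_k                      # keys,    shape (n_tokens, d_k)
    v = x @ w_v                      # values,  shape (n_tokens, d_v)
    d_k = q.shape[-1]
    # Similarity matrix: scaled dot product of every query with every key.
    scores = q @ k.T / np.sqrt(d_k)
    # Row-wise softmax (subtracting the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Output: similarity-weighted sum of the value vectors.
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.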
We can extend this mechanism to multiple heads too! Check out this article for clear illustrations and explanations - jalammar.github.io/illustrated-tr…
This paper shares 56 stories of researchers in Computer Vision, young and old, scientists and engineers. Reading it was a cocktail of emotions as you simultaneously relate to the stories of joy, excitement, cynicism, and fear. Give it a read!
Some quotes from the stories - it was a "tough and hopeless time" in computer vision "before 2012, [when] the annual performance improvements over ImageNet are quite marginal."
"she told me you should solve the problem purely based on deep learning... I did not think the occlusion problem can be solved without explicitly reasoning of shape priors and depth ordering"
Today we will summarize the Vision Transformer (ViT) from Google. Inspired by BERT, the authors apply the same architecture to image classification tasks.
The authors have taken the BERT architecture and applied it to images with minimal changes. Since the compute increases with the length of the sequence, instead of treating each pixel as a word, they propose to split the image into N patches and treat each patch as a token.
So first take each patch, flatten it (giving a vector of length P²C for a P×P patch with C channels), and project it linearly to dimension D. Then prepend a learnable D-dimensional embedding at the 0th position. Finally, add positional encodings to these embeddings.
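Here is a rough NumPy sketch of that input pipeline, assuming a channels-last image and non-overlapping square patches. The helper `patchify`, the toy sizes, and the random values standing in for learned parameters (`w_proj`, `cls_token`, `pos_embed`) are all assumptions for illustration, not the authors' code.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened patches of length P*P*C."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

# Toy sizes, chosen only for illustration.
H = W = 32
C, P, D = 3, 8, 16
rng = np.random.default_rng(0)
image = rng.normal(size=(H, W, C))

patches = patchify(image, P)                  # (N, P*P*C) with N = 16
n_patches = patches.shape[0]

# Random stand-ins for parameters that would be learned during training.
w_proj = rng.normal(size=(P * P * C, D)) * 0.02
cls_token = np.zeros((1, D))
pos_embed = rng.normal(size=(n_patches + 1, D)) * 0.02

# Linear projection, extra embedding at position 0, then positional encoding.
tokens = np.concatenate([cls_token, patches @ w_proj], axis=0)
tokens = tokens + pos_embed                   # (N + 1, D)
```

With a 32×32 image and P = 8 the sequence has only 17 tokens, instead of the 1024 it would have if every pixel were a token.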
Depending on the problem we are trying to solve, the loss function varies. Today we are going to learn about triplet loss. You must have heard about it while reading about Siamese networks.
Triplet loss is an important loss function for learning a good “representation”. What’s a representation you ask? Finding similarity (or difference) between two images is hard if you just use pixels.
So what do we do about it - given three images cat1, cat2, dog, we use a neural network to map the images to vectors f(cat1), f(cat2), and f(dog).
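To make this concrete, here is a small sketch of the commonly used hinge form of the triplet loss, max(d(anchor, positive) - d(anchor, negative) + margin, 0). The hand-picked 2-D vectors standing in for f(cat1), f(cat2) and f(dog), the Euclidean distance, and the margin of 1.0 are assumptions for the example (some formulations use squared distances instead).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the positive should be closer to the
    anchor than the negative is, by at least `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical embeddings; in practice these come from the network f.
f_cat1 = [0.0, 0.0]
f_cat2 = [3.0, 4.0]   # distance 5 from f_cat1
f_dog = [6.0, 8.0]    # distance 10 from f_cat1

good = triplet_loss(f_cat1, f_cat2, f_dog)   # cats close, dog far
bad = triplet_loss(f_cat1, f_dog, f_cat2)    # roles swapped
```

The loss is zero when the two cats are already closer to each other than to the dog by at least the margin, and positive otherwise, which is what pushes the network towards a useful representation.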
To get the intuition behind the Machine Learning algorithms, we need to have some background in Math, especially Linear Algebra, Probability & Calculus. Consolidating a few cheat-sheets here. A thread 👇
For Linear Algebra: Topics include vector spaces, matrix-vector operations, rank of a matrix, norms, eigenvectors and eigenvalues, and a bit of matrix calculus too.