"Attention is all you need" is one of the most cited papers of the last couple of years. What is attention? Let's try to understand it in this thread.

In the self-attention mechanism, we update the features of a given point with respect to the features of all the other points. The attention proposed in this paper is also known as scaled dot-product attention.

Let's say our data point is a single sentence. We embed each word into some d-dimensional space, compute how similar each point is to every other point, and weigh its representation accordingly. The similarity matrix is just a scaled dot product!
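To make the similarity idea concrete, here is a minimal NumPy sketch. The embeddings are random numbers standing in for real word vectors, and the names `X`, `n_words` and `d` are made up for illustration; the sqrt(d) scaling follows the paper's convention.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d = 4, 8                  # hypothetical sentence length and embedding size
X = rng.normal(size=(n_words, d))  # random stand-in for real word embeddings

# Entry (i, j) is the scaled dot product between word i and word j.
similarity = X @ X.T / np.sqrt(d)

print(similarity.shape)  # one row and one column per word
```

Because this version uses the raw embeddings on both sides, the matrix is symmetric: word i is as similar to word j as j is to i.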

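In practice, this is how it is computed. For each feature, we calculate 3 vectors: a key, a query and a value (in the paper, these are learned linear projections of the feature). For a given feature, we take the dot product of its query with the key vectors of all the features and scale it by 1/sqrt(d_k), where d_k is the key dimension. Doing this for every feature gives the similarity matrix.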
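Here is a rough sketch of that step, assuming the three vectors come from linear projections. The matrices `W_q`, `W_k` and `W_v` are random placeholders for weights that would normally be learned, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 4, 8, 6          # hypothetical sizes
X = rng.normal(size=(n, d_model))  # one row per feature (e.g. per word)

# Random placeholders for the projection weights (learned in a real model).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Row i holds the scaled dot products of query i with every key.
scores = Q @ K.T / np.sqrt(d_k)

print(scores.shape)
```

Since queries and keys use different projections, `scores` is generally not symmetric: how much feature i attends to j can differ from how much j attends to i.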

Then we take the softmax over each row of the similarity matrix, so the weights for each feature sum to 1. The output is nothing but the sum of the value vectors, weighted by these softmax weights! That's it! Very simple, right?!
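Putting the pieces together, the whole thing fits in a few lines of NumPy. This is only a sketch of single-head attention without masking or batching; the function name and the toy inputs are my own choices, not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for queries Q, keys K and values V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    # Row-wise softmax; subtracting the row max keeps exp() from overflowing.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 6))
K = rng.normal(size=(4, 6))
V = rng.normal(size=(4, 3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.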

