"Attention is all you need" is one of the most cited papers in last couple of years. What is attention? Let's try to understand in this thread.
Paper link: arxiv.org/abs/1706.03762
#DeepLearning #MachineLearning #Transformers
In the self-attention mechanism, we update the representation of a given point with respect to all the other points. The attention proposed in this paper is also known as scaled dot-product attention.
Let's say our data point is a single sentence. We embed each word into some d-dimensional space, compute how similar each word is to every other word, and weight its representation accordingly. The similarity matrix is just a scaled dot product!
In practice, this is how we execute it. For each feature, we calculate 3 vectors: query, key and value. For a given feature, we take the dot product of its query with the key vectors of all the features and scale by the square root of the key dimension to get the similarity matrix.
Then we take the softmax of the similarity matrix. The output is nothing but a similarity-weighted sum of the value vectors! That's it! Very simple, right?!
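Here is a minimal NumPy sketch of that whole step. The toy sentence length, dimensions, and random projection matrices are just illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product similarity matrix
    weights = softmax(scores, axis=-1)   # row-wise attention weights
    return weights @ V                   # weighted sum of the value vectors

# Toy example: a "sentence" of 4 words embedded in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # word embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # projections (random here, learned in practice)
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                         # (4, 8): one updated vector per word
```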
We can extend this mechanism to multiple heads too! Check out this article for clear illustrations and explanations - jalammar.github.io/illustrated-tr…
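And here is a rough sketch of the multi-head extension, reusing the attention function from the snippet above. The head count and dimensions are illustrative (the paper itself uses 8 heads): each head attends in its own lower-dimensional subspace, and the heads are concatenated and mixed by a final linear projection.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    # Reuses scaled_dot_product_attention from the previous sketch.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Each head attends over its own slice of the projected dimensions.
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the heads and mix them with a final linear projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```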
And credits for the above images go to this blog. Check it out for step-by-step illustrations and code - towardsdatascience.com/illustrated-se…
Thank you for reading. If you think this thread helped you learn something new, do retweet and follow us!