"Attention is all you need" is one of the most cited papers in last couple of years. What is attention? Let's try to understand in this thread.
Paper link: arxiv.org/abs/1706.03762
#DeepLearning #MachineLearning #Transformers
In the self-attention mechanism, we update the representation of a given point with respect to all the other points. The attention proposed in this paper is also known as scaled dot-product attention.
Let's say our data point is a single sentence. We embed each word into some d-dimensional space, compute how similar each word is to every other word, and weight its representation accordingly. The similarity matrix is just a scaled dot product!
In practice, this is how we execute it. For each feature, we calculate 3 vectors: query, key and value. For a given feature, we take the dot product of its query with the key vectors of all the features and scale by the square root of the key dimension to get the similarity matrix.
Then we take the softmax of the similarity matrix. The output is nothing but a similarity-weighted sum of the value vectors! That's it! Very simple, right?!
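Here is a minimal NumPy sketch of that whole step. The toy sentence length, dimensions, and random projection matrices are just illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product similarity matrix
    weights = softmax(scores, axis=-1)   # row-wise attention weights
    return weights @ V                   # weighted sum of the value vectors

# Toy example: a "sentence" of 4 words embedded in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # word embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # projections (random here, learned in practice)
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                         # (4, 8): one updated vector per word
```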
We can extend this mechanism to multiple heads too! Check out this article for clear illustrations and explanations - jalammar.github.io/illustrated-tr…
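And here is a rough sketch of the multi-head extension, reusing the attention function from the snippet above. The head count and dimensions are illustrative (the paper itself uses 8 heads): each head attends in its own lower-dimensional subspace, and the heads are concatenated and mixed by a final linear projection.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    # Reuses scaled_dot_product_attention from the previous sketch.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Each head attends over its own slice of the projected dimensions.
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the heads and mix them with a final linear projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```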
And credits for the above images go to this blog. Check it out for step-by-step illustrations and code - towardsdatascience.com/illustrated-se…
Thank you for reading. If you think this thread helped you learn something new, do retweet and follow us!