ML TLDR
Not active anymore. Used to be run by @gowthami_s and @kamalgupta09.

May 31, 2021, 7 tweets

"Attention is all you need" is one of the most cited papers in last couple of years. What is attention? Let's try to understand in this thread.

Paper link: arxiv.org/abs/1706.03762

#DeepLearning #MachineLearning #Transformers

In the self-attention mechanism, we update the features of a given point with respect to the features of all the other points. The attention proposed in this paper is also known as scaled dot-product attention.

Let's say our data point is a single sentence. We embed each word into some d-dimensional space, compute how similar each point is to every other point, and weigh its representation accordingly. The similarity matrix is just a scaled dot product!
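Here's a minimal NumPy sketch of that idea (not code from the paper; the embeddings and dimensions are random toy values):

```python
import numpy as np

d = 4                              # embedding dimension (toy value)
X = np.random.randn(5, d)          # 5 "words", each a d-dimensional embedding

# Similarity of every word with every other word: a scaled dot product.
similarity = X @ X.T / np.sqrt(d)  # shape (5, 5)
print(similarity.shape)
```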

In practice, this is how we compute it. For each feature, we calculate three vectors: query, key, and value. For a given feature, we take the dot product of its query with the key vectors of all the features and scale the result to get the similarity matrix.
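A quick sketch of that step, again in NumPy, with random placeholder weights standing in for the learned projection matrices:

```python
import numpy as np

d = 4
X = np.random.randn(5, d)          # 5 token embeddings

# Learned projections in a real model; random placeholders here.
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Dot product of each query with every key, scaled by sqrt(d):
scores = Q @ K.T / np.sqrt(d)      # shape (5, 5), the similarity matrix
```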

Then we take the softmax of the similarity matrix. The output is nothing but the similarity-weighted sum of the value vectors! That's it! Very simple, right?!
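Putting the last two steps together, a self-contained sketch of scaled dot-product attention (softmax over the similarity matrix, then a weighted sum of the value vectors):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled similarity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted sum of value vectors

# Toy check with random queries, keys, and values.
Q, K, V = (np.random.randn(5, 4) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```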

We can extend this mechanism to multiple heads too! Check out this article for clear illustrations and explanations - jalammar.github.io/illustrated-tr…
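For completeness, a rough multi-head sketch (random placeholder projections, 2 heads; real implementations batch the heads instead of looping over them):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=2):
    n, d = X.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (placeholder) query/key/value projections.
        W_q, W_k, W_v = (np.random.randn(d, d_head) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)
    # Concatenate the heads and mix them with a final output projection.
    W_o = np.random.randn(num_heads * d_head, d)
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.randn(5, 8)
print(multi_head_attention(X).shape)  # (5, 8)
```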

And credits for the above images go to this blog. Check it out for step-by-step illustrations and code - towardsdatascience.com/illustrated-se…

Thank you for reading. If you think this thread helped you learn something new, do retweet and follow us!
