Einsum is a key method for summing and multiplying tensors. It's implemented in @numpy_team, @TensorFlow, and @PyTorch. Here's a visual intro to Einstein summation functions.

1/n
The einsum expression has two components.
1) Subscripts string -- notation indicating what we want to do with the input arrays.
2) The input arrays. There can be one or more arrays/tensors of varying dimensions.
The subscripts string follows the notation of mathematical formulas.
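To make the two components concrete, here is a minimal NumPy sketch. The array `A` and the subscripts strings are illustrative choices, not taken from the thread.

```python
import numpy as np

# Component 2: the input array (a made-up 2x3 example).
A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Component 1: the subscripts string. Letters left of '->' label the
# input's axes; letters right of '->' say which axes the output keeps.
transposed = np.einsum('ij->ji', A)   # same letters, swapped order
total = np.einsum('ij->', A)          # no letters kept: sum everything

print(transposed.shape)   # (3, 2)
print(total)              # 15
```

The same subscripts string should work with `torch.einsum` and `tf.einsum`, though that is not checked here.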

Here's an example of summing the rows of one array.

Einsum sees that
1) The input array has two dimensions.
2) The subscripts have two unique letters.

It assigns the letters to the axes.

And then sums over the axis whose letter is missing from the output.
This is how the subscripts correspond to the important pieces of information in the mathematical formula.
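A rough sketch of that correspondence, assuming the subscripts string is 'ij->i' (the exact string from the original image isn't in the text, so treat it as a guess). It mirrors the formula b_i = Σ_j A_ij: `i` survives into the output, `j` gets summed away.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# 'ij->i': i labels the rows, j labels the columns.
# j does not appear after '->', so einsum sums over it.
row_sums = np.einsum('ij->i', A)

# The same thing written as the explicit formula b_i = sum_j A_ij.
manual = np.zeros(A.shape[0], dtype=A.dtype)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        manual[i] += A[i, j]

# Keeping j instead collapses the rows into one row of column totals.
col_sums = np.einsum('ij->j', A)

print(row_sums)   # [ 6 15]
print(col_sums)   # [5 7 9]
```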
This is how it maps for multiplication. The comma separates the subscripts of each input, and einsum multiplies those inputs together. It is necessary when we pass more than one array/tensor to einsum.
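A small sketch of the two-input case, using made-up 2x2 arrays. Each comma-separated group of letters labels one input; a letter shared by both inputs but missing from the output (here `j`) is multiplied and then summed.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# 'ij,jk->ik': multiply A[i, j] * B[j, k] and sum over j.
# This is ordinary matrix multiplication.
matmul = np.einsum('ij,jk->ik', A, B)

# 'ij,ij->ij': every letter is kept, so nothing is summed.
# This is plain element-wise multiplication.
elementwise = np.einsum('ij,ij->ij', A, B)

print(matmul)        # [[19 22]
                     #  [43 50]]
print(elementwise)   # [[ 5 12]
                     #  [21 32]]
```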

A lot more examples in @_rockt's excellent post: rockt.github.io/2018/04/30/ein…

