New post! Visualizations glancing at the "thought process" of language models & how it evolves between layers. Builds on awesome work by @nostalgebraist, @lena_voita, and @tallinzen. 1/n
If we ask GPT2 to fill in the blank:
Heathrow airport is located in the city of ___
Would it fill it correctly?
Yes.
Which layer stored that knowledge?
The layers successively improve the rank of the word "London", and Layer 5 ultimately raises it from 37 to 1.
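A minimal sketch of this layer-by-layer ranking idea (in the spirit of the logit lens): project each layer's hidden state through the unembedding matrix and rank the target token. Random stand-in weights here, not real GPT2 ones.

```python
import numpy as np

# Toy stand-ins: in a real model, hidden_states come from each transformer
# layer and W_U is the unembedding (output projection) matrix.
rng = np.random.default_rng(0)
vocab_size, d_model, n_layers = 50, 16, 6

W_U = rng.normal(size=(d_model, vocab_size))          # stand-in unembedding
hidden_states = rng.normal(size=(n_layers, d_model))  # one vector per layer

def token_rank(hidden, W_U, token_id):
    """Project a hidden state to vocab logits; return the token's rank (1 = top)."""
    logits = hidden @ W_U
    # rank = 1 + number of tokens scoring strictly higher
    return 1 + int((logits > logits[token_id]).sum())

target = 7  # pretend this is the token id of " London"
ranks = [token_rank(h, W_U, target) for h in hidden_states]
print(ranks)  # one rank per layer; in the GPT2 example this ends at 1
```

With a real model you'd collect the per-layer hidden states (e.g. via `output_hidden_states=True` in transformers) and reuse the same ranking step.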
Another view (also powered by eccox.io) compares the rankings of two words to fill one blank.
Which layers would recognize that "are" is the correct word for this blank?
Only Layer 5. All others rank "is" higher than "are".
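The same projection trick can compare two candidate tokens at every layer. A toy sketch, again with random stand-in weights and made-up token ids for "is" and "are":

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model, n_layers = 50, 16, 6
W_U = rng.normal(size=(d_model, vocab_size))          # stand-in unembedding
hidden_states = rng.normal(size=(n_layers, d_model))  # one vector per layer

id_is, id_are = 3, 4  # hypothetical token ids for "is" and "are"
preferences = []
for h in hidden_states:
    logits = h @ W_U
    preferences.append("are" if logits[id_are] > logits[id_is] else "is")
print(preferences)  # which of the two words each layer ranks higher
```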
If you're new to language models, Ecco also provides a simple view for the scores of output tokens.
You can visualize the output of the model (so only the last layer), and you can also see the top-scoring tokens for all the layers.
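Under the hood, that scores view is just a softmax over the final logits followed by a top-k sort. A toy sketch with an invented five-word vocabulary and made-up logits:

```python
import numpy as np

vocab = ["London", "Paris", "the", "a", "Heathrow"]
logits = np.array([6.1, 3.2, 1.5, 1.0, 0.3])  # hypothetical model scores

# Numerically stable softmax turns logits into probabilities
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k candidates by probability
top_k = sorted(zip(vocab, probs), key=lambda t: -t[1])[:3]
for tok, p in top_k:
    print(f"{tok:10s} {p:.3f}")
```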
So many fascinating ideas at yesterday's #blackboxNLP workshop at #emnlp2020. Too many bookmarked papers. Some takeaways: 1- There's more room to adopt input saliency methods in NLP, with Grad*input and Integrated Gradients being key gradient-based methods.
2- NLP language models (GPT2-XL especially -- rightmost in the graph) accurately predict neural responses in the human brain. The next-word prediction task robustly predicts neural scores. @IbanDlank @martin_schrimpf @ev_fedorenko
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model "learned" during its training, when it scanned vast amounts of text.
1/n
Training is the process of exposing the model to lots of text. It was done once and is complete. All the experiments you see now are from that one trained model. Training was estimated to take 355 GPU-years and cost $4.6M.
2/n
The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
You can see how you can slide a window across all the text and make lots of examples.
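That sliding-window generation of training examples can be sketched in a few lines (word-level tokens for clarity; GPT-3 actually uses subword tokens, and make_examples is an illustrative helper, not a real library function):

```python
def make_examples(tokens, context_size):
    """Yield (context, next_token) training pairs from one token sequence."""
    examples = []
    for i in range(1, len(tokens)):
        # the window of up to context_size tokens before position i
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

tokens = "a robot must obey the orders given it".split()
for context, target in make_examples(tokens, context_size=4)[:3]:
    print(context, "->", target)
```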
On the transformer side of #acl2020nlp, three works stood out to me as relevant if you've followed the Illustrated Transformer/BERT series on my blog: 1- SpanBERT 2- BART 3- Quantifying Attention Flow
(1/n)
SpanBERT (by @mandarjoshi_, @danqi_chen, @YinhanL, @dsweld, @LukeZettlemoyer, and @omerlevy_) came out last year but was published at this year's ACL. It found that BERT pre-training is better when you mask contiguous spans of tokens, rather than BERT's 15% scattered tokens.
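A toy contrast of the two masking schemes (scattered_mask and span_mask are illustrative helpers, not SpanBERT's actual implementation, which samples span lengths from a geometric distribution):

```python
import random

def scattered_mask(tokens, ratio=0.15, seed=0):
    """BERT-style: mask a scattered ~15% of token positions."""
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n))
    return ["[MASK]" if i in idx else t for i, t in enumerate(tokens)]

def span_mask(tokens, span_len=3, seed=0):
    """SpanBERT-style: mask one contiguous span (fixed length for simplicity)."""
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len + 1)
    return ["[MASK]" if start <= i < start + span_len else t
            for i, t in enumerate(tokens)]

tokens = "an American football game is played in the super bowl".split()
print(scattered_mask(tokens))
print(span_mask(tokens))
```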