Language modeling trains models to predict the next word.
Sometimes, the completion requires knowledge of factual information. Other times, familiarity with language is enough (expressions, grammar).
Examples in the image. Completions: 1) 2021 2) time
Large GPTs had to encode everything they know in their model parameters. This makes sense for language data, but it's inefficient for knowledge/factual information (there are so many facts).
Now the Language model can be much smaller, and a neural database helps it with retrieval.
3/n
This way, you get the following benefits: 1) The core language model can be much smaller, which means it can be faster and easier to deploy on smaller GPUs.
2) To add new information to the model, you (may be able to) simply update the database without re-training.
4/n
Mechanically, it's an encoder-decoder model just like the original transformer, T5, or T0.
It uses the help of a neural database to augment its input, however.
5/n
The database looks like this.
It's a key-value store. The keys are standard BERT embeddings.
The value is text in two parts: 1- Neighbor, which is used to compute the key 2- Completion, the continuation of the text in the original document.
Retro's database is 2 trillion tokens
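A minimal sketch of that key-value layout (the NeighborEntry name and toy entries are mine, not the paper's):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NeighborEntry:
    """One value in the retrieval database (hypothetical structure)."""
    neighbor: str    # text chunk used to compute the BERT key
    completion: str  # continuation of that chunk in the original document

# Keys: one BERT embedding per chunk (random stand-ins here, assumed 768-dim).
keys = np.random.randn(3, 768).astype("float32")
values = [
    NeighborEntry("The 2020 Summer Olympics were held in", " Tokyo in 2021."),
    NeighborEntry("Einstein published general relativity in", " 1915."),
    NeighborEntry("Heathrow airport serves the city of", " London."),
]
# Retrieval = nearest-neighbor search over `keys`, returning the matching `values`.
```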
How is the database incorporated?
This is the process:
Before hitting Retro, the input prompt actually goes into BERT.
The output contextualized vectors are then averaged to construct a sentence embedding vector.
That vector is then used to query the database.
7/n
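A rough sketch of that querying step with Hugging Face transformers (my own approximation of the averaging, not DeepMind's code):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

prompt = "The Dune film was released in the year"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# Average the contextualized token vectors into a single sentence embedding;
# this vector is what gets matched against the database keys.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```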
That sentence embedding is then used in an approximate nearest neighbor search (using: github.com/google-researc…).
The two nearest neighbors are retrieved, and their text becomes a part of the input into Retro.
8/n
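Here's a brute-force stand-in for that lookup, just to show the shape of the operation (the real system uses the approximate-nearest-neighbor library linked above; variable names are mine):

```python
import numpy as np

def retrieve_neighbors(query_vec, keys, k=2):
    """Exact L2 nearest-neighbor search; Retro uses approximate search at 2T-token scale."""
    dists = np.linalg.norm(keys - query_vec, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 768)).astype("float32")  # stand-in BERT keys
query = rng.standard_normal(768).astype("float32")         # stand-in sentence embedding

top_two = retrieve_neighbors(query, keys, k=2)
# The neighbor + completion text stored under these two keys is then
# concatenated with the prompt to form Retro's augmented input.
print(top_two)
```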
This is now the input to Retro. The input prompt and its two nearest neighbors from the database (and their continuations).
From here, the Transformer and Retro Blocks incorporate the information into their processing.
Architecture: An encoder stack and a decoder stack.
The 7.5B parameter model has 32 layers, so I'm thinking 16 in the encoder and 16 in the decoder (counting the parameters should verify this).
10/n
The encoder seems to be made of standard Transformer encoder blocks (self-attention + FFNN).
The decoder stack interleaves two kinds of decoder blocks:
- Decoder Block (Attn + FFNN)
- Retro Decoder Block (Attn + Chunked cross attention [CCA] + FFNN)
11/n
Correction: I now see the decoder is 32 layers.
Every third block starting from 9 is a Retro block (one that allows its input to attend to the retrieved neighbors). So 9, 12, 15...32.
Decoder blocks only work on the input text. No enc-dec cross-attention in the model aside from CCA.
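In pseudo-PyTorch, the interleaving might look something like this (the block classes are placeholders, not the paper's code):

```python
import torch.nn as nn

NUM_DECODER_LAYERS = 32

class DecoderBlock(nn.Module):       # placeholder: self-attention + FFNN
    ...

class RetroDecoderBlock(nn.Module):  # placeholder: self-attention + chunked cross-attention (CCA) + FFNN
    ...

def is_retro_layer(layer_idx):
    # Every third layer starting from 9 gets CCA over the retrieved neighbors.
    return layer_idx >= 9 and (layer_idx - 9) % 3 == 0

decoder = nn.ModuleList(
    RetroDecoderBlock() if is_retro_layer(i) else DecoderBlock()
    for i in range(1, NUM_DECODER_LAYERS + 1)
)
```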
Ecstatic to see "Machine learning research communication via illustrated and interactive web articles" published at @rethinkmlpapers workshop at #ICLR2021
In it, I describe my workflow for communicating ML to millions of readers.
I discuss five key ML communication artifacts: 1- The hero image 2- The Twitter thread 3- The illustrated article 4- The interactive article 5- Interpretability software
For illustrated/animated articles, I discuss the importance of empathy towards the reader, putting intuition first, and iteratively creating a visual language to describe concepts, and I reflect on pedagogical considerations.
Einsum is a key method for summing and multiplying tensors. It's implemented in @numpy_team, @TensorFlow, and @PyTorch. Here's a visual intro to Einstein summation functions.
1/n
The einsum expression has two components: 1) The subscripts string -- notation indicating what we want to do with the input arrays. 2) The input arrays -- one or more arrays/tensors of varying dimensions.
The subscript follows the notation of mathematical formulas.
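For example, a matrix multiplication spelled out with those two components (toy arrays of my own):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)

# Component 1: the subscripts string "ij,jk->ik"
# Component 2: the input arrays A and B
C = np.einsum("ij,jk->ik", A, B)  # equivalent to A @ B
```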
Here's an example of summing the rows of one array (minimal code version below).
Einsum sees that 1) The input array has two dimensions. 2) The subscripts have two unique letters.
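A minimal code version of that example (my own toy array; one reading of "summing the rows"):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# "ij->i": two letters for the two input dimensions; dropping "j" from the
# output sums over it, leaving one value per row.
row_sums = np.einsum("ij->i", a)  # array([ 6, 15])
```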
The covariance matrix is an essential tool for analyzing relationships in data. In numpy, you can use the np.cov() function to calculate it (numpy.org/doc/stable/ref…).
Here's a shot at visualizing the elements of the covariance matrix and what they mean: 1/5
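A tiny example with made-up numbers (the figure uses its own data):

```python
import numpy as np

# np.cov's default layout: each row is a variable, each column an observation.
data = np.array([[1.0, 4.0, 2.0, 5.0],   # variable x
                 [2.0, 8.0, 3.0, 9.0]])  # variable y

cov = np.cov(data)
# cov[0, 0] and cov[1, 1] are the variances of x and y;
# cov[0, 1] == cov[1, 0] is the covariance between them.
print(cov)
```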
Along the diagonal are the variances of each row. Variance indicates the spread of the values (1 and 4 in this case) from their average.
2/5
Another variance calculation along the diagonal. You can also calculate it with Wolfram Alpha: wolframalpha.com/input/?i=varia… 3/5
New post! Visualizations glancing at the "thought process" of language models & how it evolves between layers. Builds on awesome work by @nostalgebraist, @lena_voita, @tallinzen. 1/n
If we ask GPT2 to fill-in the blank:
Heathrow airport is located in the city of ___
Would it fill it correctly?
Yes.
Which layer stored that knowledge?
The various layers successively increased the ranking of the word "London", but Layer 5 ultimately raised it from 37 to 1.
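Roughly how such per-layer rankings can be read out with a logit-lens-style probe (a sketch with Hugging Face transformers, not the exact code behind the visualization):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Heathrow airport is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
target_id = tokenizer.encode(" London")[0]

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's last-position hidden state through the output head
# and check where " London" ranks in that layer's vocabulary scores.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    rank = (logits[0] > logits[0, target_id]).sum().item() + 1
    print(f"layer {layer}: rank of ' London' = {rank}")
```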
Another view (also powered by eccox.io) compares the rankings of two words to fill one blank.
Which layers would recognize that "are" is the correct word for this blank?
Only Layer 5. All others rank "is" higher than "are".
So many fascinating ideas at yesterday's #blackboxNLP workshop at #emnlp2020. Too many bookmarked papers. Some takeaways: 1- There's more room to adopt input saliency methods in NLP, with Grad*input and Integrated Gradients being key gradient-based methods (a bare-bones sketch follows these takeaways).
2- NLP language models (GPT2-XL especially -- rightmost in graph) accurately predict neural responses in the human brain. The next-word prediction task robustly predicts neural scores. @IbanDlank @martin_schrimpf @ev_fedorenko
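The Grad*input sketch mentioned in takeaway 1, for a Hugging Face GPT-2 (simplified; a real setup would aggregate and normalize more carefully):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The keys to the cabinet", return_tensors="pt")
embeds = model.transformer.wte(inputs["input_ids"]).detach().requires_grad_(True)

# Score of the model's top next-token prediction at the last position.
logits = model(inputs_embeds=embeds).logits
score = logits[0, -1, logits[0, -1].argmax()]
score.backward()

# Grad*input: elementwise product of gradient and embedding, summed per token.
saliency = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), saliency.tolist()):
    print(f"{tok:>12}: {s:+.3f}")
```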
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model "learned" during its training period where it scanned vast amounts of text.
1/n
Training is the process of exposing the model to lots of text. It has been done once and is complete. All the experiments you see now are from that one trained model. It was estimated to take 355 GPU-years and cost $4.6m.
2/n
The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
You can see how you can slide a window across all the text and make lots of examples.
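A toy version of that sliding window over a tokenized sentence (my own simplification; GPT-3 actually works on subword tokens and much longer windows):

```python
def make_training_examples(tokens):
    """Slide a window across the text: each prefix predicts the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "a robot must obey the orders given it".split()
for context, target in make_training_examples(sentence):
    print(" ".join(context), "->", target)
```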