Jay Alammar
Dec 25, 2021 · 13 tweets
A 🧵 looking at DeepMind's Retro Transformer, which at 7.5B parameters is on par with GPT-3 and models 25X its size on knowledge-intensive tasks.

A big moment for Large Language Models (LLMs) for reasons I'll mention in this thread.

deepmind.com/research/publi…
Language modeling trains models to predict the next word.

Sometimes, the completion requires knowledge of factual information. Other times, familiarity with language is enough (expressions, grammar).

Examples in the image. Completions:
1) 2021
2) time
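To make this concrete, here's a minimal next-word-prediction sketch using an off-the-shelf GPT-2 via Hugging Face transformers. The two prompts are hypothetical stand-ins for the examples in the image, one knowledge-intensive and one purely linguistic:

```python
# Minimal next-word prediction sketch with an off-the-shelf GPT-2
# (Hugging Face transformers). Prompts are hypothetical stand-ins.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Knowledge-intensive: completing this requires a memorized fact.
print(generator("The capital of France is", max_new_tokens=1))

# Language-intensive: familiarity with grammar/expressions is enough.
print(generator("She took a deep breath and opened the", max_new_tokens=1))
```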
Large GPTs had to encode everything they know in their model parameters. This makes sense for language data, but it's inefficient for knowledge information (there are so many facts).

Now the language model can be much smaller, and a neural database helps it with retrieval.

3/n
This way, you get the following benefits:
1) The core language model can be much smaller, which means it can be faster and easier to deploy on smaller GPUs.

2) To add new information to the model, you (may be able to) simply update the database without re-training.

4/n
Mechanically, it's an encoder-decoder model, just like the original Transformer, T5, or T0.

However, it uses a neural database to augment its input.

5/n
The database looks like this.

It's a key-value store. The key is a standard BERT embedding.

The value is text in two parts:
1- Neighbor, which is used to compute the key
2- Completion, the continuation of the text in the original document.

Retro's database holds 2 trillion tokens.
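A minimal sketch of how I picture one database entry, based on the description above; the field names and the example text are my own, not the paper's:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RetroDbEntry:
    key: np.ndarray   # BERT embedding computed from the neighbor chunk
    neighbor: str     # the text chunk the key was computed from
    completion: str   # continuation of that chunk in the original document

db = [
    RetroDbEntry(
        key=np.random.rand(768),  # placeholder for a real BERT vector
        neighbor="Dune is a 2021 film directed by",
        completion="Denis Villeneuve, based on Frank Herbert's novel.",
    ),
    # ... scaled up to 2 trillion tokens in the real database
]
```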
How is the database incorporated?

This is the process:

Before hitting Retro, the input prompt actually goes into BERT.

The output contextualized vectors are then averaged to construct a sentence embedding vector.

That vector is then used to query the database.

7/n
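A sketch of that querying step: mean-pool BERT's output vectors into one sentence embedding. The exact checkpoint and pooling details here are my assumptions, not necessarily the paper's:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Average the contextualized token vectors into a single query vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query = sentence_embedding("The Dune sequel was announced in")  # hypothetical prompt
```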
That sentence embedding is then used in an approximate nearest neighbor search (using: github.com/google-researc…).

The two nearest neighbors are retrieved, and their text becomes a part of the input into Retro.

8/n
This is now the input to Retro. The input prompt and its two nearest neighbors from the database (and their continuations).

From here, the Transformer and Retro Blocks incorporate the information into their processing.
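Putting the last two tweets together as a sketch: the paper uses the approximate nearest-neighbor library linked above, but this brute-force stand-in shows the idea, reusing `db` and `query` from the earlier sketches:

```python
import numpy as np

def retrieve(query: np.ndarray, db: list, k: int = 2) -> list:
    # Brute-force dot-product search; a stand-in for the real ANN library.
    scores = [float(np.dot(query, entry.key)) for entry in db]
    ranked = sorted(range(len(db)), key=lambda i: scores[i], reverse=True)
    return [db[i] for i in ranked[:k]]

neighbors = retrieve(query.numpy(), db, k=2)

# The prompt plus each neighbor and its continuation become Retro's input.
retro_input = {
    "prompt": "The Dune sequel was announced in",
    "retrieved": [n.neighbor + " " + n.completion for n in neighbors],
}
```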
Architecture: An encoder stack and a decoder stack.

The 7.5B parameter model has 32 layers. So I'm thinking 16 in the encoder and 16 in the decoder (counting the parameters should verify this).

10/n
The encoder seems to be made of standard Transformer encoder blocks (self-attention + FFNN).

The decoder stack interleaves two kinds of decoder blocks:
- Decoder Block (Attn + FFNN)
- Retro Decoder Block (Attn + Chunked cross attention [CCA] + FFNN)

11/n
Correction: I now see the decoder is 32 layers.

Every third block starting from 9 is a Retro block (one that allows its input to attend to the retrieved neighbors). So 9, 12, 15 ... up to layer 32.

Decoder blocks only work on the input text. No enc-dec cross-attention in the model aside from CCA.

12/n
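As a quick sketch of that layout (my reading of the correction above; block numbering starts at 1):

```python
# Decoder layout: 32 blocks; every third block starting at 9 is a Retro
# block with chunked cross-attention (CCA) over the retrieved neighbors.
layers = [
    "RetroDecoderBlock (Attn + CCA + FFNN)"
    if i >= 9 and (i - 9) % 3 == 0
    else "DecoderBlock (Attn + FFNN)"
    for i in range(1, 33)
]

for i, block in enumerate(layers, start=1):
    print(f"{i:2d}: {block}")
```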
The paper builds on previous retrieval work, including:
openreview.net/forum?id=B184E… @EXGRV et al
arxiv.org/abs/2102.02557
@DaniYogatama et al
openreview.net/forum?id=HklBj… @ukhndlwl et al

and others


More from @JayAlammar

Dec 12, 2022
New model alert!

@CohereAI's new embedding model supports 100+ languages and delivers 3X better performance than existing open-source models.

See the post by @Nils_Reimers and @amrmkayid: txt.cohere.ai/multilingual/
A glance at the benchmarks comparing it to:
- paraphrase-multilingual-mpnet-base-v2
- LaBSE
- Universal Sentence Encoder cMLM

(more details in the post)
Of course, I had to take the model for a spin and see how it visualizes texts (embed => UMAP).
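For anyone who wants to reproduce that flow, here's a sketch with the Cohere SDK and umap-learn; treat the model identifier and parameters as assumptions from the announcement post:

```python
import cohere
import umap

co = cohere.Client("YOUR_API_KEY")
texts = ["Hello world", "Bonjour le monde", "Hallo Welt", "Hola mundo"]

# Embed: one multilingual vector per text.
embeddings = co.embed(texts=texts, model="multilingual-22-12").embeddings

# UMAP: project the vectors down to 2D for plotting.
coords = umap.UMAP(n_neighbors=3, random_state=42).fit_transform(embeddings)
```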
Nov 25, 2022
Big update to "The Illustrated Stable Diffusion" post

jalammar.github.io/illustrated-st…

14 new and updated visuals.

The biggest update is that forward diffusion is more precisely explained -- not as a process of steps (that are easy to confuse with de-noising steps).

-1-
Forward diffusion is the process of making training examples: sample an image, sample noise, pick an amount of noise, and mix them into a training example.

-2-
Do this with lots of images and lots of noise samples & amounts, and you have a training dataset for your model -- the noise-prediction U-Net.

-3-
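A sketch of that recipe in code, with a toy noise schedule standing in for the real one:

```python
import torch

def make_training_example(image: torch.Tensor, num_steps: int = 1000):
    noise = torch.randn_like(image)        # sample noise
    t = torch.randint(0, num_steps, (1,))  # sample an amount of noise
    # Toy schedule: how much image vs. noise survives at step t.
    alpha_bar = torch.cos(t / num_steps * torch.pi / 2) ** 2
    noisy = alpha_bar.sqrt() * image + (1 - alpha_bar).sqrt() * noise
    # Training pair: given (noisy, t), the U-Net learns to predict `noise`.
    return noisy, t, noise

noisy, t, target = make_training_example(torch.rand(3, 64, 64))
```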
Nov 4, 2022
So @nickfrosst lives in the future of LLMs, it feels to me. We have these internal demo sessions, and what Nick builds and presents often feels magical.



This video on github.com/cohere-ai/sand… is a taste of that.

Blog post: txt.cohere.ai/building-a-sea…
What this means to me, more precisely, is the dexterity of problem-solving with LLM primitives that break a problem into a pipeline of components:
1) Regex
2) GPT + prompt X
3) pipe that into API Z
4) Embedding, then similarity search
5) GPT + prompt Y
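A hypothetical sketch of such a pipeline; every helper below is an illustrative stub, not Cohere's actual API:

```python
import re

def generate(prompt: str) -> str:           # stub for an LLM call
    return f"<completion for: {prompt[:40]}>"

def call_api_z(query: str) -> list[str]:    # stub for some external API
    return [f"document matching '{query}'"]

def rank_by_similarity(question: str, docs: list[str]) -> list[str]:
    return docs[:1]                         # stub for embed + similarity search

def answer(question: str) -> str:
    cleaned = re.sub(r"\s+", " ", question).strip()           # 1) regex
    query = generate("Turn into a search query: " + cleaned)  # 2) GPT + prompt X
    docs = call_api_z(query)                                  # 3) pipe into API Z
    top = rank_by_similarity(cleaned, docs)                   # 4) embedding + search
    return generate(f"Answer '{cleaned}' using: {top}")       # 5) GPT + prompt Y
```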
This is partly why I feel "Generative AI" is limited in describing the latest wave of what's possible with AI.

Representation (and by extension retrieval & classification) is just as important as Generation, but much more reliable in its results.
Oct 4, 2022
The Illustrated Stable Diffusion

jalammar.github.io/illustrated-st…

New post!

Over 30 visuals explaining how Stable Diffusion works (diffusion, latent diffusion, CLIP, and a lot more).
When generating an image with Stable Diffusion, it's useful to think of 3 main components in the process.

1- A text encoder, which translates words into numbers
2- An image information creator, which takes multiple steps to refine the image information
3- An image decoder, which paints the final image

-2-
Diffusion is the process that takes place inside the pink “image information creator” component.

It's a step-by-step process that produces an information array that the image decoder uses to paint the final image.

-3-
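Those three components map onto the diffusers library, if you want to poke at them yourself; a sketch, assuming the standard v1.5 checkpoint name:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.text_encoder  # 1- text encoder (CLIP): words -> numbers
pipe.unet          # 2- image information creator: step-by-step refinement
pipe.vae           # 3- image decoder: latents -> final pixels

image = pipe("an astronaut riding a horse").images[0]
```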
Sep 20, 2022
AI image generation is the most recent mind-blowing AI capability.

#StableDiffusion is a clear milestone in this development because it made a high-performance model available to the masses.

This is how it works.

1/n
It is versatile in that it can be used in a number of different ways.

Text => Image (like the image above) is the main use case.

Another one is (Image + Text) => Image (like this image). This is called img2img.

2/n
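A sketch of the img2img use case with the diffusers library (checkpoint name and settings are assumptions):

```python
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
init_image = Image.open("sketch.png").convert("RGB")

# strength controls how far diffusion may drift from the input image.
result = pipe(prompt="a detailed oil painting of a castle",
              image=init_image,
              strength=0.75).images[0]
```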
Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

As we look under the hood, the first distinction we can make is that there’s:

1- A text understanding component
2- An image generation component

3/n
Sep 19, 2022
The intelligence of generative LLMs is surprising. But people often overestimate it in some areas while underestimating it in others.

Key concepts:
1- It's important to augment their information with the right sources.

So it's not
Human: Question
GPT: Factual answer.

It's more
Human: question
System [step 1]
System [step 2]
System [step 3]
System: answer

2- To think of them as tools for surgically applying intelligence to subproblems, not as standalone intelligences themselves.

Best if each module has its own tests and human verification of behavior.
In txt.cohere.ai/building-a-sea…, we show a couple of the tools from the playbook of "surgical application of language AI".

- Rewriting a question to include previous conversation context
- Retrieving relevant information using web search
- Answering now becomes extraction and..
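A sketch of those three steps chained together, with illustrative stubs standing in for the real LLM and search calls (none of these names are Cohere's API):

```python
def llm(prompt: str) -> str:              # stub for a generate call
    return f"<completion for: {prompt[:40]}>"

def web_search(query: str) -> list[str]:  # stub for a search API
    return [f"snippet about '{query}'"]

def answer(question: str, history: list[str]) -> str:
    # 1) Rewrite the question to include previous conversation context.
    standalone = llm(f"History: {history}\nRewrite as standalone: {question}")
    # 2) Retrieve relevant information using web search.
    snippets = web_search(standalone)
    # 3) Answering becomes extraction over the retrieved snippets.
    return llm(f"Extract the answer to '{standalone}' from: {snippets}")
```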