Language modeling trains models to predict the next word.
Sometimes, the completion requires knowledge of factual information. Other times, familiarity with language is enough (expressions, grammar).
Examples in the image. Completions: 1) 2021 2) time
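As a side note, here's a minimal sketch of next-word prediction with an off-the-shelf model via Hugging Face transformers. GPT-2 and the prompt are just illustrative choices, not the examples from the image:

```python
# Minimal sketch: next-word prediction with an off-the-shelf language model.
# The model and prompt are illustrative, not the thread's examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Once upon a", max_new_tokens=1, do_sample=False)
print(out[0]["generated_text"])  # greedy decoding will likely continue with "time"
```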
Large GPTs had to encode everything they know in their model parameters. This makes sense for language information, but it's inefficient for world knowledge (there are so many facts).
Now the language model can be much smaller, and a neural database helps it with retrieval.
3/n
This way, you get the following benefits: 1) The core language model can be much smaller, which means it can be faster and easier to deploy on smaller GPUs.
2) To add new information to the model, you (may be able to) simply update the database without re-training.
4/n
Mechanically, it's an encoder-decoder model just like the original transformer, T5, or T0.
It uses the help of a neural database to augment its input, however.
5/n
The database looks like this.
It's a key-value store. The keys are standard BERT sentence embeddings.
The value is text in two parts: 1- Neighbor, which is used to compute the key 2- Completion, the continuation of the text in the original document.
Retro's database is 2 trillion tokens.
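A minimal sketch of what such a store could look like in Python. The neighbor/completion fields follow the thread; the embedding function is a stand-in you'd plug in:

```python
# Minimal sketch of a Retro-style key-value retrieval store.
# Keys: BERT sentence embeddings of a text chunk ("neighbor").
# Values: the chunk itself plus its continuation in the source document.
from dataclasses import dataclass

@dataclass
class Value:
    neighbor: str    # text chunk used to compute the key
    completion: str  # continuation of that chunk in the original document

class RetrievalStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # e.g. a BERT sentence-embedding function
        self.keys = []             # list of embedding vectors
        self.values = []           # list of Value records

    def add(self, neighbor: str, completion: str):
        self.keys.append(self.embed_fn(neighbor))
        self.values.append(Value(neighbor, completion))
```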
How is the database incorporated?
This is the process:
Before hitting Retro, the input prompt actually goes into BERT.
The output contextualized vectors are then averaged to construct a sentence embedding vector.
That vector is then used to query the database.
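A hedged sketch of that embedding step with Hugging Face transformers. bert-base-uncased and the prompt are my assumptions here, not necessarily the exact setup Retro uses; the point is the mean-pooling of the contextualized token vectors:

```python
# Sketch: embed the input prompt with BERT and average the token vectors
# into a single sentence embedding used to query the database.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the contextualized token vectors into one sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query_vector = sentence_embedding("An illustrative input prompt")
```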
7/n
That sentence embedding is then used in an approximate nearest neighbor search (using: github.com/google-researc…).
The two nearest neighbors are retrieved, and their text becomes a part of the input into Retro.
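Retro uses an approximate nearest-neighbor library for this (linked above). As a simpler stand-in, here's what the exact version of the lookup looks like, returning the two closest database entries:

```python
# Stand-in for the ANN lookup: exact nearest-neighbor search over the key matrix.
# (Retro uses an approximate library at 2T-token scale; this just shows the idea.)
import numpy as np

def retrieve_neighbors(query_vector, keys, values, k=2):
    keys = np.asarray(keys)                            # shape: (num_entries, dim)
    distances = np.linalg.norm(keys - query_vector, axis=1)
    top_k = np.argsort(distances)[:k]                  # indices of the k closest keys
    return [values[i] for i in top_k]                  # each value holds neighbor + completion
```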
8/n
This is now the input to Retro. The input prompt and its two nearest neighbors from the database (and their continuations).
From here, the Transformer and Retro Blocks incorporate the information into their processing.
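Roughly, the augmented input could be assembled like this, reusing the Value records from the store sketch above. The exact chunking and formatting Retro uses is more involved; this is just the gist:

```python
# Sketch: the retrieved neighbors and their continuations accompany the prompt
# into Retro (neighbors go through the encoder, the prompt through the decoder).
def build_retro_inputs(prompt, retrieved):
    neighbor_texts = [v.neighbor + " " + v.completion for v in retrieved]
    return {"decoder_input": prompt, "encoder_inputs": neighbor_texts}
```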
Architecture: An encoder stack and a decoder stack.
The 7.5B parameter model has 32 layers, so I'm thinking 16 in the encoder and 16 in the decoder (counting the parameters should verify this).
10/n
The encoder seems to be made of standard Transformer encoder blocks (self-attention + FFNN).
The decoder stack interleaves two kinds of decoder blocks:
- Decoder Block (Attn + FFNN)
- Retro Decoder Block (Attn + Chunked cross attention [CCA] + FFNN)
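A hedged PyTorch sketch of a Retro decoder block as described above. Plain cross-attention stands in for CCA, the causal mask is omitted, and the dimensions are arbitrary examples, not the paper's:

```python
# Sketch of a Retro decoder block: self-attention, then chunked cross-attention
# (CCA) over the encoded retrieved neighbors, then a feed-forward network.
import torch
import torch.nn as nn

class RetroDecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # stand-in for CCA
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, encoded_neighbors):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # Attend to the encoder's representation of the retrieved neighbors.
        x = x + self.cca(x, encoded_neighbors, encoded_neighbors, need_weights=False)[0]
        return x + self.ffnn(x)
```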
11/n
Correction: I now see the decoder is 32 layers.
Every third block starting from 9 is a Retro block (one that allows its input to attend to the neighbors). So 9, 12, 15 ... 32.
Decoder blocks only work on the input text. No enc-dec cross-attention in the model aside from CCA.
The biggest update is that forward diffusion is now more precisely explained -- not as a process of steps (which are easy to confuse with the de-noising steps), but as a way of creating training examples.
-1-
Forward diffusion is the process of making training examples: sample an image, sample noise, sample an amount of noise, and mix them to create a training example.
-2-
Do this with lots of images and lots of noise samples & amounts, and there's a training dataset for your model -- the noise prediction Unet.
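A minimal sketch of that example-creation step. The linear mixing schedule here is a simplification of the schedules actually used:

```python
# Sketch: forward diffusion as training-example creation.
# Sample an image, sample noise, sample a noise amount, mix them.
# The target the noise-prediction UNet learns is the sampled noise.
import torch

def make_training_example(image: torch.Tensor, num_levels: int = 1000):
    noise = torch.randn_like(image)             # sampled noise
    level = torch.randint(0, num_levels, (1,))  # sampled noise amount (timestep)
    alpha = 1.0 - level.float() / num_levels    # simplified linear schedule
    noisy_image = alpha.sqrt() * image + (1 - alpha).sqrt() * noise
    # Model input: (noisy_image, level); training target: noise
    return noisy_image, level, noise
```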
So @nickfrosst lives in the future of LLMs, it feels to me. We have these internal demo sessions, and what Nick builds and presents often feels magical.
What this means to me, more precisely, is the dexterity of problem-solving with LLM primitives that break a problem into a pipeline of components: 1) Regex 2) GPT + prompt X 3) pipe that into API Z 4) Embedding then similarity search 5) GPT + prompt Y
This is partly why I feel "Generative AI" is limited in describing the latest wave of what's possible with AI.
Representation (and by extension retrieval & classification) is just as important as Generation, but much more reliable in its results.
Over 30 visuals explaining how Stable Diffusion works (diffusion, latent diffusion, CLIP, and a lot more).
When generating an image with Stable Diffusion, it's useful to think of 3 main components in the process.
1- Text encoder, which translates words into numbers 2- Image information creator, which takes multiple steps refining the image information 3- Image decoder, which paints the final image
-2-
Diffusion is the process that takes place inside the pink “image information creator” component.
It's a step-by-step process that produces an information array that the image decoder uses to paint the final image.
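A hedged sketch of how those three components chain together. The callables are placeholders standing in for the CLIP text encoder, the UNet + scheduler denoising step, and the VAE image decoder:

```python
# Sketch of the three-component Stable Diffusion pipeline.
# text_encoder, unet_step, and image_decoder are placeholder callables.
import torch

def generate(prompt: str, text_encoder, unet_step, image_decoder, steps: int = 50):
    text_embeddings = text_encoder(prompt)        # 1- words -> numbers
    latents = torch.randn(1, 4, 64, 64)           # start from random image information
    for t in range(steps):                        # 2- step-by-step refinement (diffusion)
        latents = unet_step(latents, t, text_embeddings)
    return image_decoder(latents)                 # 3- paint the final image
```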
The intelligence of generative LLMs is surprising. But people often overestimate it in areas while underestimating it in others.
Key concepts: 1- It's important to augment their information with the right sources.
So it's not
Human: Question
GPT: Factual answer.
It's more
Human: question
System [step 1]
System [step 2]
System [step 3]
System: answer
2- To think of them as tools for surgically applying intelligence to subproblems, not as standalone intelligences themselves. Best if each module has its own tests and human verification of behavior.
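A toy sketch of such a multi-step system, with each step a small, individually testable module. The step callables are illustrative placeholders, not a specific library's API:

```python
# Toy sketch: a question goes through a pipeline of small steps instead of a
# single "GPT: factual answer" call. Each step can be tested and verified on its own.
def system(question: str, normalize_step, gather_step, answer_step) -> str:
    q = normalize_step(question)       # step 1: e.g. a simple regex cleanup
    passages = gather_step(q)          # step 2: e.g. embedding + similarity search
    return answer_step(q, passages)    # step 3: e.g. GPT + an answering prompt
```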
In txt.cohere.ai/building-a-sea…, we show a couple of the tools from the playbook of "surgical application of language AI".
- Rewriting a question to include previous conversation context
- Retrieving relevant information using web search
- Answering now becomes extraction and..
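A hedged sketch of those steps as code. The three callables are placeholders for the LLM prompts and the search API described in the post, not actual function names from it:

```python
# Sketch of the conversational-search flow: rewrite the question with prior
# conversation context, search the web for relevant passages, then answer by
# extracting from those passages.
def conversational_answer(history: list, question: str,
                          rewrite, web_search, extract) -> str:
    standalone = rewrite(history, question)  # fold conversation context into the question
    results = web_search(standalone)         # retrieve relevant passages from the web
    return extract(standalone, results)      # answering becomes extraction over the results
```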