Language modeling trains models to predict the next word.
Sometimes, the completion requires knowledge of factual information. Other times, familiarity with language is enough (expressions, grammar).
Examples in the image. Completions: 1) 2021 2) time
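As a side note, here's a minimal sketch of next-word prediction with an off-the-shelf model via Hugging Face transformers. GPT-2 and the prompt are just illustrative choices, not the examples from the image:

```python
# Minimal sketch: next-word prediction with an off-the-shelf language model.
# The model and prompt are illustrative, not the thread's examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Once upon a", max_new_tokens=1, do_sample=False)
print(out[0]["generated_text"])  # greedy decoding will likely continue with "time"
```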
Large GPTs had to encode everything they know in their model parameters. This makes sense for language information, but it's inefficient for world knowledge (there are so many facts).
Now the language model can be much smaller, and a neural database helps it with retrieval.
3/n
This way, you get the following benefits: 1) The core language model can be much smaller, which means it can be faster and easier to deploy on smaller GPUs.
2) To add new information to the model, you (may be able to) simply update the database without re-training.
4/n
Mechanically, it's an encoder-decoder model just like the original transformer, T5, or T0.
It uses the help of a neural database to augment its input, however.
5/n
The database looks like this.
It's a key-value store. The keys are standard BERT sentence embeddings.
The value is text in two parts: 1- Neighbor, which is used to compute the key 2- Completion, the continuation of the text in the original document.
Retro's database is 2 trillion tokens.
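A minimal sketch of what such a store could look like in Python. The neighbor/completion fields follow the thread; the embedding function is a stand-in you'd plug in:

```python
# Minimal sketch of a Retro-style key-value retrieval store.
# Keys: BERT sentence embeddings of a text chunk ("neighbor").
# Values: the chunk itself plus its continuation in the source document.
from dataclasses import dataclass

@dataclass
class Value:
    neighbor: str    # text chunk used to compute the key
    completion: str  # continuation of that chunk in the original document

class RetrievalStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # e.g. a BERT sentence-embedding function
        self.keys = []             # list of embedding vectors
        self.values = []           # list of Value records

    def add(self, neighbor: str, completion: str):
        self.keys.append(self.embed_fn(neighbor))
        self.values.append(Value(neighbor, completion))
```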
How is the database incorporated?
This is the process:
Before hitting Retro, the input prompt actually goes into BERT.
The output contextualized vectors are then averaged to construct a sentence embedding vector.
That vector is then used to query the database.
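A hedged sketch of that embedding step with Hugging Face transformers. bert-base-uncased and the prompt are my assumptions here, not necessarily the exact setup Retro uses; the point is the mean-pooling of the contextualized token vectors:

```python
# Sketch: embed the input prompt with BERT and average the token vectors
# into a single sentence embedding used to query the database.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the contextualized token vectors into one sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query_vector = sentence_embedding("An illustrative input prompt")
```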
7/n
That sentence embedding is then used in an approximate nearest neighbor search (using: github.com/google-researc…).
The two nearest neighbors are retrieved, and their text becomes a part of the input into Retro.
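Retro uses an approximate nearest-neighbor library for this (linked above). As a simpler stand-in, here's what the exact version of the lookup looks like, returning the two closest database entries:

```python
# Stand-in for the ANN lookup: exact nearest-neighbor search over the key matrix.
# (Retro uses an approximate library at 2T-token scale; this just shows the idea.)
import numpy as np

def retrieve_neighbors(query_vector, keys, values, k=2):
    keys = np.asarray(keys)                            # shape: (num_entries, dim)
    distances = np.linalg.norm(keys - query_vector, axis=1)
    top_k = np.argsort(distances)[:k]                  # indices of the k closest keys
    return [values[i] for i in top_k]                  # each value holds neighbor + completion
```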
8/n
This is now the input to Retro. The input prompt and its two nearest neighbors from the database (and their continuations).
From here, the Transformer and Retro Blocks incorporate the information into their processing.
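Roughly, the augmented input could be assembled like this, reusing the Value records from the store sketch above. The exact chunking and formatting Retro uses is more involved; this is just the gist:

```python
# Sketch: the retrieved neighbors and their continuations accompany the prompt
# into Retro (neighbors go through the encoder, the prompt through the decoder).
def build_retro_inputs(prompt, retrieved):
    neighbor_texts = [v.neighbor + " " + v.completion for v in retrieved]
    return {"decoder_input": prompt, "encoder_inputs": neighbor_texts}
```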
Architecture: An encoder stack and a decoder stack.
The 7.5B parameter model has 32 layers, so I'm thinking 16 in the encoder and 16 in the decoder (counting the parameters should verify this).
10/n
The encoder seems to be made of standard Transformer encoder blocks (self-attention + FFNN).
The decoder stack interleaves two kinds of decoder blocks:
- Decoder Block (Attn + FFNN)
- Retro Decoder Block (Attn + Chunked cross attention [CCA] + FFNN)
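A hedged PyTorch sketch of a Retro decoder block as described above. Plain cross-attention stands in for CCA, the causal mask is omitted, and the dimensions are arbitrary examples, not the paper's:

```python
# Sketch of a Retro decoder block: self-attention, then chunked cross-attention
# (CCA) over the encoded retrieved neighbors, then a feed-forward network.
import torch
import torch.nn as nn

class RetroDecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # stand-in for CCA
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, encoded_neighbors):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # Attend to the encoder's representation of the retrieved neighbors.
        x = x + self.cca(x, encoded_neighbors, encoded_neighbors, need_weights=False)[0]
        return x + self.ffnn(x)
```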
11/n
Correction: I now see the decoder is 32 layers.
Every third block starting from 9 is a Retro block (one that allows its input to attend to the neighbors). So 9, 12, 15 ... 32.
Decoder blocks only work on the input text. No enc-dec cross-attention in the model aside from CCA.
The biggest update is that forward diffusion is now more precisely explained -- not as a process of steps (which are easy to confuse with the de-noising steps), but as a way of creating training examples.
-1-
Forward diffusion is the process of making training examples: sample an image, sample noise, sample an amount of noise, and mix them to create a training example.
-2-
Do this with lots of images and lots of noise samples & amounts, and there's a training dataset for your model -- the noise prediction Unet.
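A minimal sketch of that example-creation step. The linear mixing schedule here is a simplification of the schedules actually used:

```python
# Sketch: forward diffusion as training-example creation.
# Sample an image, sample noise, sample a noise amount, mix them.
# The target the noise-prediction UNet learns is the sampled noise.
import torch

def make_training_example(image: torch.Tensor, num_levels: int = 1000):
    noise = torch.randn_like(image)             # sampled noise
    level = torch.randint(0, num_levels, (1,))  # sampled noise amount (timestep)
    alpha = 1.0 - level.float() / num_levels    # simplified linear schedule
    noisy_image = alpha.sqrt() * image + (1 - alpha).sqrt() * noise
    # Model input: (noisy_image, level); training target: noise
    return noisy_image, level, noise
```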
So @nickfrosst lives in the future of LLMs, it feels to me. We have these internal demo sessions, and what Nick builds and presents often feels magical.
What this means to me, more precisely, is the dexterity of problem-solving with LLM primitives that break a problem into a pipeline of components: 1) Regex 2) GPT + prompt X 3) pipe that into API Z 4) Embedding then similarity search 5) GPT + prompt Y
This is partly why I feel "Generative AI" is limited in describing the latest wave of what's possible with AI.
Representation (and by extension retrieval & classification) is just as important as Generation, but much more reliable in its results.
Over 30 visuals explaining how Stable Diffusion works (diffusion, latent diffusion, CLIP, and a lot more).
When generating an image with Stable Diffusion, it's useful to think of 3 main components in the process.
1- Text encoder, which translates words into numbers 2- Image information creator, which takes multiple steps refining the image information 3- Image decoder, which paints the final image
-2-
Diffusion is the process that takes place inside the pink “image information creator” component.
It's a step-by-step process that produces an information array that the image decoder uses to paint the final image.
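A hedged sketch of how those three components chain together. The callables are placeholders standing in for the CLIP text encoder, the UNet + scheduler denoising step, and the VAE image decoder:

```python
# Sketch of the three-component Stable Diffusion pipeline.
# text_encoder, unet_step, and image_decoder are placeholder callables.
import torch

def generate(prompt: str, text_encoder, unet_step, image_decoder, steps: int = 50):
    text_embeddings = text_encoder(prompt)        # 1- words -> numbers
    latents = torch.randn(1, 4, 64, 64)           # start from random image information
    for t in range(steps):                        # 2- step-by-step refinement (diffusion)
        latents = unet_step(latents, t, text_embeddings)
    return image_decoder(latents)                 # 3- paint the final image
```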
The intelligence of generative LLMs is surprising. But people often overestimate it in areas while underestimating it in others.
Key concepts: 1- It's important to augment their information with the right sources.
So it's not
Human: Question
GPT: Factual answer.
It's more
Human: question
System [step 1]
System [step 2]
System [step 3]
System: answer
2- To think of them as tools for surgically applying intelligence to subproblems, not as standalone intelligences themselves. Best if each module has its own tests and human verification of behavior.
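A toy sketch of such a multi-step system, with each step a small, individually testable module. The step callables are illustrative placeholders, not a specific library's API:

```python
# Toy sketch: a question goes through a pipeline of small steps instead of a
# single "GPT: factual answer" call. Each step can be tested and verified on its own.
def system(question: str, normalize_step, gather_step, answer_step) -> str:
    q = normalize_step(question)       # step 1: e.g. a simple regex cleanup
    passages = gather_step(q)          # step 2: e.g. embedding + similarity search
    return answer_step(q, passages)    # step 3: e.g. GPT + an answering prompt
```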
In txt.cohere.ai/building-a-sea…, we show a couple of the tools from the playbook of "surgical application of language AI".
- Rewriting a question to include previous conversation context
- Retrieving relevant information using web search
- Answering now becomes extraction and..
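A hedged sketch of those steps as code. The three callables are placeholders for the LLM prompts and the search API described in the post, not actual function names from it:

```python
# Sketch of the conversational-search flow: rewrite the question with prior
# conversation context, search the web for relevant passages, then answer by
# extracting from those passages.
def conversational_answer(history: list, question: str,
                          rewrite, web_search, extract) -> str:
    standalone = rewrite(history, question)  # fold conversation context into the question
    results = web_search(standalone)         # retrieve relevant passages from the web
    return extract(standalone, results)      # answering becomes extraction over the results
```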