Cameron R. Wolfe, Ph.D.
Director of AI @RebuyEngine • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable
Mar 25
Although mixture-of-experts (MoEs) were initially applied to LSTM-based language models [1], the Switch Transformer [2] was one of the first papers to apply MoEs to the transformer. Here’s how it works…

Motivation. After the proposal of the sparsely-gated MoE in [1], adoption was hindered by the general complexity of MoEs, as well as issues like high communication costs and training instability. Authors in [2] propose an MoE-based encoder-decoder transformer architecture, called the Switch Transformer, that uses a simplified gating mechanism to make training more stable, thus making MoEs a more realistic and practical choice for language modeling applications.

MoE for transformers. To create an MoE encoder-decoder transformer, we can simply convert the feed-forward sub-layers of the model into MoE layers. The feed-forward transformation is applied in a pointwise fashion, meaning that each token is passed individually through the feed-forward network. For this reason, each token within the sequence is individually routed to its set of corresponding experts.

For example, each token in the sequence [“I”, “love”, “LLM”, “s”] is passed through the routing function, forming a probability distribution over experts. Then, we select the top-K experts for each individual token—tokens in the same sequence are not always sent to the same experts.

Better routing. In [1], the minimum number of active experts in any MoE layer was two—this was thought to be necessary to have non-trivial gradients in the routing function. In [2], authors propose routing each token to only a single expert, which is called a switch layer. This simplifies the routing function, reduces computational overhead, and lessens communication costs while improving the model’s performance.

The routing function used by the Switch Transformer is just a softmax gating mechanism. We pass each token vector through a linear layer that produces an output of size N (i.e., the number of experts), then apply a softmax transformation to convert this output into a probability distribution over experts. From here, we compute the output of the switch layer by:

1. Selecting a single expert.
2. Scaling the output of this expert by the probability assigned to that expert by the routing function.
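
To make this concrete, here is a minimal PyTorch sketch of a switch layer with top-1 routing. The module structure, expert sizes, and ReLU activation are illustrative choices, not the exact Switch Transformer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # one logit per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)  # probability distribution over experts
        top_prob, top_idx = probs.max(dim=-1)      # 1. select a single expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # 2. scale the selected expert's output by its routing probability
                out[mask] = expert(x[mask]) * top_prob[mask].unsqueeze(-1)
        return out
```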

Simple load balancing. In [1], authors employ multiple auxiliary loss functions to balance importance scores and perform load balancing between experts (i.e., each expert is sent a roughly equal number of tokens from the batch). We see in [2] that both of these objectives can be achieved with a single auxiliary loss function that is applied at each switch layer in the model. This loss encourages both the fraction of tokens allocated to each expert and the fraction of router probability allocated to each expert to be 1/N, meaning that experts are equally important and receive a balanced number of tokens.
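
Below is a rough sketch of this auxiliary loss, following the f · P formulation from [2]; the default α value here is illustrative.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) -- raw outputs of the routing layer
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f: fraction of tokens routed (top-1) to each expert
    f = F.one_hot(probs.argmax(dim=-1), num_experts).float().mean(dim=0)
    # P: average router probability assigned to each expert
    P = probs.mean(dim=0)
    # scaled so the loss is minimized when both f and P are uniform (i.e., 1/N per expert)
    return alpha * num_experts * torch.sum(f * P)
```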

Capacity factor. Within the Switch Transformer, we set a global “expert capacity” variable that determines the maximum number of tokens that can be routed to each expert in any MoE layer. Each token is routed to the expert that is assigned the highest probability by the routing mechanism. If too many tokens (i.e., exceeding the expert capacity) are sent to a single expert, computation for these tokens will be skipped. These “dropped” tokens are passed directly to the next layer via the residual connection. Setting the capacity factor greater than one gives each expert some buffer room, allowing the MoE to tolerate an imbalanced distribution of tokens across experts.
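
As a quick sketch, the capacity computation described in [2] looks roughly like this (the rounding choice is an assumption):

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    # each expert processes at most this many tokens per batch; overflow tokens are
    # "dropped" and passed to the next layer through the residual connection
    return math.ceil((tokens_per_batch / num_experts) * capacity_factor)

expert_capacity(1024, 8, 1.25)  # -> 160
```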

——
Bibliography
[1] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
[2] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.

To gain a better understanding of the basic concepts behind MoEs, check out my overview of the topic below. The MoE layer is a key advancement in LLM research that powers popular LLMs like Grok-1 and (per rumors in the community) GPT-4.

Mar 8
Masked self-attention is the key building block that allows LLMs to learn rich relationships and patterns between the words of a sentence. Let’s build it together from scratch…

The big picture: Large language models are based upon a deep neural network architecture called a decoder-only transformer. Within each layer of this model, we have two key components:

1. Masked self-attention: learns relationships between tokens/words.
2. Feed-forward transformation: individually transforms the representation of each word.

These components are complementary—attention looks across the sequence, while feed-forward transformations consider each token individually. When combined together, they allow us to learn complex patterns from text that power the AI applications that are so popular today.

TL;DR: The input to an attention model is a list of token/word vectors, which can be stacked together to form a matrix. Causal self-attention operates by computing an attention/importance score between each pair of tokens/words in a sequence. Then, the output of self-attention is a weighted combination of all words in the sequence, where the weight is given by the attention score. We can break the process of masked self-attention into a sequence of five steps.

(1) Linear projections: The first step is to perform three separate linear projections, called the query, key, and value projections. Practically, these projections take our sequence of token vectors as input and produce three transformed sequences of token vectors as output.

(2) Attention scores: To compute attention scores, we use the query and key vectors produced by the linear projections described above. The attention score between the i-th token and the j-th token in the sequence is given by the dot product of the i-th query vector and the j-th key vector. To compute all of these pairwise scores efficiently, we can stack the query/key vectors into matrices and take the matrix product of the query matrix with the transposed key matrix. The output is a TxT attention matrix, where T is the length of the input sequence (in tokens). To improve training stability, we also divide the values of the attention matrix by the square root of the size of the token vectors (i.e., scaled dot product attention).

(3) Forming a probability distribution: From here, we can turn the attention scores for each token into a probability distribution by performing a softmax operation across each token’s attention scores for the sequence. In practice, this is done via a softmax operation across each row of the attention matrix. After this, each row of the attention matrix becomes a probability distribution that represents the (normalized) attention scores for a single token across the sequence (i.e., the i-th row contains the i-th token’s attention scores).

(4) Masking operation: In vanilla self-attention, each token is allowed to compute attention scores for all tokens in the sequence. In masked self-attention, however, we mask attention scores for any token that follows a given token in the sequence. We can implement this by simply masking the attention matrix prior to performing the softmax (i.e., fill entries for any invalid attention scores with a value of negative infinity), such that the probability of any future token in the sequence becomes zero. For example, the i-th token in the sequence would have an attention score of 0 for tokens i + 1, i + 2, and so on. Practically, masked self-attention prevents us from looking forward in the sequence when computing a token’s representation.
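
Here is a minimal sketch of steps (2) through (4), i.e., scores, masking, and softmax, on toy tensor sizes; a real implementation batches this and uses multiple heads.

```python
import math
import torch
import torch.nn.functional as F

T, d = 4, 8                                   # toy sequence length and vector size
q, k = torch.randn(T, d), torch.randn(T, d)   # query/key vectors from step (1)

scores = (q @ k.T) / math.sqrt(d)             # (2) TxT scaled dot-product attention scores
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))  # (4) hide future tokens
attn = F.softmax(scores, dim=-1)              # (3) each row is a probability distribution
```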

(5) Computing the output: From here, we can compute the output of masked self-attention by taking the matrix product of the attention matrix and a matrix of value vectors. This operation computes the output for the i-th token by taking a weighted combination of all value vectors, where the weights are given by token i’s attention scores.

An implementation of masked self-attention in PyTorch (derived from NanoGPT by Andrej Karpathy) is provided below. As we can see, the implementation of masked self-attention is easy to follow if we understand the concepts behind the computation!

gist.github.com/wolfecameron/d…
Mar 4
New language models get released every day (Gemini-1.5, Gemma, Claude 3, potentially GPT-5 etc. etc.), but one component of LLMs has remained constant over the last few years—the decoder-only transformer architecture. This architecture has five components…

Why should we care? Research on LLMs moves fast. Shockingly, however, the architecture used by most modern LLMs is pretty similar to that of the original GPT model. We just make the model much larger, modify it slightly, and use a more extensive training (and alignment) process. For this reason, the decoder-only transformer architecture is one of the most fundamental/important ideas in AI research, so investing in understanding it deeply is a wise idea.

(1) Input layer: Decoder-only transformers receive a textual prompt as input. We use a tokenizer—based upon an algorithm like Byte-Pair Encoding (BPE)—to break this text into discrete tokens (i.e., words or sub-words). Then, we map each of these tokens to a corresponding vector stored in an embedding layer. This process forms a sequence of token vectors that are passed to the model as input. Optionally, we can augment these token vectors with additive positional embeddings.
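
As a rough sketch (the sizes and token IDs below are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50257, 768, 1024      # illustrative, GPT-2-style sizes
token_embedding = nn.Embedding(vocab_size, d_model)  # maps each token ID to a vector
position_embedding = nn.Embedding(max_len, d_model)  # optional additive positional embeddings

token_ids = torch.tensor([[101, 202, 303]])          # hypothetical IDs from a BPE tokenizer
positions = torch.arange(token_ids.shape[1])
x = token_embedding(token_ids) + position_embedding(positions)  # sequence of token vectors
```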

(2) Causal self-attention is the core of the decoder-only transformer and allows the model to learn from relationships between tokens in the input. The vanilla self-attention operation transforms each token’s representation by taking a weighted combination of other token representations, where weights are given by pairwise attention/importance scores between tokens. Causal self-attention follows a similar strategy but only computes attention scores for preceding tokens in the sequence. Attention is performed in parallel across several heads (i.e., multi-head attention), each of which can focus upon different parts of the input sequence.

(3) Feed-forward transformations are performed within each block of the decoder-only transformer, allowing us to individually transform each token’s representation. This feed-forward component is a small neural network that is applied in a pointwise manner to each token vector. Given a token vector as input, we pass this vector through a linear projection that increases its size by ~4X, apply a non-linear activation function (e.g., SwiGLU or GeLU), then perform another linear projection that restores the original size of the token vector.
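
A minimal sketch of this feed-forward sub-layer (using GELU here; gated variants like SwiGLU change the structure slightly):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand the token vector by ~4X
            nn.GELU(),                        # non-linear activation
            nn.Linear(4 * d_model, d_model),  # project back to the original size
        )

    def forward(self, x):  # x: (batch, seq_len, d_model), applied to each token independently
        return self.net(x)
```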

(4) Classification head: The decoder-only transformer has one final classification head that takes token vectors from the transformer’s final output layer as input and outputs a vector with the same size as the vocabulary of the model’s tokenizer. This vector can be used to either train the LLM via next token prediction or generate text at inference time via decoding strategies like nucleus sampling and beam search.

(5) Transformer blocks form the body of the decoder-only transformer architecture. The exact layout of the decoder-only transformer block may change depending upon the implementation, but two primary sub-layers are always present:

1. Causal self-attention
2. Feed-forward transformation

Additionally, these sub-layers are surrounded by a layer normalization module—either before or after the sub-layer (or both!)—as well as a residual connection.

I also wrote an explainer of the decoder-only transformer a long time ago, right around when I first began writing publicly-available content about deep learning. Check it out below.

Mar 1
There are a ton of different ways to finetune a language model. Here's a (brief) summary of language model finetuning, the various approaches that exist, their purpose, and what we know about how they work...

Finetuning techniques: The term “finetuning” simply refers to further training a pretrained model. In the case of LLMs, this means that we take a pretrained foundation model and train it some more. But, there are so many different ways that this training can be done, which makes the concept of finetuning incredibly vague. This single term can refer to a variety of different techniques, such as:

- Continued pretraining
- Instruction tuning
- Supervised finetuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO)

What is the goal of these techniques? For language models, there are two primary goals that a practitioner will have when performing finetuning:

1. Knowledge injection: teach the model how to leverage new sources of knowledge (not present during pretraining) when solving problems.
2. Alignment (or style/format specification): modify the way in which the language model surfaces its existing knowledge base; e.g., abide by a certain answer format, use a new style/tone of voice, avoid outputting incorrect information, and more.

Given this information, we might wonder: Which finetuning techniques should we use to accomplish either (or both) of these goals? To answer this question, we need to take a much deeper look at recent research on the topic of finetuning.

Large-scale instruction tuning: Prior to the release of modern open-source LLMs, it was very common to finetune pretrained LLMs on massive instruction tuning datasets. Such an approach was popularized by models like FLAN [1] (from Google), which perform instruction tuning of pretrained language models over large datasets. In the case of FLAN, for example, the FLANv2 instruction tuning dataset contains over 15M examples—very large! By following this approach, FLAN can learn to solve a large number of different downstream tasks in an efficient manner.

“We show that by training a model on these instructions it not only becomes good at solving the kinds of instructions it has seen during training but becomes good at following instructions in general.” - from FLAN paper [1]

Beyond knowledge injection: After the proposal of ChatGPT, we saw an increase in the desire to align language models and adapt their output format to a particular style or structure. Such a goal is drastically different than teaching an LLM to solve a new task. When we are trying to teach an LLM new knowledge, more data is always better (hence the large instruction tuning datasets used by models like FLAN). However, aligning the language model to a certain style or structure of output does not require learning new information! So, maybe alignment-focused goals require less extensive finetuning.

Less is more for alignment: Research on the topic of LLM finetuning was catalyzed by the release of LLaMA [2] (and later LLaMA-2 [3]), which made high-quality foundation LLMs openly available. Quickly after LLaMA, authors from Meta published LIMA [4], which showed that alignment-style finetuning can be accomplished with very little data. Namely, the goal of alignment is to adapt the LLM’s style (rather than to learn new information), which can be accomplished via a small, high-quality, and diverse finetuning dataset. Such findings revealed that most of an LLM’s knowledge comes from pretraining, and the LLM learns the correct style during alignment (see quote below).

“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.” - from LIMA paper [4]

Imitating proprietary LLMs: Following LIMA, a massive number of high-quality, finetuned LLMs (e.g., Alpaca, Vicuna, Koala, Orca, and more) were created by finetuning LLaMA over small synthetic finetuning datasets of GPT-3.5/4 outputs. In this way, we could train these models to imitate the output of more powerful LLMs. When evaluated in human trials and on simplistic benchmarks, these models seemed to match (or exceed) the performance of powerful models like ChatGPT. For this reason, practitioners began to believe that we could surpass models like GPT-4 or ChatGPT by performing a small amount of (inexpensive) finetuning.

What is going on here? Obviously, training a model like ChatGPT cannot be done this easily. Researchers quickly found some limitations in the work done on imitation models [5]:

- Humans are easily tricked if the style of the LLM is good, and (as shown by LIMA) these models can quickly learn to mimic the style of models like ChatGPT with little data.
- The benchmarks that were used are too limited. The models perform well when evaluated by a small group of humans, but their performance falls apart on more extensive benchmarks that include traditional, perplexity-based evaluations (e.g., normal NLP benchmarks).

We can learn certain things (e.g., style and output format) from finetuning over a small amount of data, but we can’t learn everything! These imitation models lack the knowledge base of more powerful LLMs, which can only be learned from large amounts of data.

Putting everything together: Given all of the information we’ve covered so far, there are a few takeaways that we can deduce:

- Most knowledge from an LLM comes from pretraining.
- We can perform finetuning in the form of continued pretraining to expose the LLM to more (and new) data/knowledge.
- Alignment-focused objectives can be achieved via finetuning (SFT) on small, high-quality datasets. We don’t need tons of data to learn a style or output format, only to learn new knowledge.

When performing finetuning, it’s very important that we know which goal—either alignment or knowledge injection—we are aiming for. Then, we should put benchmarks in place that allow us to accurately and comprehensively assess whether that goal was accomplished or not. Imitation models failed to do this, which led to a bunch of misleading claims/results!

Ongoing work: The story doesn’t stop here! In fact, the distinction between pretraining and finetuning is still quite vague. At what point does the LLM start actually learning new knowledge instead of just learning style/alignment? Many recent publications are continuing to study this question:

- Finetuning vs. RAG [6]: authors find that continued pretraining is not super effective at knowledge injection, while RAG is actually highly effective at specializing an LLM to a new knowledge base.
- LIMIT [7]: authors from MosaicML/Databricks show that we can perform finetuning over a small mixture of instruction tuning and alignment-focused data, leading to a model that performs well on both NLP benchmarks and style-focused evaluations.
- TULU [8]: authors subject finetuned LLMs to broader evaluations, finding that the quality of the base model has a massive impact on performance and that no one finetuning dataset/strategy yields the best results across all benchmarks.
- TULU-2 [9]: authors show that finetuning LLMs over specific datasets leads to the model learning specific skills and domains of data. Finetuning works well if we make sure the finetuning dataset is highly relevant to the style/domain of evaluation we are using.
- AlpaGasus [10]: authors directly study how much finetuning data is necessary for an LLM to perform well on various downstream tasks.

-------- Bibliography --------
[1] Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
[2] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
[3] Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
[4] Zhou, Chunting, et al. "Lima: Less is more for alignment." Advances in Neural Information Processing Systems 36 (2024).
[5] Gudibande, Arnav, et al. "The false promise of imitating proprietary llms." arXiv preprint arXiv:2305.15717 (2023).
[6] Ovadia, Oded, et al. "Fine-tuning or retrieval? comparing knowledge injection in llms." arXiv preprint arXiv:2312.05934 (2023).
[7] Jha, Aditi, et al. "LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms." arXiv preprint arXiv:2311.13133 (2023).
[8] Wang, Yizhong, et al. "How far can camels go? exploring the state of instruction tuning on open resources." Advances in Neural Information Processing Systems 36 (2024).
[9] Ivison, Hamish, et al. "Camels in a changing climate: Enhancing lm adaptation with tulu 2." arXiv preprint arXiv:2311.10702 (2023).
[10] Chen, Lichang, et al. "Alpagasus: Training a better alpaca with fewer data." arXiv preprint arXiv:2307.08701 (2023).

More interesting info on the superficial alignment hypothesis is provided below. This paper deeply studies the impact of alignment on LLMs and proposes a "tuning-free" approach to alignment (maybe alignment literally requires no finetuning).

Feb 27
Open-source LLMs have become a hot research topic in recent weeks with several new models being released from top research groups. Here’s a quick summary of recently-released models and their contributions…

(1) OLMo is a suite of 1B and 7B parameter LLMs that are pretrained on the Dolma corpus and released by AI2. Whereas many open-source LLMs vary in their definition of open, OLMo makes a commitment towards complete transparency. All details of the model architecture, its pretraining data, and the training process are outlined in two technical reports for both Dolma and OLMo. Going further, several artifacts are released (under a permissive license) for reproducing this work, including:

- The full pretraining dataset.
- A code toolkit for constructing (and modifying) the pretraining dataset.
- All training/adaptation/evaluation code.
- Training logs (in Weights & Biases).

OLMo models don’t set new state-of-the-art performance, but they are competitive with other models. Put simply, the purpose of OLMo is to provide a transparent view of how LLM pretraining works that others can easily build upon.

(2) Gemma is a recently released suite of open-source LLMs—including 2B and 7B models—from Google. These models were accompanied by an insightful technical report that reveals interesting architecture choices made by Gemma (e.g., normalizing every sub-layer of the transformer and using a massive vocabulary), as well as details the alignment strategy that is adopted. Although details about the training data are obfuscated, Gemma is trained over a 6T token corpus, which is much larger than prior open models (e.g., OLMo is trained on ~2T tokens). In benchmarks, Gemma compares quite favorably to other open LLMs, making Google (once again) a key player in the space of open-source AI/ML.

Notably, several practitioners have pointed out that, despite the strong performance of Gemma on benchmarks, the model lags behind top models like Mistral in applications. Put simply, Gemma fails to pass the “vibe check”, and its performance is not quite as impressive as benchmarks indicate. Nonetheless, Google will undoubtedly improve upon this model, and it’s amazing to see another massive company dipping their toes into the open-source LLM landscape.

(3) Mistral Large: Although the previously released Mistral-7B and Mixtral open-source LLMs are already incredibly popular, Mistral built upon this popularity with the recent release of Mistral Large. Although very few technical details are available about Mistral Large in the announcement, this model is an extension to the existing Mistral-Small/Medium models that are already available in the Mistral platform (called “La Plateforme”). Some notable details about Mistral Large include:

- Natively fluent in English, French, Spanish, German, and Italian (same as Mixtral).
- Has a 32K context window and precise information recall in this extended context (great for RAG!).
- Instruction following capabilities are improved compared to prior models.
- Possesses function calling capabilities (and a JSON format mode to ensure that valid JSON is outputted).
- Strong at coding and math benchmarks, as well as multilingual reasoning (comparable to LLaMA-2/GPT-3.5 on other tasks).

In addition to being available through the Mistral platform, Mistral Large is available via Microsoft Azure. The release of this model was accompanied by the announcement of a partnership between Mistral and Microsoft. This partnership has received some backlash given that Mistral has a well-known commitment to building open and independent technology (see below).

“Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders. This requires fierce independence, strong commitment to open, portable and customisable solutions, and an extreme focus on shipping the most advanced technology in limited time.” - Mistral Mission Statement

For more details on OLMo and Dolma, check out my prior post below. This model (and the information, tools, and data that come with it) is incredibly important for anyone looking to better understand the LLM pretraining process.

Feb 20
Most LLM research became proprietary after the proposal of ChatGPT. Open-source LLMs have been explored, but many important details are withheld and model usage may be restricted. Recently, however, the area of open-source LLMs changed drastically…

TL;DR: Recently, AI2 released Dolma and OLMo. Dolma is a pretraining corpus for LLMs, while OLMo is a suite of LLMs pretrained on Dolma. Although this might not sound notable, both Dolma and OLMo are fully open-source, meaning that authors from AI2 publicly released all information (and tools) necessary to reproduce Dolma and OLMo! As such, Dolma and OLMo provide a transparent (and detailed) view of how to successfully pretrain an LLM.

Details on Dolma. Dolma is a completely open pretraining dataset for LLMs with 3T English tokens. The process of creating this dataset is documented extensively in [1], where authors i) describe different filtering and curation choices for Dolma and ii) measure their impact on model performance. Such analysis is usually withheld from public view, but Dolma provides extensive insight into the process of creating a pretraining corpus. Additionally, tools are provided for further studying this topic and exploring new methods of data curation!

“We create a high-performance toolkit to facilitate efficient processing on hundreds of terabytes of text content. The toolkit is designed for high portability: it can run any platform from consumer hardware to a distributed cluster environment” - from [1]

Dolma artifacts. Artifacts released with Dolma include:

- The full pretraining dataset (3T tokens in total)
- A data toolkit for reproducing (and modifying) the dataset

Notably, the data toolkit can be run both locally and on a cluster (to support large-scale data curation jobs for other LLMs).

Details on OLMo. OLMo is a suite of decoder-only transformer models, including both 1B and 7B parameter models, that have been pretrained on a 2T token subset of Dolma. The exact architecture and process of training (and evaluating) these models is explained in great detail in [2]. Though the models perform competitively compared to other “open” LLMs, OLMo does not surpass state-of-the-art performance—this is not the model’s goal. Rather, the models are meant to provide a fully-documented starting point for LLM pretraining research.

OLMo artifacts. Artifacts released with OLMo include:

- Full model weights
- Inference/Pretraining/Adaptation/Evaluation Code
- Model checkpoints from every 1K training steps
- Training logs (from Weights & Biases)

Notably, a 65B parameter version of OLMo is still being trained and will be released some time in the (near) future.

Licensing. All OLMo model artifacts are released under an Apache 2.0 license, while Dolma is released under the AI2 impact license. Additionally, the reports for Dolma and OLMo [1, 2] provide a significant amount of detail compared to the reports of other “open” LLMs (e.g., Mistral or LLaMA-2). As such, Dolma/OLMo represent a drastic improvement in the usability and overall transparency of LLMs and how they are created!

————
[1] Soldaini, Luca, et al. "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." arXiv preprint arXiv:2402.00159 (2024).
[2] Groeneveld, Dirk, et al. "OLMo: Accelerating the Science of Language Models." arXiv preprint arXiv:2402.00838 (2024).

See below for links to the Dolma and OLMo papers. They're arguably the two most useful papers ever published if you're looking for information on how to pretrain your own LLM.

- Dolma: arxiv.org/abs/2402.00159
- OLMo: arxiv.org/abs/2402.00838
Feb 15
Having the ability to clearly explain fundamental concepts in AI to others is incredibly important. To explain large language models (LLMs), I use a simple three-part framework…

Why is this important? Given that most AI engineers/researchers work on teams with highly-technical members, they might not get a lot of opportunities to explain concepts like the transformer architecture or alignment to those who are non-technical. However, such an ability is incredibly important as political leaders are crafting legislation for AI and AI-powered tools are becoming more widely utilized in the broader public.

(1) Transformer: Modern language models are based upon the transformer architecture—a type of deep neural network that takes text as input and produces text as output. The original transformer has two components—an encoder and a decoder—but LLMs use a decoder-only architecture, which removes the encoder. This model takes a textual sequence as input and repeatedly performs two operations:

- Masked self-attention: each word looks at prior words in the sequence.
- Feed-forward transformation: each word is individually transformed.

Together, these two operations allow the transformer to learn meaningful relationships between words across an entire sequence of text to produce the correct textual output.

(2) Pretraining: All language models rely upon the next word/token prediction objective at their core. This objective is quite simple! Given a large corpus of text downloaded from the web, we just train the LLM by:

1. Sampling some text from the corpus.
2. Ingesting the text sequence with the decoder-only transformer.
3. Training the model to correctly predict each word in the sequence given preceding words as input.

This self-supervised objective works great for pretraining LLMs, as we can efficiently train the model over a large amount of unlabeled text, allowing the LLM to amass a large knowledge base.

Extra tip for Pretraining: Next word/token prediction is also used to generate text with an LLM. Starting with an input sequence (i.e., the prompt), we just continually predict the next word, add it to the input sequence, predict the next word, and so on.
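
A rough sketch of this loop (greedy decoding, assuming model(token_ids) returns logits of shape (batch, seq_len, vocab_size)):

```python
import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=20):
    for _ in range(max_new_tokens):
        logits = model(token_ids)                  # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)  # most likely next token (greedy)
        token_ids = torch.cat([token_ids, next_id.unsqueeze(-1)], dim=-1)  # append and repeat
    return token_ids                               # prompt + generated continuation
```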

(3) Alignment: Pretraining teaches the LLM to be really good at predicting the most likely next word, given preceding words as input. But, what we actually want is an LLM that produces interesting and useful output. For this, we need to align the model, or train it in a way that encourages it to generate outputs that better align with the desires of a human user. To do this, we use two finetuning techniques:

- Supervised finetuning (SFT): finetune the model on examples of desirable outputs.
- Reinforcement Learning from Human Feedback (RLHF): finetune the model on pairs of model outputs, where the “better” output is ranked by a human annotator.

Extra tip for Alignment: Typically, we define a set of alignment criteria (e.g., helpful, harmless, factual, etc.) to guide the alignment process. The alignment criteria are given to human annotators and capture the core properties of desirable output from the LLM. If you're interested in this topic, I'm giving a (very brief) presentation on it for the "AI in Production" virtual conference for the MLOps community! See below for more info. The presentation is today (Feb 15th) from 11:00 am - 11:10 am PST.
Feb 5
RAG is one of the best (and easiest) ways to specialize an LLM over your own data, but successfully applying RAG in practice involves more than just stitching together pretrained models…

What is RAG? At the highest level, RAG is a combination of a pretrained LLM with an external (searchable) knowledge base. At inference time, we can search for relevant textual context within this knowledge base and add it to the LLM’s prompt. Then, the LLM can use its in context learning abilities to leverage this added context and produce a more factual/grounded output.

Simple implementation. We can create a minimal RAG pipeline using a pretrained embedding model and LLM by:

1. Separating the knowledge base into fixed-size chunks.
2. Vectorizing each chunk with an embedding model.
3. Vectorizing the input/query at inference time and using vector search to find relevant chunks.
4. Adding relevant chunks into the LLM’s prompt.
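
For illustration, here is a rough sketch of these four steps, assuming a sentence-transformers embedding model and an llm(prompt) callable that wraps whatever LLM you use; the model name, file path, and chunk size are arbitrary choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def chunk(text: str, size: int = 500) -> list[str]:
    # 1. separate the knowledge base into fixed-size chunks (by characters, for simplicity)
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = chunk(open("knowledge_base.txt").read())              # hypothetical knowledge base
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # 2. vectorize each chunk

def rag_answer(query: str, llm, k: int = 3) -> str:
    # 3. vectorize the query and retrieve the top-k chunks via cosine similarity
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(doc_vecs @ q_vec))[:k]
    # 4. add the relevant chunks into the LLM's prompt
    context = "\n\n".join(docs[i] for i in top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```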

This simple approach works, but building a high-performing RAG application requires much more. Here are five avenues we can follow to refine our RAG pipeline.

(1) Hybrid Search: At the end of the day, the retrieval component of RAG is just a search engine. So, we can drastically improve retrieval by using ideas from search. For example, we can perform both lexical and vector retrieval (i.e., hybrid retrieval), as well as re-ranking via a cross-encoder to retrieve the most relevant data.

(2) Cleaning the data: The data used for RAG may come from several sources with different formats (e.g., pdf, markdown and more), which could lead to artifacts (e.g., logos, icons, special symbols, and code blocks) that could confuse the LLM. We can solve this by creating a data preprocessing or cleaning pipeline (either manually or by using LLM-as-a-judge) that properly standardizes, filters, and extracts data for RAG.

(3) Prompt engineering: Successfully applying RAG is not just a matter of retrieving the correct context—prompt engineering plays a massive role. Once we have the relevant data, we must craft a prompt that i) includes this context and ii) formats it in a way that elicits a grounded output from the LLM. First, we need an LLM with a sufficiently large context window. Then, we can adopt strategies like diversity and lost-in-the-middle selection to ensure the context is properly incorporated into the prompt.

(4) Evaluation: We must also implement repeatable and accurate evaluation pipelines for RAG that capture the performance of the whole system, as well as its individual components. We can evaluate the retrieval pipeline using typical search metrics (DCG and nDCG), while the generation component of RAG can be evaluated with an LLM-as-a-judge approach. To evaluate the full RAG pipeline, we can also leverage systems like RAGAS.

(5) Data collection: As soon as we deploy our RAG application, we should begin collecting data that can be used to improve the application. For example, we can finetune retrieval models over pairs of input queries with relevant textual chunks, finetune the LLM over high-quality outputs, or even run AB tests to quantitatively measure if changes to our RAG pipeline benefit performance.

What’s next? Beyond the ideas explored above, there are a variety of avenues that exist for improving RAG. Once we have implemented a robust evaluation suite, we can test a variety of improvements using both offline metrics and online AB tests. Our approach to RAG should mature (and improve!) over time as we test new ideas.

RAG was originally proposed in this paper:

However, language models have evolved a lot since then. For a more modern overview of RAG for LLMs, check out the survey below!

- Original RAG paper: arxiv.org/abs/2005.11401
- RAG survey: arxiv.org/abs/2312.10997
Feb 2
One of the most defining characteristics of large language models (LLMs) is their in context learning abilities, commonly defined as the ability to use information in the context window to generate better output. But, where do these abilities come from?

TL;DR: In context learning is an emergent ability of LLMs. We first discovered this ability with the proposal of GPT-3 and further refined the skill via modern finetuning and alignment techniques.

What is in context learning? In layman’s terms, in context learning refers to the ability of a single foundation LLM to solve a variety of different downstream tasks accurately by leveraging information provided in the prompt. For example, we can pass examples of the task being solved in the prompt (i.e., few-shot learning), or even provide a detailed instruction that describes how to solve the task.

(Phase 1) Where we started… Early variants of modern, decoder-only LLMs (e.g., GPT and GPT-2) did not usually solve problems via a prompting approach. Rather, we would either finetune the model to solve each task or perform zero-shot learning. In both cases, the model is not expected to significantly leverage information within its context window or prompt (other than a basic task description or input) and, therefore, is not relying upon in context learning.

(Phase 2) LLMs are few-shot learners… With the proposal of GPT-3, we saw for the first time that in context learning is an emergent ability of LLMs. In other words, these models naturally become better at performing in context learning as we increase the size (and amount of pretraining data) of the underlying model. In particular, GPT-3 was found to be highly effective at solving a wide range of tasks via few-shot learning.

(Phase 3) Improving LLM alignment… Although LLMs have emergent in context learning abilities, their ability to leverage information in the prompt is not perfect out of the box. Modern LLM applications—especially those that use retrieval augmented generation (RAG)—rely heavily upon injecting extensive knowledge into the model’s prompt. To improve the ability of these models to leverage this information (and follow instructions), we typically rely on the alignment process (via SFT and RLHF/DPO), as originally explored by InstructGPT.

Ongoing efforts: Even aligned LLMs have limitations in their ability to perform in context learning. For example, early variants of GPT-3.5-Turbo had warning messages for users claiming that the model has a tendency to ignore the system message! Put simply, teaching an LLM to better leverage information in its context is an ongoing process that requires ongoing finetuning and alignment for your application of choice.

Some of the notable papers mentioned in the post above include:
- GPT: s3-us-west-2.amazonaws.com/openai-assets/…
- GPT-2: d4mucfpksywv.cloudfront.net/better-languag…
- GPT-3: arxiv.org/abs/2005.14165
- InstructGPT: arxiv.org/abs/2203.02155
Jan 30
What’s the easiest way to specialize an LLM over your own data? Recent research has studied this problem in depth, and RAG is way more effective (and easier to implement) compared to extended pretraining or finetuning…

Knowledge from pretraining. A lot of factual information is inherently present within an LLM’s pretrained weights, but the knowledge possessed by these models is highly dependent upon the characteristics of their pretraining data. Unfortunately, this means that—at least in the current paradigm of LLMs—the knowledge base of these models is static (e.g., ChatGPT has a knowledge cutoff date) and may lack detailed information.

Knowledge injection. Given a pretrained LLM, there are two postprocessing techniques that we can use for injecting new data into the LLM’s knowledge base:

- Finetuning: continuing the model’s pretraining process over a smaller, domain-specialized corpus of new information.
- Retrieval Augmented Generation (RAG): modifying the LLM’s input query by retrieving relevant information that can be leveraged by the model via in-context learning to generate a more grounded/factual output.

The variant of finetuning referenced above is a continued pretraining style of finetuning, where a next token prediction objective is used to further train a pretrained model over a specialized corpus of text. In contrast, SFT and RLHF emphasize the quality of model responses rather than improving the LLM’s breadth of knowledge.

“Given some knowledge base in the form of a text corpus, what is the best way to teach a pre-trained model this knowledge?” - from [1]

Recent research. In [1], authors compare RAG and finetuning to determine the superior knowledge injection approach. The RAG setup uses vector search to retrieve relevant document chunks to include in the model’s prompt. Given a corpus of information, we can:

1. Divide this corpus into chunks of text.
2. Use an embedding model (e.g., bge-large-en) to generate a dense vector for each chunk of text.
3. Search for relevant chunks by embedding the model’s input and performing a vector search.
4. Add relevant chunks into the model’s prompt.

What do we learn? While finetuning does improve model performance, RAG consistently outperforms finetuning for the injection of both new and previously encountered knowledge. Put simply, LLMs struggle to learn new information through finetuning. Though finetuning does yield a benefit in performance relative to the base model, RAG has a significant advantage over finetuning. Combining RAG with finetuning—though effective in some cases—does not consistently benefit performance.

Finetuning with paraphrases. We can improve the performance of finetuning for knowledge injection by training the model over several different paraphrases of the same information. In order to teach an LLM new information via finetuning, we must repeat this information in numerous ways.

——
[1] Ovadia, Oded, et al. "Fine-tuning or retrieval? comparing knowledge injection in llms." arXiv preprint arXiv:2312.05934 (2023).

Here's a link to the actual paper for those interested in reading more deeply about this topic.

arxiv.org/abs/2312.05934
Jan 19
The impressive in-context learning abilities of LLMs has created the need for larger context windows. Recently, researchers discovered that we can easily extend the context window of a pretrained LLM with one simple trick (and no extra training)…

What is the context window? During pretraining, an LLM sees input sequences of a particular length. This choice of sequence length during pretraining becomes the model’s context length, or the maximum-length sequence of text that the model can process. Beyond this context length, the model may behave unpredictably and produce incorrect output.

Why do we need a large context window? Practitioners want the ability to pass more data into the LLM’s context window to enable more complex applications via approaches like few-shot learning (or even more complex prompting approaches like chain of thought prompting) and retrieval augmented generation (RAG). Although several long-context LLMs have been released (e.g., Claude 2.1 and GPT-4-Turbo), not all LLMs have been trained to support long context, and open-source LLMs tend to only support shorter contexts compared to their proprietary alternatives.

Extending the context window. To extend a pretrained LLM’s context window, we could finetune the model over examples of longer sequences, but such an approach may cause the model to overfit to specific examples of long sequences. Several approaches have been proposed for extending an LLM’s context window with no (or minimal) finetuning as well, including PI, CLEX, and YARN. Plus, commonly-used approaches like ALiBi and RoPE enable LLMs to handle longer inputs during inference than those seen during training.

Why can’t LLMs generalize to longer sequences? The key issue faced by LLMs in generalizing to longer context windows is related to out-of-distribution positional encodings, where the LLM is exposed to relative distances and token positions that exceed what was seen during training. We can easily address this issue by simply remapping unseen positions to positions that have been encountered during training.

“To address this, an intuitive and practical solution would be to remap the unseen relative positions to those encountered during the pretraining, thus extending the LLMs’ ability to handle longer contexts naturally.” - from Self Extend paper

Grouped Attention. In “LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning”, authors argue that LLMs have an inherent ability to handle long sequences that can be leveraged without extra training. We can use a FLOOR operation that performs integer division on position indices such that the maximum position index seen during inference does not exceed the model’s predefined context length. Although this can cause adjacent tokens to be assigned to the same position index, it still works well in practice because:

1. Precise token position is less important than relative ordering when trying to understand a sequence of text.
2. Short sequences of tokens tend to only have one valid ordering, so assigning them to the same position index has little practical impact.

Such an approach, called “grouped attention” because we group several tokens to the same position embedding, performs comparably to finetuning techniques for extending the context window and only requires minimal code modifications (i.e., only four extra lines in PyTorch).
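
A loose sketch of the FLOOR remapping, just to show the idea (these are not the paper's exact lines, and the sizes are illustrative):

```python
import torch

pretrained_context, group_size = 8, 4
positions = torch.arange(16)        # 0, 1, ..., 15 -- longer than the model saw in training
grouped = positions // group_size   # 0, 0, 0, 0, 1, 1, 1, 1, ... -- max index is now 3
assert grouped.max().item() < pretrained_context  # all indices fall within the trained range
```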

Self Extend. If we naively apply grouped attention, language modeling performance deteriorates slightly, as tokens throughout the entire sequence are mapped into groups that share the same position index. To solve this issue, we need to realize that neighboring tokens are most important when generating a token with an LLM. So, we can eliminate this performance degradation by:

1. Defining a neighborhood size of the most recent tokens over which normal attention is applied.
2. Using grouped attention for tokens that are further away within the sequence.

This one final trick forms the Self Extend technique, which can be used to increase the context length of any LLM at inference time without the need for finetuning.

Also, one of the main reasons that I love this paper is that it was written by authors from my alma mater (Rice University), as well as Texas A&M. This is a great example of how academic research can provide practical and useful techniques in the era of LLMs.
Jan 12
Generative large language models (LLMs) are based upon the decoder-only transformer architecture. Currently, these types of generative LLMs are incredibly popular. However, I use encoder-only architectures for 90% of use cases as a practitioner. Here’s why…

History of encoder-only models. The encoder-only transformer architecture was popularized by the proposal of BERT in 2018. At the time of its proposal, BERT set a new state-of-the-art performance on every natural language task that was considered in its publication. For this reason, BERT revolutionized research in natural language processing, replacing many domain-specific techniques with a single model that can solve nearly all tasks!

Encoder-only architecture. Although the original transformer architecture contains both an encoder and a decoder, BERT leverages an encoder-only architecture. The encoder-only architecture just contains several repeated layers of bidirectional self-attention and a feed-forward transformation, both followed by a residual connection and layer normalization. The original encoder-only BERT models that were proposed have the following sizes:

- BERT Base: 12 layers, 768-dimensional hidden representations, 12 attention heads in each self-attention module, and 110M parameters.
- BERT Large: 24 layers, 1024-dimensional hidden representations, 16 attention heads in each self-attention module, and 340M parameters.

Notably, BERT Base is the same size as the original GPT model. In other words, these models are significantly smaller (and therefore easier to manage/deploy!) compared to the generative LLMs that are popular today.

BERT pretraining. Similar to generative LLMs, BERT has an extensive pretraining process. Instead of next token prediction, however, we pretrain BERT using a Cloze objective, which randomly masks out words/tokens from the input and tries to predict them. Because BERT uses bidirectional self-attention (instead of masked self-attention, which is used by decoder-only models), the model can look at the entire sequence both before and after the masked token to make a prediction.

Using BERT in practice. To use BERT to solve a practical task, we simply finetune the model over task-specific data. In particular, BERT is very good at solving sentence and token-level classification tasks. Additionally, extensions of BERT (e.g., sBERT) can be used for semantic search, making BERT applicable to retrieval tasks as well. In general, finetuning BERT is easy/efficient and yields high performance even with small amounts of training data.
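
For example, here is a minimal sketch of finetuning BERT for sentence classification with the Hugging Face transformers library; the model name and labels are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["I love LLMs", "this movie was terrible"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])            # e.g., positive/negative sentiment

outputs = model(**batch, labels=labels)  # a classification head is placed on top of BERT
outputs.loss.backward()                  # plug in any optimizer to finetune from here
```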

What can’t we do? Encoder-only (BERT) models are small, use bidirectional self-attention, and can be easily fine-tuned to impressive performance. As such, finetuning BERT to solve classification tasks is oftentimes preferable to performing few-shot prompting via an LLM, assuming we have the ability to train models and a little bit of training data. However, encoder-only models cannot generate text, so we can only use them for solving discriminative tasks.

To learn more about the self-attention mechanism that is used by encoder-only transformer architectures, check out the post below. Bidirectional self-attention is the primary building block of encoder-only transformers!

Dec 11, 2023
Looking for something to talk to your family about while you’re home for the holidays? Why not give them a clear, accessible explanation of ChatGPT? Here’s a simple, three-part framework that you can use to explain generative language models to (almost) anyone…

TL;DR: We can explain ChatGPT pretty easily by focusing on three core ideas.

1. Transformer architecture: the neural network architecture used by LLMs.
2. Language model pretraining: the (initial) training process used by LLMs.
3. The alignment process: how we teach LLMs to behave to our liking.

Although AI researchers might know these techniques well, it is important that we know how to explain them in simple terms as well! AI is no longer just a research topic, but rather a topic of public interest.

Why is this important? Generative AI has now become a popular topic among both researchers and the general public. Now more than ever before, it is important that researchers and engineers (i.e., those building the technology) develop an ability to communicate the nuances of their creations to others. A failure to communicate the technical aspects of AI in an understandable and accessible manner could lead to widespread public skepticism (e.g., research on nuclear energy went down a comparable path) or the enactment of overly-restrictive legislation that hinders forward progress in our field.

(1) Transformers: Most recent generative language models are based upon the transformer architecture. Although the transformer was originally proposed with two modules (i.e., an encoder and a decoder), generative LLMs use a decoder-only variant of this architecture. This architecture takes as input a sequence of tokens (i.e., words or subwords) that have been embedded into a corresponding vector representation and transforms them via two repeated operations:

- Masked self-attention: looks at other tokens in the sequence (i.e., those that precede the current token).
- Feed-forward transformation: transforms each token representation individually.

These two operations each play a distinct and crucial role. By stacking several blocks of masked self-attention and feed-forward transformations on top of each other, we get the neural network architecture that is used by most generative LLMs today.

(2) Pretraining: Self-supervised learning refers to the idea of using signals that are already present in raw data to train a machine learning model. In the case of generative language models, the most commonly-used objective for self-supervised learning is next token prediction, also known as the standard language modeling objective. Interestingly, this objective—despite being quite simple to understand—is the core of all generative language models. To pretrain a generative language model, we first curate a large corpus of raw text (e.g., from books, the web, scientific publications, and much more) to use as a dataset. Starting from a randomly initialized model, we then pretrain the LLM by iteratively performing the following steps:

1. Sample a sequence of raw text from the dataset.
2. Pass this textual sequence through the decoder-only transformer.
3. Train the model to accurately predict the next token at each position within the sequence.

“Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.” - from InstructGPT

(3) Alignment: After pretraining, the LLM can accurately perform next token prediction, but its output is oftentimes repetitive and uninteresting. The alignment process teaches a language model how to generate text that aligns with the desires of a human user. To align a language model, we first define a set of alignment criteria (e.g., helpful and harmless). To instill each of these alignment criteria within the model, we perform finetuning via supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), which together form the three-step technique for alignment proposed by InstructGPT.

For more details on each of these three components, see the links in the replies.

For more details on language model pretraining with next token prediction, how it works, and how it is implemented, check out the post below.

Dec 6, 2023
I’ve spent the last ~5 years working on (and writing about) language models. The proposal of Google Gemini made me think about why I am so interested in these models. There are numerous reasons, but the allure of LLMs (at least for me) boils down to 3 core properties…

TL;DR: There are three properties of LLMs that (in my opinion) play the largest role in their success:
1. Their compatibility with self-supervised pretraining.
2. Their ability to solve many tasks at once.
3. Their ability to easily ingest multiple modalities of input.
When combined together, these three core properties largely explain the popularity and success of LLMs in modern AI research.

BERT proposal. My interest in language models began in 2018 with the proposal of BERT. Why was BERT so transformational? Well, I was previously a bit of a deep learning skeptic. In the language domain, I wasn’t convinced that deep learning provided us much of a benefit. However, BERT set a new SOTA performance on every notable NLP research task being studied at the time. You can’t argue with results like that! Almost overnight, deep learning (and language models) took over the domain of natural language processing.

Why are LLMs so popular? Obviously, the progression and popularization of language models did not end with BERT. Following this model, we got RoBERTa and T5, as well as a massive number of generative language models (e.g., ChatGPT and LLaMA) that became incredibly popular. Language models (both discriminative and generative variants) quickly became one of the most popular research topics within AI. There are several obvious reasons that language models have become so popular, such as their performance or ability to interface with humans through language. However, there are 3 properties in particular that (in my opinion) contribute most to their success.

(1) Compatibility with self-supervised learning: Without the ability to pretrain over raw/unlabeled text, language models would (arguably) be nothing. By pretraining over a ton of text from the internet, we can finetune these models to high performance on any downstream task at minimal cost. The exact objective used depends on the model (e.g., generative LLMs use next token prediction, while BERT-style models use Cloze), but language models are highly compatible with self-supervised learning objectives, which allows them to learn from large amounts of raw data. Without this, language models would not have their impressive knowledge base and ability to quickly solve tasks.

(2) Multi-tasking: Due to their size and massive pretraining corpus, modern language models (both BERT-style models and generative LLMs) have the ability to solve a shockingly large number of tasks at once. We can do this via in-context learning (e.g., prompting GPT-4 to solve a task) or via finetuning. Put simply, pretrained language models are a starting point for solving virtually any task. Whether we finetune, use prompting, or take another approach, adapting these models to solve a new task requires minimal effort, assuming that we have access to a pretrained model.
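
As a toy illustration of multi-tasking via in-context learning (the prompts and the `generate` helper below are hypothetical, not tied to any specific API): the same frozen model can be pointed at two unrelated tasks purely through its prompt.

```python
# one pretrained LLM, two unrelated tasks, zero weight updates
sentiment_prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
translation_prompt = (
    "Translate the sentence from English to French.\n"
    "English: The weather is nice today.\n"
    "French:"
)

# with any generative LLM (hypothetical helper), both tasks are just completions:
# for prompt in (sentiment_prompt, translation_prompt):
#     print(generate(model, prompt))
```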

(3) Multimodality: Finally, language models can accept pretty much any form of input! Namely, the transformer’s input is just a list of vectors. Any data that we can compress into a vector and model as a list/sequence can be used as input! For example, we can pass in image (patch) embeddings, audio embeddings, video embeddings, and much more. The transformer is an incredibly versatile architecture that we can use for ingesting and learning patterns in almost any kind of data.
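
A minimal sketch of this idea, assuming a hypothetical text embedding table and a simple linear projection of flattened image patches (roughly in the spirit of vision-language models, but not any specific model's code): once every modality is mapped to vectors of the same size, the transformer just sees one longer sequence.

```python
import torch
import torch.nn as nn

d_model = 256
text_embed = nn.Embedding(32000, d_model)      # token ids -> vectors
patch_proj = nn.Linear(16 * 16 * 3, d_model)   # flattened image patches -> vectors

token_ids = torch.randint(0, 32000, (1, 12))   # [batch, num_tokens]
patches = torch.randn(1, 9, 16 * 16 * 3)       # [batch, num_patches, patch_dim]

# map both modalities into the same vector space and concatenate into one sequence
sequence = torch.cat([patch_proj(patches), text_embed(token_ids)], dim=1)

# any standard transformer can now process the joint sequence
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=2,
)
output = encoder(sequence)  # [1, 21, 256]
```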

Putting it all together. To see all of these three ideas in action, check out one of the first papers that I wrote on language models. In this paper, I use a multi-modal BERT model to solve over 1,000 classification tasks at once. Not only is the model capable of doing this, but we see improved performance from learning all of these tasks together! Although this paper is nothing special (very early in my research career), it demonstrates how these three properties combine into something really cool. We use a pretrained language model to ingest multiple modalities of input data and solve thousands of different tasks with a single model.
Also, the thoughts behind why I like language models so much were partially inspired by my newsletter recently passing 10K subscribers. Thanks to everyone who reads it and gives me a reason to continue working on and thinking about these models!
Nov 29, 2023 4 tweets 4 min read
Reinforcement learning from human feedback (RLHF) is a major catalyst of the recent generative AI boom, as it enables language models to surpass human writing quality. RLHF makes this possible by improving the alignment process in three main ways...

What is RLHF? RLHF is a finetuning technique that can be used to align pretrained LLMs based on feedback from human users. Typically, RLHF is applied in tandem with SFT—the model obtained after pretraining and SFT serves as a starting point for RLHF—and follows three major steps:
1. Collect human comparisons: human feedback is collected prior to each round of RLHF. The dataset contains prompts, several (LLM-generated) responses to each prompt, and a ranking of these responses based on human preference.
2. Train a reward model: a reward model is trained over the dataset of human comparisons to accurately predict/automate human preference scores (a minimal sketch of this ranking loss follows the list).
3. Optimize a policy according to the reward model: the policy (i.e., the LLM) is finetuned using reinforcement learning—PPO in particular—to maximize human preference scores generated by the reward model.
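
Here is a minimal sketch of step 2's training signal, a standard pairwise (Bradley-Terry-style) ranking loss; the tiny linear reward model and random tensors below are hypothetical stand-ins, not the architecture from any specific paper. The model is trained so that the response humans preferred receives a higher scalar reward than the rejected one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical stand-in: maps a pooled response representation to a scalar reward
reward_model = nn.Linear(768, 1)

# pooled hidden states for the preferred ("chosen") and rejected responses
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)

r_chosen = reward_model(chosen)      # [8, 1]
r_rejected = reward_model(rejected)  # [8, 1]

# pairwise ranking loss: push the chosen reward above the rejected reward
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```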

What’s the alternative? Prior to RLHF, most LLMs (including both generative LLMs like GPT-2 and models like BERT or T5) were trained using supervised transfer learning. Models are first pretrained over a large textual corpus (using self-supervised learning), then finetuned (in a supervised manner) to solve a particular task. This approach works well for a variety of tasks, but it falls short on teaching models to generate text. RLHF improves upon this approach in several ways, as explained below.

(1) Human annotation process. RLHF generates responses to prompts automatically via an LLM and simply asks the human annotator to rank several responses to the same prompt, rather than requiring the annotator to write a response manually (as in supervised learning). Because ranking outputs is a much easier task than writing outputs from scratch, the annotation strategy for RLHF lessens the cognitive load of human annotators, which leads to several notable benefits:
- Annotations are higher quality (i.e., more accurate).
- Collecting annotations is faster and more efficient.
- Individual annotations can be focused on a particular alignment principle.

“During annotation, the model has the potential to venture into writing trajectories that even the best annotators may not chart. Nonetheless, humans can still provide valuable feedback when comparing two answers, beyond their own writing competencies.” - from LLaMA-2 paper

(2) Beyond human quality. All responses used for collecting comparison data within RLHF are generated automatically by an LLM. This means that RLHF can train an LLM over responses that go beyond the writing capabilities of human annotators and, therefore, has the potential to surpass human performance. In comparison, supervised learning is constrained to the quality of responses that are manually written by human annotators.

(3) Accurately modeling response quality. The reward model created by RLHF is surprisingly accurate at judging the quality of an LLM’s response. Compared to automatic metrics like ROUGE, reward models provide a more consistent and accurate evaluation of model output quality, as judged by the agreement rate with human annotators. The ability of the reward model to accurately quantify response quality makes RLHF incredibly effective, as the LLM is finetuned to maximize scores provided by the reward model.
For more info on RLHF in general, check out my prior post on this topic, which outlines the history of RLHF and the role that it played in the recent generative AI boom.
Nov 14, 2023 4 tweets 4 min read
I just wrote a long-form overview of RLHF, its origins/motivation, and the impact it has had on the generative AI movement. My conclusion? RLHF is (arguably) the key advancement that made modern generative LLMs possible. Here’s why…

TL;DR: Prior to RLHF, we primarily relied upon supervised learning to train generative LLMs. This requires a difficult annotation process (writing example responses from scratch), and the training objective doesn’t allow us to directly optimize the model based on output quality. RLHF allows us to directly learn from human feedback, making alignment easier and more effective.

How are generative LLMs trained? Most generative LLMs are trained via a pipeline that includes pretraining, SFT, RLHF, and (maybe) some additional finetuning. Using this pipeline, we build an expansive knowledge base within the LLM (handled by pretraining), then align the model to surface this knowledge in a manner that's preferable to humans.

Before RLHF. The RLHF component of LLM training is a (relatively) new addition. Prior models (e.g., BERT or T5) underwent a simpler transfer learning pipeline with pretraining and (supervised) finetuning. This works well for discriminative tasks like classification, but it falls short for teaching models to properly generate interesting/useful text. We need something more!

Limitations of supervised learning. Training language models to generate text in a supervised manner requires us to manually write examples of desirable outputs and train the model over these examples. This approach has a fundamental limitation—there is a misalignment between the supervised training objective and what we actually want! We train the model to maximize the probability of human written generations, but what we want is a model that produces high-quality outputs.
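
To state the misalignment in symbols (a standard formulation, not notation taken from any specific paper): supervised finetuning maximizes the likelihood of a human-written reference response y*, while what we actually want is to maximize some measure of quality r(x, y) over responses y that the model itself generates:

\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t} \log p_\theta\big(y^{*}_{t} \mid x,\, y^{*}_{<t}\big)
\qquad \text{vs.} \qquad
\max_{\theta} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]

The first objective never looks at anything the model samples on its own, which is exactly the gap that RLHF closes by (approximately) optimizing the second objective with a learned reward model.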

The role of RLHF. With RLHF, humans simply identify which outputs from the LLM they prefer, and RLHF will finetune the model based on this feedback. We do this via the following steps (a minimal sketch of the resulting objective appears after the list):

1. Ask humans to rank LLM outputs based on their preference.
2. Train a reward model to predict human preference scores.
3. Optimize the LLM with a reinforcement learning algorithm (like PPO) to maximize human preference.

RLHF makes alignment simple and allows us to create more useful/helpful/safe AI systems.
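
Here is a minimal sketch of what step 3 optimizes, assuming we already have a trained reward model plus log-probabilities from the current policy and from a frozen reference (SFT) model; all tensors below are hypothetical stand-ins, and the final line is a naive policy-gradient surrogate rather than full PPO.

```python
import torch

beta = 0.1  # strength of the KL-style penalty (hypothetical value)

# hypothetical per-sequence quantities for a batch of sampled responses
reward = torch.randn(8)          # scalar scores from the reward model
logp_policy = torch.randn(8)     # log prob of each response under the current policy
logp_reference = torch.randn(8)  # log prob under the frozen SFT/reference model

# shaped reward: maximize reward while staying close to the reference model
shaped_reward = reward - beta * (logp_policy - logp_reference)

# a full implementation would feed `shaped_reward` into PPO; as a naive
# illustration, a simple policy-gradient-style surrogate looks like:
loss = -(shaped_reward.detach() * logp_policy).mean()
```

The beta term keeps the finetuned policy from drifting too far from the reference model while it chases higher reward.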

Empirical results. In the context of generative LLMs, RLHF was first used in the abstractive summarization and alignment domains, where we see that it allows us to train smaller LLMs to outperform models that are 10X their size. Work on InstructGPT demonstrated that RLHF is an impactful alignment technique that makes LLMs better at following instructions, obeying constraints, avoiding harmful output and more. RLHF made alignment significantly easier and more effective, ultimately leading to the proposal of models like ChatGPT.

Modern variants. Despite the massive impact of RLHF, recent research is still trying to make it even better, leading to modified variants of RLHF and entirely new algorithms for LLM alignment. Notable examples include techniques like Safe RLHF, Pairwise PPO, RLAIF, direct preference optimization (DPO), and more.
Here are two papers that initially explored RLHF w/ LLMs:

- Summarization w/ RLHF: arxiv.org/abs/2009.01325
- InstructGPT: arxiv.org/abs/2203.02155

InstructGPT is the sister model / predecessor to ChatGPT and a great resource for understanding how OpenAI models are trained.
Oct 4, 2023 4 tweets 4 min read
Language models contain extensive knowledge that can be extracted via prompting. But, does this come from direct exposure to identical info during pretraining, or does the model extract relevant knowledge from the data it encounters? Recent research might have the answer…

The Physics of LLMs. A recent series of papers, called “The Physics of Language Models”, studies the ability of LLMs to store, extract and manipulate knowledge. These papers were released in two parts, and they use open-source LLMs (e.g., LLaMA) and synthetic biography datasets to analyze the properties of LLMs.

TL;DR: The major findings of this work are that i) LLMs store information during pretraining (injecting more information during finetuning does not work), ii) this information can be retrieved reliably, iii) the LLM is not capable of manipulating/transforming this information in a complex manner (at least without CoT prompting), and iv) LLMs struggle to solve inverse reasoning problems (i.e., the reversal curse).

Storing and extracting knowledge. The first paper studies the ability of LLMs to store and extract information. An open-source LLM is first pretrained over a synthetic dataset of biographical information. Then, authors perform fine-tuning analysis to answer the question posed below.

“After training a language model on the biography dataset, can the model be finetuned to extract the knowledge to answer questions like `Where is the birth city of [name]` or `What did [name] study?`, and if so, how does the model achieve so?”

LLMs are not databases. From this analysis, we learn that language models cannot be finetuned to extract relevant knowledge if this knowledge is not stored properly within the LLM during pretraining. However, this problem can be solved by adding further data augmentation to the pretraining process. Such analysis solidifies findings from LIMA that indicate all knowledge is learned during pretraining.

Manipulating knowledge. Going further, the next paper in this series studies the ability of an LLM to perform four different (and distinct) styles of knowledge manipulation:

1. Retrieval: “What is person A’s attribute X?”
2. Classification: “Is A’s attribute X even or odd?”
3. Comparison: “Is A greater than B in attribute X?”
4. Inverse Search: “Which person’s attribute X equals T?”

Using another synthetic dataset of biographical information, authors find that LLMs can accurately perform retrieval. Classification and comparison can only be performed with added CoT prompting, while inverse search is not performed accurately even with sophisticated prompting.

What do we learn? The major finding from this work is that LLMs are not good at manipulating knowledge beyond simple operations (e.g., retrieval). Such a finding explains why CoT prompting is so important—it allows us to break tasks that require complex knowledge manipulation into smaller, solvable steps.
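
As a toy illustration of that point (hypothetical prompts with a made-up name, not the exact probes used in the papers): the direct question asks the model to retrieve and classify in one shot, while the CoT version breaks the manipulation into a retrieval step followed by a reasoning step.

```python
# direct prompting: retrieve the attribute and classify it in a single step
direct_prompt = (
    "Question: Is Alice Smith's birth month an even number?\n"
    "Answer:"
)

# chain-of-thought prompting: retrieve first, then reason over the retrieved value
cot_prompt = (
    "Question: Is Alice Smith's birth month an even number?\n"
    "Let's think step by step. First, recall Alice Smith's birth month. "
    "Then, decide whether that month's number is even or odd.\n"
    "Answer:"
)
```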

Here are links to the relevant papers:

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction: arxiv.org/abs/2309.14316

Physics of Language Models: Part 3.2, Knowledge Manipulation: arxiv.org/abs/2309.14402

Thanks @ZeyuanAllenZhu for the awesome papers/figures!
Jul 14, 2023 4 tweets 2 min read
Recently proposed open-source language models have placed an emphasis upon inference speed. Such work has shown us that inference speed can be improved by up to 5X (or more) by making some changes to the decoder-only transformer architecture. Here are three examples that have…
For more information on flash attention, check out the paper!

arxiv.org/abs/2205.14135
Jun 19, 2023 4 tweets 3 min read
The LLaMA suite of large language models (LLMs) led to a surge in publications on the topic of open-source LLMs. Many of these works adopted an imitation approach, in which less powerful LLMs were fine-tuned on ChatGPT dialogues. These imitation models seemed to perform… “The False Promise of Imitating Proprietary LLMs” is full of incredibly useful information and is a must-read for researchers in the LLM space. Primarily, this paper made me rethink how to properly evaluate LLMs and the systems built around them.

arxiv.org/abs/2305.15717
Jun 16, 2023 4 tweets 2 min read
Next-token prediction is the workhorse behind all modern advancements in large language models (LLMs) due to its use in training these models over unlabeled text. But, how exactly does this next-token prediction (or language modeling) objective work? Let’s take a deeper look… For those confused why the mathematical expression for next-token prediction uses log probabilities, check out this discussion.

chrispiech.github.io/probabilityFor…
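
For quick intuition (a standard identity, not notation from the linked discussion): the probability a model assigns to a sequence factorizes into a product of per-token probabilities, and taking the log turns that product into a sum, which is numerically stable and easy to differentiate:

\log p_\theta(x_1, \dots, x_T) \;=\; \log \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}) \;=\; \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

The next-token prediction loss is just the negative of this sum, i.e., the negative log-likelihood of the training text.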
Jun 14, 2023 4 tweets 3 min read
In the wake of LLaMA, the deep learning research community quickly adopted the view that open-source LLMs will rule the future—reproducing open-source variants of proprietary models seemed to be easy and cheap. Is this the truth? Here’s a brief timeline of model proposals and… Find the original LLaMA publication here: arxiv.org/abs/2302.13971