Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how…🧵[1/7]
The technique is called chain-of-thought (CoT) prompting. It improves the reasoning abilities of LLMs using few-shot learning. In particular, CoT prompting inserts several examples of “chains of thought” for solving a reasoning problem into the LLM’s prompt. [2/7]
Here, a chain of thought is defined as “a coherent series of intermediate reasoning steps that lead to the final answer for a problem”. A CoT mimics how we solve reasoning problems as humans -- by breaking the problem down into intermediate steps that are easier to solve. [3/7]
Prior techniques teach LLMs how to generate coherent chains of thought via fine-tuning. Although this improves reasoning performance, such an approach requires an annotated dataset of reasoning problems with an associated CoT, which is burdensome and expensive to create. [4/7]
CoT prompting combines the idea of using chains of thought to improve reasoning performance with the few-shot learning abilities of LLMs. We can teach LLMs to generate a coherent CoT with their solution by just providing exemplars as part of their prompt. [5/7]
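To make this concrete, here is a minimal sketch of how a few-shot CoT prompt could be constructed (the exemplar is the classic tennis-ball problem from the CoT paper; the helper function and final question are just illustrative):

```python
# A minimal sketch of a chain-of-thought (CoT) prompt. The helper function and
# final question are hypothetical -- any LLM completion API could consume the prompt.
COT_EXEMPLAR = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11."""

def build_cot_prompt(question: str) -> str:
    # Few-shot CoT: prepend exemplar(s) that show the reasoning chain, then ask
    # the new question so the model imitates the step-by-step format.
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "A cafeteria had 23 apples. They used 20 and bought 6 more. How many are left?"
)
print(prompt)  # pass this prompt to any pretrained LLM -- no fine-tuning needed
```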
Such an approach massively improves LLM performance on tasks like arithmetic, commonsense and symbolic reasoning. Plus, it requires minimal data to be curated (i.e., just a few examples for the prompt) and performs no fine-tuning on the LLM. [6/7]
Put simply, CoT prompting is a lightweight prompting technique that can be applied to any pre-trained LLM checkpoint to improve reasoning performance. See the overview below for more details. [7/7]
LLM-as-a-Judge is one of the most widely-used techniques for evaluating LLM outputs, but how exactly should we implement LLM-as-a-Judge?
To answer this question, let’s look at a few widely-cited papers / blogs / tutorials, study their exact implementation of LLM-as-a-Judge, and try to find some useful patterns.
(1) Vicuna was one of the first projects to use an LLM as an evaluator. The approach differs depending on the problem being solved: separate prompts are written for i) general, ii) coding, and iii) math questions. Each domain-specific prompt adds some extra, relevant details compared to the vanilla prompt. For example:
- The coding prompt provides a list of desirable characteristics for a good solution.
- The math prompt asks the judge to first solve the question before generating a score.
Interestingly, the judge is given two model outputs within its prompt, but it is asked to score each output on a scale of 1-10 instead of just choosing the better output.
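As a rough illustration, a Vicuna-style judging prompt might look something like the sketch below (the exact wording is hypothetical, not Vicuna's actual prompt):

```python
# Hypothetical sketch of a Vicuna-style judge prompt: both outputs appear in a
# single prompt, and the judge scores each on a 1-10 scale rather than simply
# picking a winner. The wording below is illustrative only.
JUDGE_TEMPLATE = """[Question]
{question}

[Assistant 1's Answer]
{answer_1}

[Assistant 2's Answer]
{answer_2}

Please rate the helpfulness, relevance, accuracy, and level of detail of each
answer on a scale of 1 to 10. Output the two scores on the first line, separated
by a space, then explain your reasoning on the following lines."""

def build_judge_prompt(question: str, answer_1: str, answer_2: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, answer_1=answer_1, answer_2=answer_2
    )
```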
(2) AlpacaEval is one of the most widely-used LLM leaderboards, and it is entirely based on LLM-as-a-Judge! The current approach used by AlpacaEval is based upon GPT-4-Turbo and uses a very simple prompt that:
- Provides an instruction to the judge.
- Gives the judge two example responses to the instruction.
- Asks the judge to identify the better response based on human preferences.
Despite its simplicity, this strategy correlates very highly with human preference scores (i.e., 0.9+ Spearman correlation with Chatbot Arena).
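For illustration, an AlpacaEval-style win rate could be computed with a loop like the following sketch, where judge_prefers() is a hypothetical stand-in for the actual LLM judge call:

```python
# A rough sketch of an AlpacaEval-style evaluation loop. For each fixed prompt,
# the judge sees the baseline output and the candidate model's output and picks
# the one a human would prefer; the win rate is the fraction of prompts where
# the candidate is preferred. judge_prefers() is a hypothetical judge API.
def compute_win_rate(prompts, baseline_outputs, model_outputs, judge_prefers):
    wins = 0
    for prompt, base_out, model_out in zip(prompts, baseline_outputs, model_outputs):
        # judge_prefers(...) returns True if the judge picks the candidate model
        if judge_prefers(prompt, base_out, model_out):
            wins += 1
    return wins / len(prompts)
```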
(3) G-Eval was one of the first LLM-powered evaluation metrics that was shown to correlate well with human judgements. The key to success for this metric was to leverage a two-stage prompting approach. First, the LLM is given the task / instruction as input and asked to generate a sequence of steps that should be used to evaluate a solution to this task. This approach is called AutoCoT. Then, the LLM uses this reasoning strategy as input when generating an actual score, which is found to improve scoring accuracy!
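Here's a minimal sketch of this two-stage idea, assuming a generic llm(prompt) helper (hypothetical) that returns a text completion:

```python
# Sketch of G-Eval's two-stage approach. The llm(prompt) helper is a
# hypothetical stand-in for any LLM completion API.
def g_eval_score(task_description: str, criteria: str, output_to_score: str, llm):
    # Stage 1 (auto chain-of-thought): ask the LLM to write its own evaluation steps.
    steps = llm(
        f"Task: {task_description}\nEvaluation criteria: {criteria}\n"
        "Write a numbered list of steps for evaluating an output on this task."
    )
    # Stage 2 (form filling): feed those steps back in and request only a score.
    score = llm(
        f"Task: {task_description}\nEvaluation steps:\n{steps}\n"
        f"Output to evaluate:\n{output_to_score}\n"
        "Follow the steps above and respond with a single score from 1 to 5."
    )
    return score
```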
(4) The LLM-as-a-Judge paper itself uses a pretty simple prompting strategy to score model outputs. However, the model is also asked to provide an explanation for its scores. Generating such an explanation resembles a chain-of-thought prompting strategy and is found to improve scoring accuracy. Going further, several different prompting strategies–including both pointwise and pairwise prompts–are explored and found to be effective within this paper.
Key takeaways. From these examples, we can arrive at a few common takeaways / learnings:
- LLM judges are very good at identifying responses that humans prefer (likely a byproduct of the judges themselves being trained with RLHF).
- Creating specialized evaluation prompts for each domain / application is useful.
- Providing a scoring rubric or list of desirable properties for a good solution can be helpful to the LLM.
- Simple prompts can be extremely effective (don’t make it overly complicated!).
- Providing (or generating) a reference solution for complex problems (e.g., math) is useful.
- CoT prompting (in various forms) is helpful.
- Both pairwise and pointwise prompts are commonly used.
- Pairwise prompts can either i) ask for each output to be scored or ii) ask for the better output to be identified.
Some examples of pairwise, pointwise, and reference-guided prompts for LLM-as-a-Judge (as proposed in the original paper, which is where I found these figures) are shown in the image below for quick reference!
If you're not familiar with LLM-as-a-Judge in general, I wrote a long-form overview on this topic and the many papers that have been published on it. See below for more details!
Recently, I’ve done a ton of reading on LLM-as-a-judge techniques (i.e., using an LLM to evaluate the output of another LLM). Here’s a reference of the best papers in this space:
(1) Early research: Research on LLM evaluators began with the proposal of GPT-4, which was (arguably) the first LLM powerful enough to reliably evaluate output quality. At this time, several works explored the usage of LLMs as evaluators:
- Sparks of AGI [1]: This paper broadly studies the behavior of GPT-4, finding that the model excels at nearly all tasks that were considered. As part of this analysis, authors use GPT-4 to evaluate the similarity of a model’s output to a reference output. This is the first work (as far as I know) that attempts to use GPT-4 as a judge.
- Open LLMs: After the proposal of LLaMA, several imitation models followed. Of these imitation models, several of them (Vicuna, LIMA, Guanaco, Tulu, Orca, and more) use LLMs to evaluate the quality of model outputs relative to ChatGPT.
- AlpacaEval [7]: In a similar timeframe, the AlpacaEval metric was proposed. AlpacaEval uses a fixed set of ~800 prompts and generates an output for each prompt with a baseline model (GPT-4-Turbo) and a model being evaluated. Then, we prompt an LLM judge (GPT-4) to compare the quality of model outputs for each prompt, allowing us to automatically compute a win rate.
(2) More formal analysis: After initial explorations of LLM-as-a-judge, researchers began to formalize these techniques and analyze them more deeply. Such work revealed that this technique is powerful but subject to interesting biases that are hard to detect.
- G-Eval [8] uses a chain of thought approach to evaluate output quality. First, the LLM is asked to output a set of steps for evaluating output on a particular task. Then, the LLM ingests this evaluation framework and executes the evaluation via a form-filling paradigm (i.e., just generating the score as an output).
- LLMs as an alternative to human evaluation [9]: Authors do a formal study of the feasibility of using LLMs to replicate the human evaluation process, finding that the results of LLM evaluation are consistent with those of expert human evaluation for story generation and adversarial example generation tasks.
- LLM-as-a-judge [10]: Written by the creators of Vicuna, this paper formalizes the LLM-as-a-judge technique, proposing several setups for evaluating model outputs with an LLM. However, authors also reveal several biases of LLM evaluators, including position bias, verbosity bias, self-enhancement bias, and limited reasoning capability.
- LLMs are not fair evaluators [11]: This paper studies bias within LLM evaluations, focusing upon position bias in particular. The authors find that altering the position of model outputs within the judge's prompt can drastically change evaluation results, but this issue can be mitigated by swapping or randomizing the position of outputs within the prompt.
(3) Specialized evaluators: Although this post focuses upon LLM-as-a-judge techniques, a large amount of research has also been published on the topic of training specialized LLMs for evaluation. The most popular example of this is the Prometheus series of models [12, 13]. However, several other examples exist, such as JudgeLM [14] or PandaLM [15].
Although mixture-of-experts (MoEs) were initially applied to LSTM-based language models [1], the Switch Transformer [2] was one of the first papers to apply MoEs to the transformer. Here’s how it works…
Motivation. After the proposal of the sparsely-gated MoE in [1], adoption was hindered by the general complexity of MoEs, as well as issues like high communication costs and training instability. Authors in [2] propose an MoE-based encoder-decoder transformer architecture, called the Switch Transformer, that uses a simplified gating mechanism to make training more stable, thus making MoEs a more realistic and practical choice for language modeling applications.
MoE for transformers. To create an MoE encoder-decoder transformer, we can simply convert the feed-forward sub-layers of the model into MoE layers. The feed-forward transformation is applied in a pointwise fashion, meaning that each token is passed individually through the feed-forward network. For this reason, each token within the sequence is individually routed to its set of corresponding experts.
For example, each token in the sequence [“I”, “love”, “LLM”, “s”] is passed through the routing function, forming a probability distribution over experts. Then, we select the top-K experts for each individual token—tokens in the same sequence are not always sent to the same experts.
Better routing. In [1], the minimum number of active experts in any MoE layer was two—this was thought to be necessary to have non-trivial gradients in the routing function. In [2], authors propose routing each token to only a single expert, which is called a switch layer. This simplifies the routing function, reduces computational overhead, and lessens communication costs while improving the model’s performance.
The routing function used by the Switch Transformer is just a softmax gating mechanism. We pass each token vector through a linear layer that produces an output of size N (i.e., the number of experts), then apply a softmax transformation to convert this output into a probability distribution over experts. From here, we compute the output of the switch layer by:
1. Selecting a single expert. 2. Scaling the output of this expert by the probability assigned to that expert by the routing function.
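For illustration, here is a simplified PyTorch sketch of a switch layer. It loops over experts for readability; real implementations dispatch tokens to experts in parallel and enforce an expert capacity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    """Sketch of a switch (top-1) MoE layer; details simplified for clarity."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # softmax gating mechanism
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                              # x: [num_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)      # [num_tokens, num_experts]
        expert_prob, expert_idx = probs.max(dim=-1)    # route each token to one expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens routed to expert i
            if mask.any():
                # scale each expert output by its router probability
                out[mask] = expert_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out, probs, expert_idx
```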
Simple load balancing. In [1], authors employ multiple auxiliary loss functions to balance importance scores and perform load balancing between experts (i.e., meaning that each expert is sent a roughly equal number of tokens from the batch). We see in [2] that both of these objectives can be achieved with a single auxiliary loss function that is applied at each switch layer in the model. This loss encourages both the fraction of tokens allocated to each expert and the fraction of router probability allocated to each expert to be 1/N, meaning that experts are equally important and receive a balanced number of tokens.
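A sketch of this auxiliary loss (using the probs and expert_idx tensors from the routing sketch above; alpha is a tunable scaling coefficient):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(probs, expert_idx, num_experts, alpha=0.01):
    # f_i: fraction of tokens dispatched to each expert
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # P_i: fraction of router probability allocated to each expert
    P = probs.mean(dim=0)
    # the loss is minimized when both f and P are uniform (i.e., equal to 1/N)
    return alpha * num_experts * torch.sum(f * P)
```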
Capacity factor. Within the Switch Transformer, we set a global “expert capacity” variable that determines the maximum number of tokens that can be routed to each expert in any MoE layer. Each token is routed to the expert that is assigned the highest probability by the routing mechanism. If too many tokens (i.e., exceeding the expert capacity) are sent to a single expert, computation for these tokens will be skipped. These “dropped” tokens are passed directly to the next layer via the residual connection. Setting the capacity factor greater than one allows the MoE to handle imbalanced tokens across experts.
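In code, the expert capacity is typically derived from the capacity factor roughly as follows (a sketch; exact details vary by implementation):

```python
import math

# Sketch of how expert capacity is derived from the capacity factor.
def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    # e.g., 1024 tokens, 8 experts, capacity factor 1.25 -> each expert can
    # process up to 160 tokens; tokens beyond that are dropped (passed through
    # the residual connection instead).
    return math.ceil((tokens_per_batch / num_experts) * capacity_factor)
```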
——
Bibliography
[1] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
[2] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.
To gain a better understanding of the basic concepts behind MoEs, check out my overview of the topic below. The MoE layer is a key advancement in LLM research that powers popular LLMs like Grok-1 and (per rumors in the community) GPT-4.
I also recently wrote a more in-depth overview of the history of Mixture-of-Experts, which dates all the way back to the 1990s! See the image below for details.
Masked self-attention is the key building block that allows LLMs to learn rich relationships and patterns between the words of a sentence. Let’s build it together from scratch…
The big picture: Large language models are based upon a deep neural network architecture called a decoder-only transformer. Within each layer of this model, we have two key components:
1. Masked self-attention: learns relationships between tokens/words. 2. Feed-forward transformation: individually transforms the representation of each word.
These components are complementary—attention looks across the sequence, while feed-forward transformations consider each token individually. When combined together, they allow us to learn complex patterns from text that power the AI applications that are so popular today.
TL;DR: The input to an attention model is a list of token/word vectors, which can be stacked together to form a matrix. Causal self-attention operates by computing an attention/importance score between each pair of tokens/words in a sequence. Then, the output of self-attention is a weighted combination of all words in the sequence, where the weight is given by the attention score. We can break the process of masked self-attention into a sequence of five steps.
(1) Linear projections: The first step is to perform three separate linear projections, called the query, key, and value projections. Practically, these projections take our sequence of token vectors as input and produce three transformed sequences of token vectors as output.
(2) Attention scores: To compute attention scores, we use the query and key vectors produced by the linear projections described above. The attention score between the i-th token and the j-th token in the sequence is given by the dot product of the i-th query vector and the j-th key vector. To compute all of these pairwise scores efficiently, we can stack the query/key vectors into matrices and take the matrix product of the query matrix with the transposed key matrix. The output is a TxT attention matrix, where T is the length of the input sequence (in tokens). To improve training stability, we also divide the values of the attention matrix by the square root of the size of the token vectors (i.e., scaled dot product attention).
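In PyTorch, this step amounts to a single (scaled) matrix multiplication; the shapes below are placeholders:

```python
import torch

# Sketch of step (2): queries/keys have shape [T, d] after the linear projections.
T, d = 8, 64
q, k = torch.randn(T, d), torch.randn(T, d)
scores = (q @ k.T) / (d ** 0.5)   # [T, T] scaled dot-product attention scores
```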
(3) Forming a probability distribution: From here, we can turn the attention scores for each token into a probability distribution by performing a softmax operation across each token’s attention scores for the sequence. In practice, this is done via a softmax operation across each row of the attention matrix. After this, each row of the attention matrix becomes a probability distribution that represents the (normalized) attention scores for a single token across the sequence (i.e., the i-th row contains the i-th token’s attention scores).
(4) Masking operation: In vanilla self-attention, each token is allowed to compute attention scores for all tokens in the sequence. In masked self-attention, however, we mask attention scores for any token that follows a given token in the sequence. We can implement this by simply masking the attention matrix prior to performing the softmax (i.e., filling entries for any invalid attention scores with a value of negative infinity), such that the probability of any future token in the sequence becomes zero. For example, the i-th token in the sequence would have an attention score of 0 for tokens i + 1, i + 2, and so on. Practically, masked self-attention prevents us from looking forward in the sequence when computing a token's representation.
(5) Computing the output: From here, we can compute the output of masked self-attention by taking the matrix product of the attention matrix and a matrix of value vectors. This operation computes the output for the i-th token by taking a weighted combination of all value vectors, where the weights are given by token i’s attention scores.
An implementation of masked self-attention in PyTorch (derived from NanoGPT by Andrej Karpathy) is provided below. As we can see, the implementation of masked self-attention is easy to follow if we understand the concepts behind the computation!
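For reference, here is a minimal single-head sketch in the spirit of NanoGPT's implementation (dropout, multiple heads, and the output projection are omitted for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Minimal single-head masked self-attention (sketch)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)        # (1) query/key/value projections
        mask = torch.tril(torch.ones(max_len, max_len))   # lower-triangular causal mask
        self.register_buffer("mask", mask)

    def forward(self, x):                                 # x: [B, T, d_model]
        B, T, d = x.shape
        q, k, v = self.qkv(x).split(d, dim=-1)
        # (2) pairwise attention scores, scaled by sqrt(d)
        scores = (q @ k.transpose(-2, -1)) / (d ** 0.5)   # [B, T, T]
        # (4) mask out future tokens before the softmax
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        # (3) normalize each row into a probability distribution
        attn = F.softmax(scores, dim=-1)
        # (5) output = attention-weighted combination of value vectors
        return attn @ v                                   # [B, T, d_model]
```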
For a more detailed exposition of masked self-attention and its role in LLMs (or the transformer architecture in general), I'd recommend the overview that I recently published on the decoder-only transformer model. Check it out below!
New language models get released every day (Gemini-1.5, Gemma, Claude 3, potentially GPT-5 etc. etc.), but one component of LLMs has remained constant over the last few years—the decoder-only transformer architecture. This architecture has five components…
Why should we care? Research on LLMs moves fast. Shockingly, however, the architecture used by most modern LLMs is pretty similar to that of the original GPT model. We just make the model much larger, modify it slightly, and use a more extensive training (and alignment) process. For this reason, the decoder-only transformer architecture is one of the most fundamental/important ideas in AI research, so investing into understanding it deeply is a wise idea.
(1) Input layer: Decoder-only transformers receive a textual prompt as input. We use a tokenizer—based upon an algorithm like Byte-Pair Encoding (BPE)—to break this text into discrete tokens (i.e., words or sub-words). Then, we map each of these tokens to a corresponding vector stored in an embedding layer. This process forms a sequence of token vectors that are passed to the model as input. Optionally, we can augment these token vectors with additive positional embeddings.
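A minimal sketch of this input layer in PyTorch (the vocabulary size, dimensions, and token ids below are arbitrary placeholders; the token ids would come from a BPE tokenizer):

```python
import torch
import torch.nn as nn

# Sketch of the input layer: token ids are mapped to learned embeddings and
# combined with additive positional embeddings.
vocab_size, max_len, d_model = 32_000, 1024, 768
token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[15, 2047, 318, 1049]])   # [batch, seq_len] of token ids
positions = torch.arange(token_ids.shape[1])
x = token_emb(token_ids) + pos_emb(positions)       # [batch, seq_len, d_model]
```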
(2) Causal self-attention is the core of the decoder-only transformer and allows the model to learn from relationships between tokens in the input. The vanilla self-attention operation transforms each token’s representation by taking a weighted combination of other token representations, where weights are given by pairwise attention/importance scores between tokens. Causal self-attention follows a similar strategy but only computes attention scores for preceding tokens in the sequence. Attention is performed in parallel across several heads (i.e., multi-head attention), each of which can focus upon different parts of the input sequence.
(3) Feed-forward transformations are performed within each block of the decoder-only transformer, allowing us to individually transform each token’s representation. This feed-forward component is a small neural network that is applied in a pointwise manner to each token vector. Given a token vector as input, we pass this vector through a linear projection that increases its size by ~4X, apply a non-linear activation function (e.g., SwiGLU or GeLU), then perform another linear projection that restores the original size of the token vector.
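In PyTorch, this sub-layer is just a two-layer MLP applied to each token vector (a GeLU variant is sketched below; gated activations like SwiGLU change the exact structure slightly):

```python
import torch.nn as nn

# Sketch of the feed-forward sub-layer: expand by ~4x, apply a non-linearity,
# then project back down to the original token vector size.
d_model = 768
feed_forward = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expand
    nn.GELU(),                        # non-linear activation
    nn.Linear(4 * d_model, d_model),  # project back to original size
)
```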
(4) Classification head: The decoder-only transformer has one final classification head that takes token vectors from the transformer's final output layer as input and outputs a vector with the same size as the vocabulary of the model's tokenizer. This vector can be used to either train the LLM via next token prediction or generate text at inference time via decoding strategies like nucleus sampling and beam search.
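A sketch of the classification head and the associated next-token-prediction loss (names and shapes are placeholders):

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the classification head: a linear layer mapping token vectors to
# vocabulary-sized logits, trained with a next-token-prediction loss.
d_model, vocab_size = 768, 32_000
lm_head = nn.Linear(d_model, vocab_size)

def next_token_loss(hidden_states, target_ids):
    # hidden_states: [batch, seq_len, d_model], target_ids: [batch, seq_len]
    logits = lm_head(hidden_states)                       # [batch, seq_len, vocab]
    return F.cross_entropy(
        logits.view(-1, vocab_size), target_ids.view(-1)  # flatten for the loss
    )
```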
(5) Transformer blocks form the body of the decoder-only transformer architecture. The exact layout of the decoder-only transformer block may change depending upon the implementation, but two primary sub-layers are always present: 1. Masked (causal) self-attention. 2. A feed-forward transformation.
Additionally, these sub-layers are surrounded by a layer normalization module—either before or after the sub-layer (or both!)—as well as a residual connection.
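Putting these pieces together, a pre-norm decoder block might look like the sketch below (it reuses the MaskedSelfAttention module sketched earlier; in practice, attention is multi-headed and dropout is applied throughout):

```python
import torch.nn as nn

# A minimal pre-norm decoder block sketch that ties the pieces together.
class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = MaskedSelfAttention(d_model, max_len)  # sketched earlier
        self.ln_2 = nn.LayerNorm(d_model)
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # sub-layer 1: masked self-attention + residual
        x = x + self.ffnn(self.ln_2(x))  # sub-layer 2: feed-forward + residual
        return x
```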
I also wrote an explainer of the decoder-only transformer a long time ago, right around when I first began writing publicly-available content about deep learning. Check it out below.
I just published an in-depth walkthrough of the decoder-only architecture, all of its components, how we can implement them (in PyTorch), and how modern LLMs have modified this simple architecture to make it more effective. Check out the image below for more details!
There are a ton of different ways to finetune a language model. Here's a (brief) summary of language model finetuning, the various approaches that exist, their purpose, and what we know about how they work...
Finetuning techniques: The term “finetuning” simply refers to further training a pretrained model. In the case of LLMs, this means that we take a pretrained foundation model and train it some more. But, there are so many different ways that this training can be done, which makes the concept of finetuning incredibly vague. This single term can refer to a variety of different techniques, such as:
- Continued pretraining
- Instruction tuning
- Supervised finetuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO)
What is the goal of these techniques? For language models, there are two primary goals that a practitioner will have when performing finetuning:
1. Knowledge injection: teach the model how to leverage new sources of knowledge (not present during pretraining) when solving problems. 2. Alignment (or style/format specification): modify the way in which the language model surfaces its existing knowledge base; e.g., abide by a certain answer format, use a new style/tone of voice, avoid outputting incorrect information, and more.
Given this information, we might wonder: Which finetuning techniques should we use to accomplish either (or both) of these goals? To answer this question, we need to take a much deeper look at recent research on the topic of finetuning.
Large-scale instruction tuning: Prior to the release of modern open-source LLMs, it was very common to finetune pretrained LLMs on massive instruction tuning datasets. Such an approach was popularized by models like FLAN [1] (from Google), which perform instruction tuning of pretrained language models over large datasets. In the case of FLAN, for example, the FLANv2 instruction tuning dataset contains over 15M examples—very large! By following this approach, FLAN can learn to solve a large number of different downstream tasks in an efficient manner.
“We show that by training a model on these instructions it not only becomes good at solving the kinds of instructions it has seen during training but becomes good at following instructions in general.” - from FLAN paper [1]
Beyond knowledge injection: After the proposal of ChatGPT, we saw an increase in the desire to align language models and adapt their output format to a particular style or structure. Such a goal is drastically different than teaching an LLM to solve a new task. When we are trying to teach an LLM new knowledge, more data is always better (hence the large instruction tuning datasets used by models like FLAN). However, aligning the language model to a certain style or structure of output does not require learning new information! So, maybe alignment-focused goals require less extensive finetuning.
Less is more for alignment: Research on the topic of LLM finetuning was catalyzed by the release of LLaMA [2] (and later LLaMA-2 [3]), which made high-quality foundation LLMs openly available. Quickly after LLaMA, authors from Meta published LIMA [4], which showed that alignment-style finetuning can be accomplished with very little data. Namely, the goal of alignment is to adapt the LLM’s style (rather than to learn new information), which can be accomplished via a small, high-quality, and diverse finetuning dataset. Such findings revealed that most of an LLM’s knowledge comes from pretraining, and the LLM learns the correct style during alignment (see quote below).
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.” - from LIMA paper [4]
Imitating proprietary LLMs: Following LIMA, a massive number of high-quality, finetuned LLMs (e.g., Alpaca, Vicuna, Koala, Orca, and more) were created by finetuning LLaMA over small synthetic finetuning datasets of GPT-3.5/4 outputs. In this way, we could train these models to imitate the output of more powerful LLMs. When evaluated in human trials and on simplistic benchmarks, these models seemed to match (or exceed) the performance of powerful models like ChatGPT. For this reason, practitioners began to believe that we could surpass models like GPT-4 or ChatGPT by performing a small amount of (inexpensive) finetuning.
What is going on here? Obviously, training a model like ChatGPT cannot be done this easily. Researchers quickly found some limitations in the work done on imitation models [5]:
- Humans are easily tricked if the style of the LLM is good, and (as shown by LIMA) these models can quickly learn to mimic the style of models like ChatGPT with little data.
- The benchmarks that were used are too limited. The models perform well when evaluated by a small group of humans, but their performance falls apart on more extensive benchmarks that include traditional, perplexity-based evaluations (e.g., normal NLP benchmarks).
We can learn certain things (e.g., style and output format) from finetuning over a small amount of data, but we can’t learn everything! These imitation models lack the knowledge base of more powerful LLMs, which can only be learned from large amounts of data.
Putting everything together: Given all of the information we’ve covered so far, there are a few takeaways that we can deduce:
- Most knowledge from an LLM comes from pretraining.
- We can perform finetuning in the form of continued pretraining to expose the LLM to more (and new) data/knowledge.
- Alignment-focused objectives can be achieved via finetuning (SFT) on small, high-quality datasets. We don't need tons of data to learn the style or format of output; large amounts of data are only needed to learn new knowledge.
When performing finetuning, it's very important that we know which goal (alignment or knowledge injection) we are aiming for. Then, we should put benchmarks in place that allow us to accurately and comprehensively assess whether that goal was accomplished. Imitation models failed to do this, which led to a bunch of misleading claims/results!
Ongoing work: The story doesn’t stop here! In fact, the distinction between pretraining and finetuning is still quite vague. At what point does the LLM start actually learning new knowledge instead of just learning style/alignment? Many recent publications are continuing to study this question:
- Finetuning vs. RAG [6]: authors find that continued pretraining is not super effective at knowledge injection, while RAG is actually highly effective at specializing an LLM to a new knowledge base.
- LIMIT [7]: authors from MosaicML/Databricks show that we can perform finetuning over a small mixture of instruction tuning and alignment-focused data, leading to a model that performs well on both NLP benchmarks and style-focused evaluations.
- TULU [8]: authors subject finetuned LLMs to broader evaluations, finding that the quality of the base model has a massive impact on performance and that no one finetuning dataset/strategy yields the best results across all benchmarks.
- TULU-2 [9]: authors show that finetuning LLMs over specific datasets leads to the model learning specific skills and domains of data. Finetuning works well if we make sure the finetuning dataset is highly relevant to the style/domain of evaluation we are using.
- AlpaGasus [10]: authors directly study how much finetuning data is necessary for an LLM to perform well on various downstream tasks.
-------- Bibliography --------
[1] Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
[2] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
[3] Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
[4] Zhou, Chunting, et al. "Lima: Less is more for alignment." Advances in Neural Information Processing Systems 36 (2024).
[5] Gudibande, Arnav, et al. "The false promise of imitating proprietary llms." arXiv preprint arXiv:2305.15717 (2023).
[6] Ovadia, Oded, et al. "Fine-tuning or retrieval? comparing knowledge injection in llms." arXiv preprint arXiv:2312.05934 (2023).
[7] Jha, Aditi, et al. "LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms." arXiv preprint arXiv:2311.13133 (2023).
[8] Wang, Yizhong, et al. "How far can camels go? exploring the state of instruction tuning on open resources." Advances in Neural Information Processing Systems 36 (2024).
[9] Ivison, Hamish, et al. "Camels in a changing climate: Enhancing lm adaptation with tulu 2." arXiv preprint arXiv:2311.10702 (2023).
[10] Chen, Lichang, et al. "Alpagasus: Training a better alpaca with fewer data." arXiv preprint arXiv:2307.08701 (2023).
More interesting info on the superficial alignment hypothesis is provided below. This paper deeply studies the impact of alignment on LLMs and proposes a "tuning-free" approach to alignment (suggesting that alignment may require no finetuning at all).
Also, the blog post written by Databricks on the LIMIT paper does a great job of summarizing finetuning techniques and what we know about them. I'd highly recommend giving it a read if you're interested in diving deeper into the nuances of finetuning.