Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how… 🧵 [1/7]
The technique is called chain-of-thought (CoT) prompting. It improves the reasoning abilities of LLMs using few-shot learning. In particular, CoT prompting inserts several examples of “chains of thought” for solving a reasoning problem into the LLM’s prompt. [2/7]
Here, a chain of thought is defined as “a coherent series of intermediate reasoning steps that lead to the final answer for a problem”. A CoT mimics how we solve reasoning problems as humans -- by breaking the problem down into intermediate steps that are easier to solve. [3/7]
Prior techniques teach LLMs how to generate coherent chains of thought via fine-tuning. Although this improves reasoning performance, such an approach requires an annotated dataset of reasoning problems with an associated CoT, which is burdensome and expensive to create. [4/7]
CoT prompting combines the idea of using chains of thought to improve reasoning performance with the few-shot learning abilities of LLMs. We can teach LLMs to generate a coherent CoT with their solution by just providing exemplars as part of their prompt. [5/7]
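To make this concrete, here’s a minimal sketch of what a CoT prompt might look like. The exemplar is illustrative (written in the style of the paper’s arithmetic examples), and the helper function is just one possible way to assemble the prompt:

```python
# Minimal sketch of a chain-of-thought prompt. The exemplar below is
# illustrative; any few-shot example that spells out intermediate
# reasoning steps works the same way.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend one (or more) CoT exemplars to the target question."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
# `prompt` is then passed to any pre-trained LLM; no fine-tuning required.
```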
Such an approach massively improves LLM performance on tasks like arithmetic, commonsense, and symbolic reasoning. Plus, it requires minimal data curation (i.e., just a few examples for the prompt) and no fine-tuning of the LLM. [6/7]
Put simply, CoT prompting is a lightweight technique that can be applied to any pre-trained LLM checkpoint to improve reasoning performance. See the overview below for more details. [7/7]
Can large language models (LLMs) train themselves? Recent research indicates that the answer might be yes… 🧵 [1/7]
But what exactly do we mean by this? One notable approach uses LLMs to generate data for instruction tuning of other LLMs. Typically, a larger, more powerful model is used for generation. [2/7]
This technique was pioneered by the self-instruct framework. Beginning with a small set of initial tasks (including one instruction and one input-output example per task), self-instruct uses LLMs to generate more data for instruction tuning. [3/7]
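At a high level, the loop looks something like the sketch below. Everything here is a simplified approximation of the framework: `llm` stands in for a call to a strong generator model, and the real pipeline also filters out duplicates and low-quality generations:

```python
import random

# Hypothetical sketch of the self-instruct loop; `llm` stands in for any
# call to a strong LLM (e.g., an API call), not a real library function.

def self_instruct(seed_tasks: list[dict], llm, num_rounds: int = 100) -> list[dict]:
    """Grow an instruction-tuning dataset from a small pool of seed tasks."""
    pool = list(seed_tasks)  # each task: {"instruction": ..., "example": ...}
    for _ in range(num_rounds):
        # 1. Sample a few existing tasks as in-context exemplars.
        exemplars = random.sample(pool, k=min(4, len(pool)))
        # 2. Ask the LLM to propose a brand-new instruction.
        new_instruction = llm(
            f"Here are some tasks: {exemplars}\nWrite a new, different task:"
        )
        # 3. Ask the LLM to produce an input-output example for it.
        new_example = llm(
            f"Instruction: {new_instruction}\nGive one input and its output:"
        )
        # 4. (The real framework filters near-duplicate and low-quality
        #    generations before adding them back to the pool.)
        pool.append({"instruction": new_instruction, "example": new_example})
    return pool
```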
Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]
First of all, what is a decoder-only architecture? Well, it is exactly what it sounds like: a transformer architecture with the encoder removed. See the tweet below for more details. [2/8]
Decoder-only architectures use masked self-attention in each of their layers, meaning that each token considers only preceding tokens during the computation of self-attention. [3/8]
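For intuition, here’s a minimal (single-head, unbatched) sketch of masked self-attention in PyTorch. Production implementations are batched, multi-headed, and heavily optimized, but the causal mask works the same way:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of masked (causal) self-attention for a single head.
# q, k, v: [seq_len, d] tensors; real implementations are batched/multi-head.
def causal_self_attention(q, k, v):
    seq_len, d = q.shape
    scores = q @ k.T / d**0.5  # [seq_len, seq_len] attention scores
    # Upper-triangular mask: position i may not attend to positions j > i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    # Each output row is a weighted sum over tokens at or before position i.
    return F.softmax(scores, dim=-1) @ v
```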
The new Stable Diffusion XL (SDXL) model is amazing, but I think there is considerable work to be done before prompt-based, generative image models reach their true potential. In particular, we need to fix one major problem… 🧵 [1/7]
Most of the text generated in images from SDXL is still garbled and often illegible! See, for example, the images generated from different prompts below. The text present in each image is mostly gibberish. [2/7]
At first, this may seem like a pretty small issue, and in some ways it is! SDXL is an incredible model that can produce a variety of useful outputs and understand intricate semantic details of textual prompts. [3/7]
Following the release of LLaMA, we saw a rapid explosion of open-source research on large language models (LLMs). Here are the three most notable model releases during this time… 🧵 [1/8]
1. Alpaca
Alpaca is a fine-tuned version of the LLaMA-7B LLM that performs similarly to OpenAI’s text-davinci-003 (i.e., GPT-3.5). It is created via instruction fine-tuning on data generated with the self-instruct framework. [2/8]
Alpaca is trained for less than $600 (including both data collection and the compute cost of fine-tuning) and is found to roughly match the performance of GPT-3.5. Believe it or not, other LLaMA-based LLMs (following Alpaca) are created for even less than this! [3/8]
“Chain of thought prompting can improve performance on various reasoning tasks... the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).”
Large language models (LLMs) are poor at solving basic reasoning tasks. We can improve this ability with chain-of-thought (CoT) prompting, which simply breaks a reasoning task into a multi-step process (i.e., chain-of-thought) within the LLM's prompt. [2/4]
CoT prompting is a generic idea that can be applied to many different reasoning tasks. Although different tasks may require some prompt engineering, we always follow the generic approach of injecting a chain-of-thought into the LLM's prompt. [3/4]
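To illustrate this genericity, here’s a small sketch in which the same prompt template is reused across tasks simply by swapping in task-appropriate exemplars (the symbolic-reasoning exemplar below is illustrative):

```python
# Hedged sketch of the generic recipe: one prompt template works for any
# reasoning task once we supply task-appropriate CoT exemplars.

def cot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot CoT prompt from (question, reasoning) pairs."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

# Illustrative exemplar for a symbolic-reasoning task (last-letter concatenation).
symbolic_shots = [(
    'Take the last letters of the words in "Elon Musk" and concatenate them.',
    'The last letter of "Elon" is "n". The last letter of "Musk" is "k". '
    'Concatenating them gives "nk". The answer is nk.',
)]

print(cot_prompt(
    symbolic_shots,
    'Take the last letters of the words in "Bill Gates" and concatenate them.',
))
```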
As Large Language Models (LLMs) improve in quality, evaluating them becomes more difficult. Recent models are so good that even humans struggle to discern differences in quality. Luckily, we can just create an automated evaluation framework using GPT-4! 🧵 [1/6]
This technique was pioneered by the recent Vicuna model, which is a version of LLaMA-13B that has undergone supervised fine-tuning (SFT) over a set of 70K instruction-following examples from ShareGPT. [2/6]
To perform evaluation, the authors of Vicuna devise eight question categories and have GPT-4 generate ten benchmark questions per category. Surprisingly, GPT-4 is capable (with proper prompt engineering) of generating challenging questions that many LLMs struggle to answer. [3/6]
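A minimal sketch of the LLM-as-a-judge idea is shown below, using the OpenAI Python SDK. The judge prompt wording here is illustrative, not the exact rubric used by the Vicuna authors:

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to compare two assistant answers to the same question."""
    # Illustrative judge prompt -- not the exact wording from the Vicuna eval.
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A's answer: {answer_a}\n\n"
        f"Assistant B's answer: {answer_b}\n\n"
        "Rate each answer on a scale of 1-10 for helpfulness and accuracy, "
        "then explain which answer is better."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```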