Cameron R. Wolfe
Apr 25 · 7 tweets · 3 min read
Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how…🧵 [1/7]
The technique is called chain-of-thought (CoT) prompting. It improves the reasoning abilities of LLMs using few-shot learning. In particular, CoT prompting inserts several examples of “chains of thought” for solving a reasoning problem into the LLM’s prompt. [2/7]
Here, a chain of thought is defined as “a coherent series of intermediate reasoning steps that lead to the final answer for a problem”. A CoT mimics how we solve reasoning problems as humans -- by breaking the problem down into intermediate steps that are easier to solve. [3/7]
Prior techniques teach LLMs how to generate coherent chains of thought via fine-tuning. Although this improves reasoning performance, such an approach requires an annotated dataset of reasoning problems with an associated CoT, which is burdensome and expensive to create. [4/7]
CoT prompting combines the idea of using chains of thought to improve reasoning performance with the few-shot learning abilities of LLMs. We can teach LLMs to generate a coherent CoT with their solution by just providing exemplars as part of their prompt. [5/7]
Such an approach massively improves LLM performance on tasks like arithmetic, commonsense, and symbolic reasoning. Plus, it requires minimal data curation (i.e., just a few examples for the prompt) and performs no fine-tuning on the LLM. [6/7]
Put simply, CoT prompting is a technique that can be applied to any pre-trained LLM checkpoint to improve reasoning performance. See the overview below for more details.

🔗: bit.ly/42067HU

[7/7]
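
To make this concrete, here is a minimal sketch of what a CoT prompt looks like in practice. The first exemplar is the well-known tennis-ball problem from the CoT paper; the `generate` function is a hypothetical placeholder for whatever pre-trained LLM completion API you use.

```python
# Minimal chain-of-thought prompt: a few-shot exemplar whose answer spells
# out intermediate reasoning steps, followed by the actual question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""


def generate(prompt: str) -> str:
    """Hypothetical placeholder: call any pre-trained LLM's completion API."""
    raise NotImplementedError("plug in your model of choice here")


# Because the exemplar demonstrates step-by-step reasoning, the model tends
# to emit its own chain of thought before the final answer, e.g.:
# "They started with 23 apples, used 20, leaving 3. 3 + 6 = 9. The answer is 9."
print(generate(COT_PROMPT))
```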

More from @cwolferesearch

Apr 24
Can large language models (LLMs) train themselves? Recent research indicates that the answer might be yes… 🧵 [1/7]
But what exactly do we mean by this? One notable method of using LLMs to train other LLMs involves using these models to generate data for instruction tuning. Typically, a larger, more powerful model is used for generation. [2/7]
This technique was pioneered by the self-instruct framework. Beginning with a small set of initial tasks (including one instruction and one input-output example per task), self-instruct uses LLMs to generate more data for instruction tuning. [3/7]
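
As a rough sketch (not the authors' exact implementation), the self-instruct loop looks something like the following. The prompt template, the `generate` call, and the duplicate filter are all simplified assumptions.

```python
import random

# Small pool of hand-written seed tasks; self-instruct starts from ~175 of these.
seed_tasks = [
    {"instruction": "Classify the sentiment of this review.",
     "input": "The food was great!",
     "output": "positive"},
    # ... more seed tasks ...
]


def generate(prompt: str) -> dict:
    """Hypothetical placeholder: ask a strong 'teacher' LLM for one new task
    (instruction + input + output) in the same format as the exemplars."""
    raise NotImplementedError


def is_new(task: dict, pool: list) -> bool:
    """Simplified filter: keep only tasks whose instruction is unseen."""
    return all(task["instruction"] != t["instruction"] for t in pool)


pool = list(seed_tasks)
while len(pool) < 1000:  # grow until the synthetic dataset is big enough
    exemplars = random.sample(pool, k=min(3, len(pool)))
    prompt = "Come up with a new task in the same format:\n\n" + "\n\n".join(
        f"Instruction: {t['instruction']}\nInput: {t['input']}\nOutput: {t['output']}"
        for t in exemplars
    )
    candidate = generate(prompt)
    if is_new(candidate, pool):
        pool.append(candidate)

# `pool` now serves as an instruction-tuning dataset for a smaller model.
```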
Apr 21
Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But is this always the best architecture to use? It depends… 🧵 [1/8]
First of all, what is a decoder-only architecture? Well, the architecture is exactly what it sounds like: a transformer architecture with the encoder removed. [2/8]

Decoder-only architectures use masked self-attention in each of their layers, meaning that each token considers only preceding tokens during the computation of self-attention. [3/8]
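
A minimal NumPy sketch of masked (causal) self-attention for a single head: the mask forces each position to attend only to itself and earlier positions (the shapes and inputs here are illustrative).

```python
import numpy as np


def masked_self_attention(q, k, v):
    """Single-head causal self-attention; q, k, v are (seq_len, d) arrays."""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)                 # (seq_len, seq_len)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf                        # hide tokens to the right
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (seq_len, d)


x = np.random.randn(4, 8)  # 4 tokens, model dimension 8
out = masked_self_attention(x, x, x)
# Row i of the attention weights is zero for every column j > i.
```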

Apr 18
The new Stable Diffusion XL (SDXL) model is amazing, but I think there is considerable work to be done before prompt-based generative image models reach their true potential. In particular, we need to fix one major problem… 🧵 [1/7]
Most of the text generated within images from SDXL is still garbled and oftentimes illegible! Across images produced from a variety of prompts, the text present in each image is mostly gibberish. [2/7]
At first, this may seem like a pretty small issue, and in some ways it is! SDXL is an incredible model that can produce a variety of useful outputs and understand intricate semantic details of textual prompts. [3/7]
Apr 17
Following the release of LLaMA, we saw a rapid explosion of open-source research on large language models (LLMs). Here are the three most notable model releases during this time… 🧵 [1/8]
1. Alpaca

Alpaca is a fine-tuned version of the LLaMA-7B LLM that performs similarly to OpenAI’s text-davinci-003 (i.e., GPT-3.5). It is created using instruction fine-tuning according to the self-instruct framework. [2/8]
Alpaca is trained for less than $600 (including both data collection and the compute cost of fine-tuning) and is found to roughly match the performance of GPT-3.5. Believe it or not, other LLaMA-based LLMs (following Alpaca) were created even more cheaply than this! [3/8]
Apr 13
🧵How can we teach LLMs to reason? 🧵

“Chain of thought prompting can improve performance on various reasoning tasks... the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).”

🔗: arxiv.org/abs/2201.11903

[1/4]
Large language models (LLMs) are poor at solving basic reasoning tasks. We can improve this ability with chain-of-thought (CoT) prompting, which simply breaks a reasoning task into a multi-step process (i.e., chain-of-thought) within the LLM's prompt. [2/4]
CoT prompting is a generic idea that can be applied to many different reasoning tasks. Although different tasks may require some prompt engineering, we always follow the generic approach of injecting a chain-of-thought into the LLM's prompt. [3/4]
Apr 12
As Large Language Models (LLMs) improve in quality, evaluating them becomes more difficult. Recent models are so good that even humans struggle to discern differences in quality. Luckily, we can just create an automated evaluation framework using GPT-4! 🧵 [1/6]
This technique was pioneered by the recent Vicuna model, which is a version of LLaMA-13B that has undergone supervised fine-tuning (SFT) over a set of 70K instruction-following examples from ShareGPT.

🔗: sharegpt.com

[2/6]
To perform evaluation, the authors of Vicuna devise eight question categories and have GPT-4 generate ten benchmark questions per category. Surprisingly, GPT-4 is capable (with proper prompt engineering) of generating challenging questions that many LLMs struggle to answer. [3/6]
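
A sketch of this LLM-as-a-judge setup is shown below; the exact wording of the Vicuna evaluation prompt differs, and the `judge` call is a hypothetical placeholder for a GPT-4 API request.

```python
# GPT-4 is shown a question plus two assistants' answers and asked to score
# both; averaging these scores across the benchmark ranks the models.
JUDGE_TEMPLATE = """\
[Question]
{question}

[Assistant 1]
{answer_1}

[Assistant 2]
{answer_2}

Rate the helpfulness, relevance, accuracy, and level of detail of each
answer on a scale of 1 to 10. First output the two scores on one line,
then give a short explanation of your ratings.
"""


def judge(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to GPT-4, return its reply."""
    raise NotImplementedError


verdict = judge(JUDGE_TEMPLATE.format(
    question="Explain how a solar eclipse happens.",
    answer_1="<candidate model A's answer>",
    answer_2="<candidate model B's answer>",
))
```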
