Harrison Chase
Dec 6 · 21 tweets · 9 min read
A 🧵 on the results of a little investigation I did over the past week

❓How does text-davinci-003 do on agent-like tasks?

TLDR: It displays superior understanding of intent and a superior ability to take multi-step actions toward the original goal

h/t to @OfirPress and @momusbah for help/feedback with this
✨text-davinci-003 came out a week ago

Lots of people (like @blennon_ below) provided some great analysis of how it compared to `text-davinci-002`

This is my contribution to that, focused on agent-like tasks

What are agent-like tasks?

I focus mainly on exploring situations where the LLM is used as an agent with access to some tools, and needs to answer a question using the tools available to it. For example:

@OfirPress Self-Ask with Search
@ShunyuYao12 ReAct with Wikipedia
Why focus on agent-like tasks?

These tasks involve the LLM figuring out when it needs to use a tool, then observing the result of using that tool, and then taking another action based on that observation

IMO these are all good proxies for intelligence
What did I evaluate?

1⃣Self-Ask with Search on @OfirPress's Compositional Celebrities dataset
2⃣Anecdotal failure modes @momusbah found
3⃣ReAct on the HotPotQA dataset

Note that these are definitely NOT comprehensive, and if anyone is interested I'd love to collaborate on more
1⃣ First evaluation: Self-Ask (with Search) on the Compositional Celebrities Dataset

This is a dataset @OfirPress created to judge an LLM's reasoning ability on multi-hop question answering

Read more about it here: ofir.io/self-ask.pdf
In addition to self-ask with search, we also benchmarked regular self-ask (as a baseline)

Big thank you to @OfirPress for helping run some of this evaluation + reasoning through it
On the Compositional Celebrities dataset we benchmarked four things:

Self ask (no search) with -002
Self ask (no search) with -003
Self ask with search with -002
Self ask with search with -003

Here are the results: [image]
We can see -003 improves on both normal Self-Ask and Self-Ask with Search

NOTE: the reason Self-Ask with Search actually does worse than plain Self-Ask is that the search wrapper we are using often does not return good results

Here is the notebook we used colab.research.google.com/drive/1LP_KTWX…
The fact that self-ask with search showed a large improvement (0.34 → 0.41) while normal self-ask did not suggests to me that one of the big improvements in -003 is its ability to use and interact with external tools (rather than its chain-of-thought-like reasoning)
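For readers unfamiliar with the pattern, here is a minimal sketch of how self-ask with search works: the model decomposes the question into follow-ups, and each follow-up is answered by a search tool rather than by the model itself. Both `llm()` and `search()` below are illustrative stubs, not the actual prompts or search wrapper used in the evaluation.

```python
# Self-ask with search, in miniature: the LLM emits "Follow up:" questions,
# a search tool answers each one, and the intermediate answers are fed back
# into the prompt until the model emits a final answer.

def search(query: str) -> str:
    """Stub search tool; the real evaluation used a web search wrapper."""
    facts = {
        "Who was the first president of the USA?": "George Washington",
        "Where was George Washington born?": "Virginia",
    }
    return facts.get(query, "No results")

def llm(prompt: str) -> str:
    """Stub standing in for text-davinci-003 under a self-ask prompt."""
    if "George Washington" not in prompt:
        return "Follow up: Who was the first president of the USA?"
    if "Virginia" not in prompt:
        return "Follow up: Where was George Washington born?"
    return "So the final answer is: Virginia"

def self_ask_with_search(question: str, max_hops: int = 4) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_hops):
        line = llm(prompt)
        if line.startswith("So the final answer is:"):
            return line.split(":", 1)[1].strip()
        followup = line.split(":", 1)[1].strip()
        # Answer the follow-up with the tool, not the model
        prompt += f"{line}\nIntermediate answer: {search(followup)}\n"
    return "No answer"

print(self_ask_with_search("Where was the first US president born?"))  # → Virginia
```

The key point for the benchmark above: the model's job is only decomposition and final synthesis, so a bad search result (as noted above) hurts accuracy even when the reasoning is right.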
2⃣ The second method I used to evaluate was some anecdotal failure modes @momusbah provided

We had actually been talking the day before `text-davinci-003` came out about these, so it was great timing!

Let's take a look at some of these

Notebook here: colab.research.google.com/drive/1UVnBEHg…
The first example shows `text-davinci-003` being better at keeping the original question in mind and not getting distracted

We can see it starts with the same original question as 002, but whereas 002 gets a bit lost, 003 asks good follow-ups [images]
The next two examples show text-davinci-003 doing a bit better with the concept of time, namely handling current events

Specifically, it understands the underlying intent that the user is asking for CURRENT information, whereas text-davinci-002 does not [images]
However, it is not perfect.

Here is one example where it (along with text-davinci-002) struggles to understand more complex ideas

Although even here its answer is MUCH more reasonable than 002's (which seems to forget the original question) [images]
3⃣Finally, let's evaluate it a third way: using a ReAct agent with access to Wikipedia on the HotPotQA dataset

The ReAct agent is based on a framework @ShunyuYao12 came up with (see the tweet linked below)

Notebook here: colab.research.google.com/drive/1UVnBEHg…

I was somewhat resource constrained here and so only ran it on a few examples from the dataset

But we can see a lot of similar trends as before: text-davinci-003 does a much better job of understanding and remembering the original question
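The ReAct loop differs from self-ask in that the model interleaves explicit reasoning ("Thought") with tool calls ("Action") and their results ("Observation"). A condensed sketch of that loop, with stubbed `llm()` and `wikipedia()` functions standing in for the model and the Wikipedia API:

```python
# ReAct, in miniature: the model alternates Thought / Action / Observation,
# with Wikipedia lookup as the only tool. Both functions below are stubs;
# the real agent calls text-davinci-003 and the Wikipedia API.

def wikipedia(entity: str) -> str:
    pages = {"Colorado orogeny": "The Colorado orogeny was an episode of mountain building."}
    return pages.get(entity, "Could not find page.")

def llm(transcript: str) -> str:
    if "Observation:" not in transcript:
        return "Thought: I should look up the Colorado orogeny.\nAction: Search[Colorado orogeny]"
    return "Thought: I now know the answer.\nAction: Finish[an episode of mountain building]"

def react(question: str, max_steps: int = 6) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        action = step.split("Action: ", 1)[1]
        if action.startswith("Finish["):
            return action[len("Finish["):-1]
        # Run the Search tool and append the observation for the next step
        entity = action[len("Search["):-1]
        transcript += f"Observation: {wikipedia(entity)}\n"
    return "No answer"

print(react("What is the Colorado orogeny?"))  # → an episode of mountain building
```

"Remembering the original question" in the examples below amounts to the model keeping `Question:` at the top of this growing transcript in mind across many Thought/Action turns.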
The difference in this first example is small

text-davinci-003 remembers the original question and responds appropriately (yes/no rather than with the nationality)

But it also goes off on a bit of a tangent (before getting back on track), which is a bit odd... [images]
In this next example, text-davinci-003 is much better at executing a longer sequence of more complicated reasoning steps [images]
And in this final example, it again displays the ability to keep executing over a longer, more drawn-out trajectory [images]
My main takeaways from all of this are that text-davinci-003:

- displays superior understanding of intent
- displays a superior ability to take multi-step actions toward the original objective
This is far from a comprehensive analysis! I found myself a bit time- and resource-constrained

If you are interested in collaborating on evaluation of LLMs on agent-like tasks please reach out - I would love to do so!

More from @hwchase17

Dec 5
Some really cool stuff added by @subby_tech (👏👏👏) to @LangChainAI over the weekend:

⚕️APIChain

A general framework for interacting with an API in natural language

🧵See below for a more in depth explanation + examples
At a high level, the flow is:

1⃣ Format a prompt with API docs + a question
2⃣ Have an LLM generate API query to run to get an answer
3⃣ Run said API query
4⃣ Have LLM interpret API response and answer original question in natural language
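The four steps above can be sketched as follows. Note this is an illustrative stub, not APIChain's actual implementation: `llm()` stands in for the model, `run_query()` for the HTTP call, and the doc string and endpoint are made up.

```python
# APIChain's flow in miniature: docs + question → API query → response →
# natural-language answer. All functions and the endpoint are stubs.

API_DOCS = "GET /weather?city=<name> returns the current temperature in Celsius."

def llm(prompt: str) -> str:
    """Stub LLM: generates a query, or interprets a response."""
    if "interpret" in prompt.lower():
        return "It is currently 18C in Paris."
    return "/weather?city=Paris"

def run_query(query: str) -> str:
    return '{"temp_c": 18}'  # stub for an actual HTTP request

def api_chain(question: str) -> str:
    # 1⃣ format a prompt with API docs + the question, 2⃣ generate the query
    query = llm(f"Docs: {API_DOCS}\nQuestion: {question}\nAPI query:")
    # 3⃣ run said API query
    response = run_query(query)
    # 4⃣ have the LLM interpret the response in natural language
    return llm(f"Interpret this response for the user: {response}")

print(api_chain("What's the weather in Paris?"))  # → It is currently 18C in Paris.
```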
Note that the LLM is doing in context learning (via the API docs) to figure out how to call the API

For popular APIs, the LLM may(?) be able to generate the correct API call without that context... but this methodology allows it to work on smaller, newer, or private APIs
Dec 5
With lots of "unofficial" ChatGPT APIs popping up (most based on @danielgross's code), there's been a lot of asks to hook this into LangChain.

Here's how to do so:
1. Set up an unofficial ChatGPT API

Here's one example from @taranjeetio: github.com/taranjeet/chat…

Here's another example from @kylejohnmorris: github.com/kylejmorris/ch…
2. Write a CustomLLM @LangChainAI wrapper

Here's an example wrapper for @taranjeetio's API implementation: gist.github.com/hwchase17/af22…

Since most implementations are based on @danielgross's code, this wrapper should work for most of them
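In spirit, a custom LLM wrapper is just a class that exposes a uniform prompt-in, text-out interface over some backend (in LangChain the base class to subclass is `langchain.llms.base.LLM`, implementing `_call`). A self-contained sketch with the backend stubbed out, so the ChatGPT bridge itself is hypothetical here:

```python
# A duck-typed sketch of a custom LLM wrapper: uniform callable interface,
# with LangChain-style stop sequences. The backend lambda below stands in
# for an HTTP call to a local unofficial ChatGPT API.

class UnofficialChatGPT:
    def __init__(self, backend):
        self.backend = backend  # any callable: prompt -> generated text

    def __call__(self, prompt: str, stop=None) -> str:
        text = self.backend(prompt)
        if stop:
            # Truncate the generation at the first stop sequence, if any
            for s in stop:
                text = text.split(s)[0]
        return text

llm = UnofficialChatGPT(lambda p: "Paris\nQuestion: next unrelated text")
print(llm("What is the capital of France?", stop=["\nQuestion:"]))  # → Paris
```

Because every chain and agent in LangChain talks to models through this narrow interface, swapping in an unofficial backend is just a matter of implementing it.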
Dec 1
A new chain we introduced in @LangChainAI (with a lot of help from @johnjnay 👏👏):

❓Question Answering with Sources❓

github.com/hwchase17/lang…

This takes in a question and a list of documents, and uses those documents to answer that question, citing its sources
There are several cool things about this chain

1⃣ It's a general chain that can be applied to lots of problems

2⃣ It runs over each document individually, and then combines the answers, meaning it can work on longer documents

3⃣ Citing sources is an extremely important UX!
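The per-document-then-combine idea (point 2⃣) can be sketched like this. The `llm()` stub, document contents, and source IDs are all illustrative; the real chain uses prompt templates over an actual model.

```python
# Question answering with sources, in miniature: ask the question of each
# document separately, then combine the partial answers, carrying source
# IDs through so the final answer cites where it came from.

def llm(prompt: str) -> str:
    """Stub LLM for both the per-document and the combine step."""
    if "Combine" in prompt:
        return "Paris is the capital of France. SOURCES: doc-1"
    if "capital" in prompt and "Paris" in prompt:
        return "Paris is the capital of France. (doc-1)"
    return "This document is not relevant."

docs = {
    "doc-1": "Paris has been the capital of France since 508 AD.",
    "doc-2": "The Rhine flows through six countries.",
}

def qa_with_sources(question: str) -> str:
    # Run over each document individually (this is what lets it scale
    # to corpora too long for one prompt)...
    partials = [llm(f"Source {sid}: {text}\nQ: {question}")
                for sid, text in docs.items()]
    # ...then combine the partial answers, keeping the citations.
    return llm("Combine these answers, citing sources:\n" + "\n".join(partials))

print(qa_with_sources("What is the capital of France?"))
# → Paris is the capital of France. SOURCES: doc-1
```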
One potential use case (from @johnjnay at @CodeXStanford):

LLM understanding of legal reasoning / legal language.

Given that legal documents are long and complicated, they are developing approaches for LLMs to recursively analyze them in sequential chains of LLM interactions.
Nov 29
Last week, the PaL (Program-aided Language models) paper came out, pushing the boundary of LLM performance on math/symbolic reasoning tasks

As of today, it's in 🦜🔗LangChain

🙂Was super fun to work with @urialon1 & @aman_madaan on this

Collab notebook: colab.research.google.com/drive/1OCuadD5…
See the below for a far better explanation of PaL than I could ever give

If I had to give one: use an LLM to figure out what to do, but a Python interpreter to actually do it
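That one-line summary can be made concrete. In this sketch, `generate_program()` is a stub for the code text-davinci-003 would emit under PaL's few-shot prompt; the essential move is that Python's interpreter, not the model, does the arithmetic:

```python
# The PaL idea in miniature: the LLM writes a small Python program for a
# word problem, and the interpreter computes the answer.

def generate_program(question: str) -> str:
    # Stub: in PaL this program is generated by the LLM via few-shot prompting.
    return (
        "tennis_balls = 5\n"
        "bought = 2 * 3  # two cans of three balls each\n"
        "answer = tennis_balls + bought\n"
    )

def pal(question: str) -> int:
    program = generate_program(question)
    scope: dict = {}
    exec(program, scope)  # Python does the math, so no arithmetic slips
    return scope["answer"]

print(pal("Roger has 5 tennis balls and buys 2 cans of 3. How many now?"))  # → 11
```

This is why PaL shines on math/symbolic tasks: the model only has to translate the problem into code, a task LLMs are comparatively good at.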

When this paper came out, I think at least five people told me I had to include it in LangChain

I reached out to @urialon1 and he was excited by the possibility

He put me in contact with @aman_madaan who fielded a lot of my silly questions and helped get it in 🙂
Nov 23
An overview of a new abstraction in @LangChainAI:

🤖 Agents

Agents use an LLM to determine which actions to take and in what order.

Big h/t to @dmdohan and @vladquant for helping think through a lot of this

🧵Here's an explanation of what agents are and how to use them
🤖Agents use an LLM to determine which actions to take and in what order.

🏃‍♂️An action can either be using a tool and observing its output, or returning to the user.

There have been many (better) pieces written on this concept recently that I will link to at the end
🔧 "Tools" in this context can be anything that takes a string as input and outputs a string.

This can be a search engine, a database, another LLM, a chain, or even another agent

A lot of these tools are already in LangChain
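Because the interface is just string in, string out, wildly different things plug in uniformly. A small sketch of that abstraction (the tool names and stubs here are illustrative, not LangChain's built-in tools):

```python
# "Anything that takes a string and returns a string" can be a tool: here a
# calculator, a toy search index, and a stubbed LLM all share one interface,
# so an agent can dispatch to any of them without special-casing.
from typing import Callable

Tool = Callable[[str], str]

def calculator(expr: str) -> str:
    return str(eval(expr))  # fine for a sketch; never eval untrusted input

def search(query: str) -> str:
    return {"capital of France": "Paris"}.get(query, "no results")

def stub_llm(prompt: str) -> str:
    return "a generated answer to: " + prompt

TOOLS: dict[str, Tool] = {"calculator": calculator, "search": search, "llm": stub_llm}

def use_tool(name: str, tool_input: str) -> str:
    return TOOLS[name](tool_input)

print(use_tool("calculator", "2 * 21"))         # → 42
print(use_tool("search", "capital of France"))  # → Paris
```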
Nov 23
💥 We've added a LOT of stuff to @LangChainAI recently

I've gotten asked a few times by users & contributors what LangChain helps with and what the main value props are (the sheer amount of stuff in there doesn't make that clear)

Here's my answer:
🦜🔗LangChain is aimed at making it easy to develop applications with LLMs. There are 3 main areas it helps with (with a bonus sneak peek of a 4th). In increasing order of complexity:

🦜 LLMs and Prompts
🔗 Chains
🤖 Agents
🧠 ****** (you have to read to end to find out)
I'll go over all of these in this thread, but for more information please see:

Full documentation: langchain.readthedocs.io/en/latest/

Getting started section (walks through all of these in a beginner friendly way): langchain.readthedocs.io/en/latest/gett…
