One new idea for how to do this is to have language models evaluate themselves.
@LangChainAI has now added some examples for doing so. Mostly focused on question answering, but expanding soon (hopefully with community help!)
Thread 🧵
If you are using language models for text generation, you've probably realized that traditional metrics like accuracy and F1 score don't work that well.
The common way to evaluate text generation tasks is manually (with human experts), which is slow and not scalable
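To make the self-evaluation idea concrete, here is a minimal sketch of a model grading question-answering outputs against reference answers. It is not the LangChain implementation: `GRADER_TEMPLATE`, `grade_answers`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that only compares strings so the example runs without a model. In practice you would pass in a function that calls a real LLM.

```python
from typing import Callable, Dict, List

# Hypothetical grading prompt; the wording is an assumption, not a known template.
GRADER_TEMPLATE = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Submitted answer: {prediction}\n"
    "Reply with exactly CORRECT or INCORRECT."
)


def grade_answers(
    examples: List[Dict[str, str]], llm: Callable[[str], str]
) -> List[bool]:
    """Ask the grader model for a verdict on each example."""
    verdicts = []
    for example in examples:
        prompt = GRADER_TEMPLATE.format(**example)
        reply = llm(prompt).strip().upper()
        # startswith avoids counting "INCORRECT" as a match for "CORRECT".
        verdicts.append(reply.startswith("CORRECT"))
    return verdicts


def fake_llm(prompt: str) -> str:
    """Stand-in grader (assumption): compares the two answers literally.

    A real model would judge whether the answers mean the same thing.
    """
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    same = fields["Reference answer"].lower() == fields["Submitted answer"].lower()
    return "CORRECT" if same else "INCORRECT"


examples = [
    {"question": "What is the capital of France?", "reference": "Paris", "prediction": "paris"},
    {"question": "What is 2 + 2?", "reference": "4", "prediction": "5"},
]
verdicts = grade_answers(examples, fake_llm)
print(verdicts)
```

The grader only sees text in and text out, so the same loop works for any model you can wrap in a `str -> str` function.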
I focus mainly on exploring situations where the LLM is used as an agent that has access to some tools and needs to answer a question using the tools it has available. For example:
A general framework for interacting with an API in natural language
🧵 See below for a more in-depth explanation + examples
At a high level, the flow is:
1⃣ Format a prompt with API docs + a question
2⃣ Have an LLM generate an API query to run to get an answer
3⃣ Run said API query
4⃣ Have the LLM interpret the API response and answer the original question in natural language
Note that the LLM is doing in-context learning (via the API docs) to figure out how to call the API
For popular APIs, the LLM may(?) be able to generate the correct API call without that context... but this methodology allows it to work on smaller, newer, or private APIs
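The four-step flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's actual code: the prompt templates, `answer_with_api`, the stub `fake_llm`, the stub `fake_call_api`, and the `api.example.com` URL are all hypothetical. The stubs stand in for a real model and a real HTTP call so the example runs offline.

```python
from typing import Callable, List

# Hypothetical prompt templates; the wording is an assumption.
QUERY_TEMPLATE = (
    "API documentation:\n{docs}\n\nQuestion: {question}\nAPI URL to call:"
)
ANSWER_TEMPLATE = "Question: {question}\nAPI response: {response}\nAnswer:"


def answer_with_api(
    question: str,
    api_docs: str,
    llm: Callable[[str], str],
    call_api: Callable[[str], str],
) -> str:
    # Step 1: format a prompt with the API docs and the question.
    query_prompt = QUERY_TEMPLATE.format(docs=api_docs, question=question)
    # Step 2: have the LLM generate the API query (here, a URL).
    api_url = llm(query_prompt).strip()
    # Step 3: run the query.
    response = call_api(api_url)
    # Step 4: have the LLM turn the raw response into a natural-language answer.
    answer_prompt = ANSWER_TEMPLATE.format(question=question, response=response)
    return llm(answer_prompt).strip()


calls: List[str] = []  # records which URLs the stub API was asked for


def fake_llm(prompt: str) -> str:
    """Stand-in model (assumption): returns canned text for each step."""
    if prompt.endswith("API URL to call:"):
        return "https://api.example.com/weather?city=Paris"
    return "It is 18 degrees in Paris."


def fake_call_api(url: str) -> str:
    """Stand-in for an HTTP GET (assumption): returns a canned JSON string."""
    calls.append(url)
    return '{"city": "Paris", "temp_c": 18}'


docs = "GET https://api.example.com/weather?city=<name> returns temp_c"
answer = answer_with_api("How warm is it in Paris?", docs, fake_llm, fake_call_api)
print(answer)
```

Because the docs are just a string in the prompt, swapping in a different API means swapping the `api_docs` argument, which is what makes this work for smaller, newer, or private APIs.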
LLM understanding of legal reasoning / legal language.
Given that legal documents are long and complicated, they are developing approaches for LLMs to recursively analyze them in sequential chains of LLM interactions.
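One way recursive analysis of a long document could look is sketched below. This is a guess at the general shape, not the method described above: `split_into_chunks`, `analyze_recursively`, and `fake_llm` are hypothetical names, chunking by character count is a simplification, and the sketch assumes each model call returns something shorter than its input.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Cut the text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def analyze_recursively(
    text: str, llm: Callable[[str], str], max_chars: int = 2000
) -> str:
    """Analyze text that may be too long for a single model call."""
    if len(text) <= max_chars:
        # Base case: the text fits in one call.
        return llm(text)
    # Analyze each chunk in sequence, then recurse on the combined results.
    partials = [llm(chunk) for chunk in split_into_chunks(text, max_chars)]
    combined = "\n".join(partials)
    if len(combined) >= len(text):
        # Guard against looping forever if the model does not shorten its input.
        raise ValueError("model output did not shrink the text")
    return analyze_recursively(combined, llm, max_chars)


calls: List[str] = []  # records every prompt the stub model receives


def fake_llm(text: str) -> str:
    """Stand-in model (assumption): 'summarizes' by keeping the first 20 characters."""
    calls.append(text)
    return text[:20]


document = "a" * 100
summary = analyze_recursively(document, fake_llm, max_chars=50)
print(len(calls), summary)
```

A real version would likely split on section or clause boundaries rather than raw character counts, so each chunk stays meaningful on its own.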