YouTube is a great source of content for LLM chat / Q+A apps. I recently added a @LangChainAI document loader to simplify this: pass in YouTube video urls, get back text documents that can be easily embedded for retrieval QA or chat (see below)🪄 github.com/hwchase17/lang…
@karpathy inspired this work a while ago w/ Whisper transcriptions of the @lexfridman pod. I used a similar pipeline to build a Q+A app, lex-gpt. @OpenAI Whisper API simplified the pipeline, so I wrapped it all in an easy-to-use @LangChainAI doc loader ..
.. see this notebook for an example going from YouTube urls to a chat app in ~10 lines of code. You can find this feature in the latest @LangChainAI releases (> v0.0.192). github.com/rlancemartin/l…
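For reference, here's a minimal sketch of that flow against a ~v0.0.192-era @LangChainAI API (the video URL and save_dir are placeholders; assumes OPENAI_API_KEY is set):

```python
# Minimal sketch: YouTube URLs -> Whisper transcripts -> retrieval QA (LangChain ~v0.0.192)
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

urls = ["https://youtu.be/<video_id>"]   # placeholder video URL
save_dir = "/tmp/youtube_audio"          # where the downloaded audio lands

# Download audio w/ yt_dlp, then transcribe each file w/ the OpenAI Whisper API
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()

# Split transcripts, embed, and stand up a retrieval QA chain
splits = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150).split_documents(docs)
vectordb = FAISS.from_documents(splits, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(ChatOpenAI(temperature=0), retriever=vectordb.as_retriever())
print(qa.run("What topics does the video cover?"))
```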
I'll be spending a lot of time on Document Loaders at @LangChainAI and welcome any ideas / feedback: 1) what document loaders / integrations are missing? 2) what tutorials are missing? etc.
There are a lot of questions abt smaller, open-source LLMs vs larger, closed models for tasks like question answering. So, we added @MosaicML MPT-7B & @lmsysorg Vicuna-13b to the @LangChainAI auto-evaluator. You can test them on your own Q+A use-case ... autoevaluator.langchain.com/playground
... great pod w/ @jefrankle, @swyx, @abhi_venigalla on MPT-7B, so I used the auto-evaluator to benchmark it on a test set of 5 Q+A pairs from the GPT3 paper. Results are close to the larger models and it's very fast (kudos to the @MosaicML inference team!) ...
... @sfgunslinger also deployed Vicuna-13b on @replicatehq and it achieves performance parity w/ the larger models on this test set. Prompt eng may further improve this (v helpful discussion w/ the folks at @replicatehq / @JoeEHoover); we are looking into improving latency ...
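A rough sketch of how a Replicate-hosted open model slots into the same kind of QA chain, assuming the 2023-era Replicate integration in @LangChainAI (the model version string is a placeholder; grab the current one from replicate.com):

```python
# Sketch: drop a Replicate-hosted Vicuna-13b into a retrieval QA chain
from langchain.llms import Replicate
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Placeholder model string: copy the current owner/model:version from replicate.com
vicuna = Replicate(
    model="replicate/vicuna-13b:<version-hash>",
    input={"temperature": 0.1, "max_length": 500},
)

# Tiny in-memory index so the sketch is self-contained; swap in your own docs
retriever = FAISS.from_texts(
    ["GPT-3 was evaluated in zero-shot, one-shot, and few-shot settings."],
    OpenAIEmbeddings(),
).as_retriever()

qa_chain = RetrievalQA.from_chain_type(llm=vicuna, retriever=retriever)
print(qa_chain.run("How was GPT-3 evaluated?"))
```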
I've seen questions about @AnthropicAI's 100k context window: can it compete w/ vectorDB retrieval? We added Claude-100k to the @LangChainAI auto-evaluator app so you can compare for yourself (details showing Claude-100k results below). App is here: autoevaluator.langchain.com/playground
.. there are many retrieval approaches for Q+A that fetch docs relevant to a question, followed by LLM answer synthesis. But as LLM context windows grow, retrieval may not be needed since you can just stuff the full doc(s) into the prompt (red in the diagram) ..
.. we tested on Q+A eval sets from the GPT3 paper and SF Building Codes (75- and 51-page PDFs). @AnthropicAI 100k was impressively close in terms of performance to various retrieval methods, but does have higher latency. See details here: blog.langchain.dev/auto-evaluatio…
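A minimal sketch of the two approaches being compared, assuming a mid-2023 @LangChainAI API (the PDF path and the Claude model name are placeholders):

```python
# Sketch: stuff the full doc into a 100k-context model vs. classic retrieval QA
from langchain.chat_models import ChatAnthropic, ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = PyPDFLoader("sf_building_code.pdf").load()   # placeholder path to a long PDF
question = "When is a fire sprinkler system required?"

# Option 1: no retrieval -- pass every page into Claude's 100k context window
claude = ChatAnthropic(model="claude-v1-100k", temperature=0)   # model name may differ
stuff_chain = load_qa_chain(claude, chain_type="stuff")
full_context_answer = stuff_chain.run(input_documents=docs, question=question)

# Option 2: retrieval -- embed chunks and fetch only the relevant ones at query time
retriever = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever()
retrieval_answer = RetrievalQA.from_chain_type(ChatOpenAI(temperature=0), retriever=retriever).run(question)
```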
Here's a free-to-use, open-source app for evaluating LLM question-answer chains. Assemble modular LLM QA chain components w/ @LangChainAI. Use LLMs to generate a test set and grade the chain.
Built by 🛠️ - me, @sfgunslinger, @thebengoldberg
Link - autoevaluator.langchain.com
Inspired by 1) @AnthropicAI - model-written eval sets and 2) @OpenAI - model-graded evaluation. This app combines both of these ideas into a single workspace, auto-generating a QA test set for a given input doc and auto-grading the result of the user-specified QA chain.
You can use it two ways: 1) Demo mode: pre-loaded w/ the @karpathy episode from the @lexfridman pod and a test set. 2) Playground mode: upload your own doc and / or test set. In both cases, you can test QA chain configs and compare results (table and visually).
I'm open-sourcing a tool I use to auto-evaluate LLM Q+A chains: given input docs, the app will use an LLM to auto-generate a Q+A eval set, run it on a user-selected chain (model, retriever, etc) built w/ @LangChainAI, use an LLM to grade the results, and store each expt. github.com/PineappleExpre…
There are many model (@OpenAI, @AnthropicAI, @huggingface), retriever (SVM, vectorstores), and parameter (chunk size, etc) options. This lets you easily assemble combinations and evaluate them for Q+A (scoring and latency) on your docs of interest ...
It uses an LLM to generate the eval set and an LLM as a grader. The prompts can be easily tuned (see code below) and you can ask the LLM grader to explain itself. It uses ideas from some helpful discussion w/ @jerryjliu0 on retrieval scoring ... github.com/PineappleExpre…
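A minimal sketch of the generate-then-grade pattern w/ @LangChainAI (the transcript path and the grading prompt wording are illustrative; predictions would come from the QA chain under test):

```python
# Sketch: LLM-generated eval set + LLM grader w/ a tunable prompt that asks for an explanation
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(temperature=0)

# 1) Auto-generate question/answer pairs from the input doc (placeholder path)
doc_text = open("transcript.txt").read()
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])   # list of {"question", "answer"} dicts

# 2) Grade predictions w/ an LLM; the prompt is easy to tune and asks the grader to explain itself
grade_prompt = PromptTemplate(
    input_variables=["query", "answer", "result"],
    template=(
        "You are grading a student answer against a reference answer.\n"
        "QUESTION: {query}\nREFERENCE ANSWER: {answer}\nSTUDENT ANSWER: {result}\n"
        "Grade the student answer as CORRECT or INCORRECT and explain your reasoning."
    ),
)
eval_chain = QAEvalChain.from_llm(llm, prompt=grade_prompt)

# Predictions would come from the QA chain under test (model + retriever config)
predictions = [{"result": "..."} for _ in eval_set]
graded = eval_chain.evaluate(
    eval_set, predictions,
    question_key="question", answer_key="answer", prediction_key="result",
)
print(graded)
```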
Finally got GPT4 API access, so built an app to test it: here's a Q+A assistant for all 121 episodes of the @theallinpod. You can ask any question abt the shows. It uses the @OpenAI Whisper model for audio -> text, @pinecone, @LangChainAI. App is here: besties-gpt.fly.dev
There is a perf vs latency trade-off for GPT4 vs ChatGPT (3.5-turbo). I used @LangChainAI to generate a QA eval set of 52 questions (w/ manual curation) and used an LLM to score them. GPT4 is better, but they are close (left below) and GPT4 is ~2x slower (right, w/ k = # of similarity search docs retrieved)
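A rough sketch of how that comparison can be run, assuming a simple retrieval QA chain (the index contents here are stand-ins for the episode transcripts):

```python
# Sketch: time GPT-4 vs gpt-3.5-turbo on the same retriever and question
import time
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Stand-in texts; in practice this is the vectorstore built over the episode transcripts
retriever = FAISS.from_texts(
    ["The besties debated AI regulation.", "They also covered interest rates."],
    OpenAIEmbeddings(),
).as_retriever(search_kwargs={"k": 2})   # k = number of similarity-search docs stuffed into the prompt

question = "What did the hosts say about AI?"
for model in ["gpt-3.5-turbo", "gpt-4"]:
    chain = RetrievalQA.from_chain_type(ChatOpenAI(model_name=model, temperature=0), retriever=retriever)
    start = time.time()
    answer = chain.run(question)
    print(f"{model}: {time.time() - start:.1f}s -> {answer}")
```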
@LangChainAI eval tooling is v useful. Notebooks to generate the QA eval set based on the pod episodes and score them (using an LLM as a grader) are below. V interested in further ideas on eval and have been discussing w/ @hwchase17. Thoughts welcome! github.com/PineappleExpre…
I built an app that uses ChatGPT for question-answering over all 365 episodes of the @lexfridman podcast. Uses @OpenAI Whisper model for audio-to-text and @LangChainAI. All code is open source (linked below). App: lex-gpt.fly.dev
I used @karpathy's Whisper transcriptions for the first 325 episodes and generated the rest. I used @LangChainAI for splitting transcriptions / writing embeddings to @pinecone, LangChainJS for VectorDBQA, and @mckaywrigley's UI template. Some notes below ...
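A minimal sketch of that pipeline, assuming the 2023-era OpenAI and pinecone-client APIs (file paths, the Pinecone environment, and the index name are placeholders):

```python
# Sketch: transcribe remaining episodes, split, and write embeddings to Pinecone
import os
import openai
import pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Whisper API for the episodes w/o existing transcriptions (audio path is a placeholder)
with open("episode_360.mp3", "rb") as f:
    transcript = openai.Audio.transcribe("whisper-1", f)["text"]

# Split the transcript and keep the episode as metadata for citations
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
texts = splitter.split_text(transcript)
metadatas = [{"episode": "episode_360"} for _ in texts]

# Write embeddings to an existing Pinecone index (env and index name are placeholders)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east-1-aws")
Pinecone.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas, index_name="lex-gpt")
```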
1/ Chunk size has an influence on performance. I used @LangChainAI QAGenerationChain to create an eval set on the @karpathy episode and QAEvalChain to eval across chunk sizes. Interested in ideas to address this, e.g. from @gpt_index / @jerryjliu0.
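A minimal sketch of the chunk-size sweep (transcript path and eval set are placeholders; the eval set would be generated w/ QAGenerationChain and the grading follows the same QAEvalChain pattern as above):

```python
# Sketch: sweep chunk size and grade QA quality for each setting
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.evaluation.qa import QAEvalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

transcript = open("karpathy_episode.txt").read()   # placeholder transcript path
# Placeholder eval set; in practice generated w/ QAGenerationChain
eval_set = [{"question": "What is backpropagation?", "answer": "..."}]

llm = ChatOpenAI(temperature=0)
grader = QAEvalChain.from_llm(llm)

for chunk_size in [500, 1000, 2000]:
    chunks = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=50).split_text(transcript)
    retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever()
    qa = RetrievalQA.from_chain_type(llm, retriever=retriever)

    predictions = [{"result": qa.run(ex["question"])} for ex in eval_set]
    grades = grader.evaluate(eval_set, predictions,
                             question_key="question", answer_key="answer", prediction_key="result")
    print(chunk_size, grades)
```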