YouTube is a great source of content for LLM chat / Q+A apps. I recently added a @LangChainAI document loader to simplify this: pass in YouTube video urls, get back text documents that can be easily embedded for retrieval QA or chat (see below)🪄 github.com/hwchase17/lang…
@karpathy inspired this work a while ago w/ Whisper transcriptions of the @lexfridman pod. I used a similar pipeline to build a Q+A app, lex-gpt. @OpenAI Whisper API simplified the pipeline, so I wrapped it all in an easy-to-use @LangChainAI doc loader ..
.. see this notebook for an example going from YouTube urls to a chat app in ~10 lines of code. You can find this feature in the latest @LangChainAI releases (> v0.0.192). github.com/rlancemartin/l…
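For reference, here's a minimal sketch of that flow against a ~v0.0.192-era @LangChainAI API (the video URL and save_dir are placeholders; assumes OPENAI_API_KEY is set):

```python
# Minimal sketch: YouTube URLs -> Whisper transcripts -> retrieval QA (LangChain ~v0.0.192)
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

urls = ["https://youtu.be/<video_id>"]   # placeholder video URL
save_dir = "/tmp/youtube_audio"          # where the downloaded audio lands

# Download audio w/ yt_dlp, then transcribe each file w/ the OpenAI Whisper API
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()

# Split transcripts, embed, and stand up a retrieval QA chain
splits = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150).split_documents(docs)
vectordb = FAISS.from_documents(splits, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(ChatOpenAI(temperature=0), retriever=vectordb.as_retriever())
print(qa.run("What topics does the video cover?"))
```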
I'll be spending a lot of time on Document Loaders at @LangChainAI and welcome any ideas / feedback: 1) what document loaders / integrations are missing? 2) what tutorials are missing? etc.
There are a lot of questions abt smaller, open-source LLMs vs larger, closed models for tasks like question answering. So, we added @MosaicML MPT-7B & @lmsysorg Vicuna-13b to the @LangChainAI auto-evaluator. You can test them on your own Q+A use-case ... autoevaluator.langchain.com/playground
... great pod w/ @jefrankle, @swyx, @abhi_venigalla on MPT-7B, so I used the auto-evaluator to benchmark it on a test set of 5 Q+A pairs from the GPT3 paper. Results are close to the larger models and it's very fast (kudos to the @MosaicML inference team!) ...
... @sfgunslinger also deployed Vicuna-13b on @replicatehq and it achieves performance parity w/ the larger models on this test set. Prompt eng may further improve this (v helpful discussion w/ the folks at @replicatehq / @JoeEHoover); we are looking into improving latency ...
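A rough sketch of how a Replicate-hosted open model slots into the same kind of QA chain, assuming the 2023-era Replicate integration in @LangChainAI (the model version string is a placeholder; grab the current one from replicate.com):

```python
# Sketch: drop a Replicate-hosted Vicuna-13b into a retrieval QA chain
from langchain.llms import Replicate
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Placeholder model string: copy the current owner/model:version from replicate.com
vicuna = Replicate(
    model="replicate/vicuna-13b:<version-hash>",
    input={"temperature": 0.1, "max_length": 500},
)

# Tiny in-memory index so the sketch is self-contained; swap in your own docs
retriever = FAISS.from_texts(
    ["GPT-3 was evaluated in zero-shot, one-shot, and few-shot settings."],
    OpenAIEmbeddings(),
).as_retriever()

qa_chain = RetrievalQA.from_chain_type(llm=vicuna, retriever=retriever)
print(qa_chain.run("How was GPT-3 evaluated?"))
```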
I've seen questions about @AnthropicAI's 100k context window: can it compete w/ vectorDB retrieval? We added Claude-100k to the @LangChainAI auto-evaluator app so you can compare for yourself (details showing Claude-100k results below). App is here: autoevaluator.langchain.com/playground
.. there are many retrieval approaches for Q+A that fetch docs relevant to a question, followed by LLM answer synthesis. But as LLM context windows grow, retrieval may not be needed since you can just stuff the full doc(s) into the prompt (red in the diagram) ..
.. we tested on Q+A eval sets from the GPT3 paper and SF Building Codes (75- and 51-page PDFs). @AnthropicAI 100k was impressively close in terms of performance to various retrieval methods, but does have higher latency. See details here: blog.langchain.dev/auto-evaluatio…
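A minimal sketch of the two approaches being compared, assuming a mid-2023 @LangChainAI API (the PDF path and the Claude model name are placeholders):

```python
# Sketch: stuff the full doc into a 100k-context model vs. classic retrieval QA
from langchain.chat_models import ChatAnthropic, ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = PyPDFLoader("sf_building_code.pdf").load()   # placeholder path to a long PDF
question = "When is a fire sprinkler system required?"

# Option 1: no retrieval -- pass every page into Claude's 100k context window
claude = ChatAnthropic(model="claude-v1-100k", temperature=0)   # model name may differ
stuff_chain = load_qa_chain(claude, chain_type="stuff")
full_context_answer = stuff_chain.run(input_documents=docs, question=question)

# Option 2: retrieval -- embed chunks and fetch only the relevant ones at query time
retriever = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever()
retrieval_answer = RetrievalQA.from_chain_type(ChatOpenAI(temperature=0), retriever=retriever).run(question)
```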
Here's a free-to-use, open-source app for evaluating LLM question-answer chains. Assemble modular LLM QA chain components w/ @LangChainAI. Use LLMs to generate a test set and grade the chain.
Built by 🛠️ - me, @sfgunslinger, @thebengoldberg
Link - autoevaluator.langchain.com
Inspired by 1) @AnthropicAI - model-written eval sets and 2) @OpenAI - model-graded evaluation. This app combines both of these ideas into a single workspace, auto-generating a QA test set for a given input doc and auto-grading the result of the user-specified QA chain.
You can use it two ways: 1) Demo mode: pre-loaded w/ the @karpathy episode from the @lexfridman pod and a test set. 2) Playground mode: upload your own doc and / or test set. In both cases, you can test QA chain configs and compare results (table and visually).
I'm open-sourcing a tool I use to auto-evaluate LLM Q+A chains: given input docs, the app will use an LLM to auto-generate a Q+A eval set, run it on a user-selected chain (model, retriever, etc) built w/ @LangChainAI, use an LLM to grade the results, and store each expt. github.com/PineappleExpre…
There are many model (@OpenAI, @AnthropicAI, @huggingface), retriever (SVM, vectorstores), and parameter (chunk size, etc) options. This lets you easily assemble combinations and evaluate them for Q+A (scoring and latency) on your docs of interest ...
It uses an LLM to generate the eval set and an LLM as a grader. The prompts can be easily tuned (see code below) and you can ask the LLM grader to explain itself. It uses ideas from some helpful discussion w/ @jerryjliu0 on retrieval scoring ... github.com/PineappleExpre…
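A minimal sketch of the generate-then-grade pattern w/ @LangChainAI (the transcript path and the grading prompt wording are illustrative; predictions would come from the QA chain under test):

```python
# Sketch: LLM-generated eval set + LLM grader w/ a tunable prompt that asks for an explanation
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(temperature=0)

# 1) Auto-generate question/answer pairs from the input doc (placeholder path)
doc_text = open("transcript.txt").read()
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])   # list of {"question", "answer"} dicts

# 2) Grade predictions w/ an LLM; the prompt is easy to tune and asks the grader to explain itself
grade_prompt = PromptTemplate(
    input_variables=["query", "answer", "result"],
    template=(
        "You are grading a student answer against a reference answer.\n"
        "QUESTION: {query}\nREFERENCE ANSWER: {answer}\nSTUDENT ANSWER: {result}\n"
        "Grade the student answer as CORRECT or INCORRECT and explain your reasoning."
    ),
)
eval_chain = QAEvalChain.from_llm(llm, prompt=grade_prompt)

# Predictions would come from the QA chain under test (model + retriever config)
predictions = [{"result": "..."} for _ in eval_set]
graded = eval_chain.evaluate(
    eval_set, predictions,
    question_key="question", answer_key="answer", prediction_key="result",
)
print(graded)
```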
Finally got GPT4 API access, so built an app to test it: here's a Q+A assistant for all 121 episodes of the @theallinpod. You can ask any question abt the shows. It uses the @OpenAI Whisper model for audio -> text, @pinecone, @LangChainAI. App is here: besties-gpt.fly.dev
There is a perf vs latency trade-off for GPT4 vs ChatGPT (3.5-turbo). I used @LangChainAI to generate a QA eval set of 52 questions (w/ manual curation) and used an LLM to score them. GPT4 is better, but they are close (left below) and GPT4 is ~2x slower (right, w/ k = # of similarity search docs retrieved)
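A rough sketch of how that comparison can be run, assuming a simple retrieval QA chain (the index contents here are stand-ins for the episode transcripts):

```python
# Sketch: time GPT-4 vs gpt-3.5-turbo on the same retriever and question
import time
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Stand-in texts; in practice this is the vectorstore built over the episode transcripts
retriever = FAISS.from_texts(
    ["The besties debated AI regulation.", "They also covered interest rates."],
    OpenAIEmbeddings(),
).as_retriever(search_kwargs={"k": 2})   # k = number of similarity-search docs stuffed into the prompt

question = "What did the hosts say about AI?"
for model in ["gpt-3.5-turbo", "gpt-4"]:
    chain = RetrievalQA.from_chain_type(ChatOpenAI(model_name=model, temperature=0), retriever=retriever)
    start = time.time()
    answer = chain.run(question)
    print(f"{model}: {time.time() - start:.1f}s -> {answer}")
```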
@LangChainAI eval tooling is v useful. Notebooks to generate the QA eval set based on the pod episodes and score them (using an LLM as a grader) are below. V interested in further ideas on eval and have been discussing w/ @hwchase17. Thoughts welcome! github.com/PineappleExpre…
I built an app that uses ChatGPT for question-answering over all 365 episodes of the @lexfridman podcast. Uses @OpenAI Whisper model for audio-to-text and @LangChainAI. All code is open source (linked below). App: lex-gpt.fly.dev
I used @karpathy's Whisper transcriptions for the first 325 episodes and generated the rest. I used @LangChainAI for splitting transcriptions / writing embeddings to @pinecone, LangChainJS for VectorDBQA, and @mckaywrigley's UI template. Some notes below ...
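A minimal sketch of that pipeline, assuming the 2023-era OpenAI and pinecone-client APIs (file paths, the Pinecone environment, and the index name are placeholders):

```python
# Sketch: transcribe remaining episodes, split, and write embeddings to Pinecone
import os
import openai
import pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Whisper API for the episodes w/o existing transcriptions (audio path is a placeholder)
with open("episode_360.mp3", "rb") as f:
    transcript = openai.Audio.transcribe("whisper-1", f)["text"]

# Split the transcript and keep the episode as metadata for citations
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
texts = splitter.split_text(transcript)
metadatas = [{"episode": "episode_360"} for _ in texts]

# Write embeddings to an existing Pinecone index (env and index name are placeholders)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east-1-aws")
Pinecone.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas, index_name="lex-gpt")
```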
1/ Chunk size has an influence on performance. I used @LangChainAI QAGenerationChain to create an eval set on the @karpathy episode and QAEvalChain to eval across chunk sizes. Interested in ideas to address this, e.g. from @gpt_index / @jerryjliu0.
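A minimal sketch of the chunk-size sweep (transcript path and eval set are placeholders; the eval set would be generated w/ QAGenerationChain and the grading follows the same QAEvalChain pattern as above):

```python
# Sketch: sweep chunk size and grade QA quality for each setting
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.evaluation.qa import QAEvalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

transcript = open("karpathy_episode.txt").read()   # placeholder transcript path
# Placeholder eval set; in practice generated w/ QAGenerationChain
eval_set = [{"question": "What is backpropagation?", "answer": "..."}]

llm = ChatOpenAI(temperature=0)
grader = QAEvalChain.from_llm(llm)

for chunk_size in [500, 1000, 2000]:
    chunks = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=50).split_text(transcript)
    retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever()
    qa = RetrievalQA.from_chain_type(llm, retriever=retriever)

    predictions = [{"result": qa.run(ex["question"])} for ex in eval_set]
    grades = grader.evaluate(eval_set, predictions,
                             question_key="question", answer_key="answer", prediction_key="result")
    print(chunk_size, grades)
```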