Meet raggy: an interactive debugging interface for RAG pipelines. It pairs a Python library of RAG primitives with a UI that lets devs inspect, edit, & rerun steps in real time. raggy precomputes many indexes for retrieval upfront, so you can easily swap them out when debugging!
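Not raggy's actual API, just a hedged sketch of the pattern described above: precompute a grid of retrieval indexes once so a debugging session can rerun just the retrieval step against a different index, without re-embedding the corpus. Every name here (`Index`, `retrieve`, the chunk sizes and embedder names) is made up for illustration.

```python
# Hypothetical sketch: precompute several retrieval indexes upfront,
# then swap between them while stepping through a RAG pipeline.
from dataclasses import dataclass

@dataclass
class Index:
    name: str
    chunk_size: int
    embedder: str

# Precompute a grid of indexes once, before any debugging session.
indexes = {
    (cs, emb): Index(f"idx-{cs}-{emb}", cs, emb)
    for cs in (256, 512, 1024)
    for emb in ("minilm", "bge-small")
}

def retrieve(index: Index, query: str, k: int = 5) -> list[str]:
    # Placeholder for a real vector search over the chosen index.
    return [f"[{index.name}] chunk {i} for {query!r}" for i in range(k)]

# At a breakpoint, rerun only retrieval against a different index.
for key in ((512, "minilm"), (256, "bge-small")):
    print(retrieve(indexes[key], "What did candidates say about healthcare?"))
```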
(2/7) Following the release of DocETL (our data processing framework), we observed users struggling to articulate what they want & changing their preferences based on what the LLM could or couldn't do well. The main challenge is that no one knows what outputs they want until they see them; that is, agentic workflows are inherently iterative.
First, there needs to be a way of getting theoretically valid task decompositions. Simply asking an LLM to break down a complex task over lots of data may result in a logically incorrect plan. For example, the LLM might choose the wrong data operation (projection instead of aggregation), yielding an entirely different pipeline.
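To make the projection-vs-aggregation distinction concrete, here's a small illustrative sketch; the data and operator framing are made up, not tied to any specific framework. The key difference: a projection emits one output per input row, while an aggregation collapses rows that share a key, so picking the wrong one changes the pipeline's output shape entirely.

```python
# Illustrative: why the choice of operator changes the pipeline.
from collections import defaultdict

reviews = [
    {"product": "widget", "rating": 4},
    {"product": "widget", "rating": 2},
    {"product": "gadget", "rating": 5},
]

# Projection (per-record map): transforms each row independently.
# Task "flag low ratings" -> one output per input.
flagged = [{**r, "low": r["rating"] <= 2} for r in reviews]

# Aggregation (grouped reduce): combines rows that share a key.
# Task "average rating per product" -> one output per group.
totals = defaultdict(list)
for r in reviews:
    totals[r["product"]].append(r["rating"])
averages = {p: sum(v) / len(v) for p, v in totals.items()}

print(flagged)   # three rows out, same as in
print(averages)  # {'widget': 3.0, 'gadget': 5.0} -- two groups out
```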
https://twitter.com/simonw/status/1851771710510633081
First, humans tend to underspecify the first version of their prompt. If they're in the right environment, where they can get a near-instantaneous LLM response in the same interface (e.g., ChatGPT, Claude, the OpenAI Playground), they just want to see what the LLM can do.
DocETL is a framework for LLM-powered unstructured data processing and analysis. The big new idea in this paper is to automatically rewrite user-specified pipelines into a sequence of finer-grained and more accurate operators.
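A toy sketch of that rewrite idea: a single coarse map over a huge document gets rewritten into split, map-per-chunk, then reduce. None of these function names are DocETL's real API; they just mirror the concept under stated assumptions (a stand-in `llm` call, character-based chunking).

```python
# Illustrative sketch of rewriting a coarse operator into finer-grained ones.

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"<answer to: {prompt[:40]}...>"

# Original user-specified pipeline: one map over the whole document.
def extract_topics_coarse(doc: str) -> str:
    return llm(f"List every topic discussed in:\n{doc}")  # may exceed context

# Rewritten pipeline: split into chunks, map each chunk, then reduce.
def split(doc: str, size: int = 2000) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def extract_topics_fine(doc: str) -> str:
    per_chunk = [llm(f"List topics in:\n{c}") for c in split(doc)]
    return llm("Merge and deduplicate these topic lists:\n" + "\n".join(per_chunk))
```

The finer-grained plan trades one oversized call for many small calls plus a merge step, which is where the accuracy gains come from.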
2/ LLMs often don't return perfect results on the first try. Consider extracting insights from user logs with an LLM: it might miss important behaviors or include extraneous information. These issues can lead to misguided product decisions or wasted engineering effort.
2/ Let's illustrate DocETL with an example task: analyzing presidential debates over the last 40 years to see what topics candidates discussed, & how the viewpoints of Democrats and Republicans evolved. The combined debate transcripts span ~740k words, exceeding context limits of most LLMs.
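Rough arithmetic on why this exceeds context limits; the tokens-per-word ratio and the 128k-token window are assumptions for illustration, not figures from the paper:

```python
# Back-of-envelope: do the debate transcripts fit in one context window?
words = 740_000
tokens_per_word = 1.3          # rough average for English text (assumption)
context_window = 128_000       # e.g., a typical large-context model (assumption)

total_tokens = int(words * tokens_per_word)          # ~962,000 tokens
chunks_needed = -(-total_tokens // context_window)   # ceiling division -> 8

print(f"{total_tokens:,} tokens -> at least {chunks_needed} chunks")
```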
https://twitter.com/gantry_ml/status/1600508983814537222
Split it into train/val/test, iterated on the feature set a bit, and eventually got a good test accuracy. Then I “productionized” it, i.e., put it in a dataswarm pipeline (precursor to Airflow afaik). Then I went back to school before the pipeline ran more than once.
https://twitter.com/eugeneyan/status/1563007015295008769
(1) "quality" is not easily defined by a human (are you gonna comb through every feature column and create bounds?), and (2) the distribution matters; it's hard to look at one record alone and know whether it's "broken".
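One common way to operationalize point (2) is to compare a column's live distribution against a reference window instead of inspecting records one at a time. A minimal sketch using a two-sample KS test from scipy; the window sizes, the simulated shift, and the 0.05 threshold are all made up for illustration:

```python
# Minimal sketch: distribution-level data quality check with a KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # last week's values
live = rng.normal(loc=0.3, scale=1.0, size=5_000)       # today's values (shifted)

stat, p_value = ks_2samp(reference, live)
if p_value < 0.05:
    print(f"Distribution shift detected (KS={stat:.3f}, p={p_value:.2e})")
```

Note that every individual record in `live` looks perfectly plausible on its own; only the aggregate comparison reveals the shift.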
https://twitter.com/mhajabri/status/1460060433482665990
1. Convince yourself that operationalizing ML, even as a 1-person team, is a hard problem. What are some differences between a Kaggle project and a production ML service? Do some tutorials -- here's a more-than-hello-world toy ML pipeline I've built: github.com/shreyashankar/…