Meet raggy: an interactive debugging interface for RAG pipelines. It pairs a Python library of RAG primitives with a UI that lets devs inspect, edit, & rerun steps in real time. raggy precomputes many indexes for retrieval upfront, so you can easily swap them out when debugging!
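Not raggy's actual API, just a hedged sketch of the pattern described above: precompute a grid of retrieval indexes once so a debugging session can rerun just the retrieval step against a different index, without re-embedding the corpus. Every name here (`Index`, `retrieve`, the chunk sizes and embedder names) is made up for illustration.

```python
# Hypothetical sketch: precompute several retrieval indexes upfront,
# then swap between them while stepping through a RAG pipeline.
from dataclasses import dataclass

@dataclass
class Index:
    name: str
    chunk_size: int
    embedder: str

# Precompute a grid of indexes once, before any debugging session.
indexes = {
    (cs, emb): Index(f"idx-{cs}-{emb}", cs, emb)
    for cs in (256, 512, 1024)
    for emb in ("minilm", "bge-small")
}

def retrieve(index: Index, query: str, k: int = 5) -> list[str]:
    # Placeholder for a real vector search over the chosen index.
    return [f"[{index.name}] chunk {i} for {query!r}" for i in range(k)]

# At a breakpoint, rerun only retrieval against a different index.
for key in ((512, "minilm"), (256, "bge-small")):
    print(retrieve(indexes[key], "What did candidates say about healthcare?"))
```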
(2/7) Following the release of DocETL (our data processing framework), we observed users struggling to articulate what they want & changing their preferences based on what the LLM could or couldn't do well. The main challenge is that no one knows what outputs they want until they see them; that is, agentic workflows are inherently iterative.
First, there needs to be a way of getting theoretically valid task decompositions. Simply asking an LLM to break down a complex task over lots of data may result in a logically incorrect plan. For example, the LLM might choose the wrong data operation (projection instead of aggregation), yielding an entirely different pipeline.
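To make the projection-vs-aggregation distinction concrete, here's a small illustrative sketch; the data and operator framing are made up, not tied to any specific framework. The key difference: a projection emits one output per input row, while an aggregation collapses rows that share a key, so picking the wrong one changes the pipeline's output shape entirely.

```python
# Illustrative: why the choice of operator changes the pipeline.
from collections import defaultdict

reviews = [
    {"product": "widget", "rating": 4},
    {"product": "widget", "rating": 2},
    {"product": "gadget", "rating": 5},
]

# Projection (per-record map): transforms each row independently.
# Task "flag low ratings" -> one output per input.
flagged = [{**r, "low": r["rating"] <= 2} for r in reviews]

# Aggregation (grouped reduce): combines rows that share a key.
# Task "average rating per product" -> one output per group.
totals = defaultdict(list)
for r in reviews:
    totals[r["product"]].append(r["rating"])
averages = {p: sum(v) / len(v) for p, v in totals.items()}

print(flagged)   # three rows out, same as in
print(averages)  # {'widget': 3.0, 'gadget': 5.0} -- two groups out
```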
https://twitter.com/simonw/status/1851771710510633081
First, humans tend to underspecify the first version of their prompt. If they're in the right environment, where they can get a near-instantaneous LLM response in the same interface (e.g., ChatGPT, Claude, the OpenAI Playground), they just want to see what the LLM can do.
DocETL is a framework for LLM-powered unstructured data processing and analysis. The big new idea in this paper is to automatically rewrite user-specified pipelines into a sequence of finer-grained and more accurate operators.
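A toy sketch of that rewrite idea: a single coarse map over a huge document gets rewritten into split, map-per-chunk, then reduce. None of these function names are DocETL's real API; they just mirror the concept under stated assumptions (a stand-in `llm` call, character-based chunking).

```python
# Illustrative sketch of rewriting a coarse operator into finer-grained ones.

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"<answer to: {prompt[:40]}...>"

# Original user-specified pipeline: one map over the whole document.
def extract_topics_coarse(doc: str) -> str:
    return llm(f"List every topic discussed in:\n{doc}")  # may exceed context

# Rewritten pipeline: split into chunks, map each chunk, then reduce.
def split(doc: str, size: int = 2000) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def extract_topics_fine(doc: str) -> str:
    per_chunk = [llm(f"List topics in:\n{c}") for c in split(doc)]
    return llm("Merge and deduplicate these topic lists:\n" + "\n".join(per_chunk))
```

The finer-grained plan trades one oversized call for many small calls plus a merge step, which is where the accuracy gains come from.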
2/ LLMs often don't return perfect results on the first try. Consider extracting insights from user logs with an LLM: it might miss important behaviors or include extraneous information. These issues can lead to misguided product decisions or wasted engineering effort.
2/ Let's illustrate DocETL with an example task: analyzing presidential debates over the last 40 years to see what topics candidates discussed, & how the viewpoints of Democrats and Republicans evolved. The combined debate transcripts span ~740k words, exceeding context limits of most LLMs.
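Rough arithmetic on why this exceeds context limits; the tokens-per-word ratio and the 128k-token window are assumptions for illustration, not figures from the paper:

```python
# Back-of-envelope: do the debate transcripts fit in one context window?
words = 740_000
tokens_per_word = 1.3          # rough average for English text (assumption)
context_window = 128_000       # e.g., a typical large-context model (assumption)

total_tokens = int(words * tokens_per_word)          # ~962,000 tokens
chunks_needed = -(-total_tokens // context_window)   # ceiling division -> 8

print(f"{total_tokens:,} tokens -> at least {chunks_needed} chunks")
```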
https://twitter.com/gantry_ml/status/1600508983814537222
Split it into train/val/test, iterated on the feature set a bit, and eventually got a good test accuracy. Then I “productionized” it, i.e., put it in a dataswarm pipeline (precursor to Airflow afaik). Then I went back to school before the pipeline ran more than once.
https://twitter.com/eugeneyan/status/1563007015295008769
(1) "quality" is not easily defined by a human (are you gonna comb through every feature column and create bounds?), and (2) the distribution matters; it's hard to look at one record alone and know whether it's "broken".
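One common way to operationalize point (2) is to compare a column's live distribution against a reference window instead of inspecting records one at a time. A minimal sketch using a two-sample KS test from scipy; the window sizes, the simulated shift, and the 0.05 threshold are all made up for illustration:

```python
# Minimal sketch: distribution-level data quality check with a KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # last week's values
live = rng.normal(loc=0.3, scale=1.0, size=5_000)       # today's values (shifted)

stat, p_value = ks_2samp(reference, live)
if p_value < 0.05:
    print(f"Distribution shift detected (KS={stat:.3f}, p={p_value:.2e})")
```

Note that every individual record in `live` looks perfectly plausible on its own; only the aggregate comparison reveals the shift.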
https://twitter.com/mhajabri/status/1460060433482665990
1. Convince yourself that operationalizing ML, even as a 1-person team, is a hard problem. What are some differences between a Kaggle project and a production ML service? Do some tutorials -- here's a more-than-hello-world toy ML pipeline I've built: github.com/shreyashankar/…