I probably should have written this years ago, but here are some MLOps principles I think every ML platform (codebase, data management platform) should have: 1/n
Beginner: use pre-commit hooks. ML code is so, so ugly. Start with the basics — black, isort — then add pydocstyle, mypy, check-ast, eof-fixer, etc. Honestly I put these in my research codebases too, lol. 2/n
Beginner: always train models using *committed* code, even in development. This allows you to attach a git hash to every model. Don’t make ad hoc changes in Jupyter & train a model. Someday someone will want to know what code generated that model… 3/n
Beginner: use a monorepo. Besides known software benefits (simplified build & deps), a monorepo reduces provenance & logging overhead (critical for ML). I’ve seen separate codebases for model training, serving, data cleaning, etc & it’s a mess to figure out what’s going. on 4/n
Beginner: version your training & validation data! Don’t overwrite train.pq or train.csv, because later on, you may want to look at the data a specific model was trained on. 5/n
Beginner: put SLAs on data quality. ML pipelines often break bc of some data-related bug. There are preliminary tools to automate data quality checks but we can't solely rely on them. Have an on-call rotation to manually sanity-check the data (eg look at histograms of cols) 6/n
Intermediate: put *some* effort into ML monitoring. Plenty of ppl are like, “oh we have delayed labels so we don’t monitor accuracy.” Make an on-call rotation for this: manually label a handful of predictions daily, and create a job to update the metric. Some info > no info 7/n
Intermediate: retrain models on a cadence (eg monthly) rather than when a KL divergence for an arbitrary feature arbitrarily drops. A cadence is less cognitive overhead. Do some data science to identify a cadence and make sure a human validates the new model every rotation 8/n
Advanced: shadow a less-complicated model in production so you can easily serve those predictions with one click if the main model goes down / is broken. ML bugs can take a while to diagnose so it’s good to have a reliable backup 9/n
Advanced: put ML-related tests in CI. You can do almost anything in Github Actions. Create test commands that: overfit your training pipeline to a tiny batch of data, verify data shapes, check integrity of features, etc. Whatever your product needs. 10/n
• • •
Missing some Tweet in this thread? You can try to
force a refresh
what makes LLM frameworks feel unusable is that there's still so much burden for the user to figure out the bespoke amalgamation of LLM calls to ensure end-to-end accuracy. in , we've found that relying on an agent to do this requires lots of scaffolding docetl.org
first there needs to be a way of getting theoretically valid task decompositions. simply asking an LLM to break down a complex task over lots of data may result in a logically incorrect plan. for example, the LLM might choose the wrong data operation (projection instead of aggregation), and this would be a different pipeline entirely.
to solve this problem, DocETL uses hand-defined rewrite directives that can enumerate theoretically-equivalent decompositions/pipeline rewrites. the agent is then limited to creating prompts/output schemas for newly synthesized operations, according to the rewrite rules, which bounds its errors.
first, humans tend to underspecify the first version of their prompt. if they're in the right environment where they can get a near-instantaneous LLM response in the same interface (e.g., chatgpt, Claude, openai playground), they just want to see what the llm can do
there's a lot of literature on LLM sensemaking from the HCI community here (our own "who validates the validators" paper is one of many), but I still think LLM sensemaking is woefully unexplored, especially with respect to the stage in the mlops lifecycle
Our (first) DocETL preprint is now on Arxiv! "DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing" It has been almost 2 years in the making, so I am very happy we hit this milestone :-) arxiv.org/abs/2410.12189
DocETL is a framework for LLM-powered unstructured data processing and analysis. The big new idea in this paper is to automatically rewrite user-specified pipelines into a sequence of finer-grained and more accurate operators.
I'll mention two big contributions in this paper. First, we present a rich suite of operators, with three entirely new operators to deal with decomposing complex documents: the split, gather, and resolve operators.
DocETL is our agentic system for LLM-powered data processing pipelines. Time for this week’s technical deep dive on _gleaning_, our automated technique to improve accuracy by iteratively refining outputs 🧠🔍 (using LLM-as-judge!)
2/ LLMs often don't return perfect results on the first try. Consider extracting insights from user logs with an LLM. An LLM might miss important behaviors or include extraneous information. These issues could lead to misguided product decisions or wasted engineering efforts.
3/ DocETL's gleaning feature uses the power of LLMs themselves to validate and refine their own outputs, creating a self-improving loop that significantly boosts output quality.
LLMs have made exciting progress on hard tasks! But they still struggle to analyze complex, unstructured documents (including today's Gemini 1.5 Pro 002).
2/ Let's illustrate DocETL with an example task: analyzing presidential debates over the last 40 years to see what topics candidates discussed, & how the viewpoints of Democrats and Republicans evolved. The combined debate transcripts span ~740k words, exceeding context limits of most LLMs.
3/ But even for Gemini 1.5 Pro (2M token context limit), when given the entire dataset at once, it only reports on the evolution of 5 themes across all the debates! And, the reports get progressively worse as the output goes on. docetl.com/#demo-gemini-o…
recently been studying prompt engineering through a human-centered (developer-centered) lens. here are some fun tips i’ve learned that don’t involve acronyms or complex words
if you don’t exactly specify the structure you want the response to take on, down to the headers or parentheses or valid attributes, the response structure may vary between LLM calls / it is not amenable to production
play around with the simplest prompt you can think of & run it a bunch of times on different inputs to build intuition for how LLMs “behave” for your task. then start adding instructions to your prompt in the form of rules, e.g., “do not do X”