first, humans tend to underspecify the first version of their prompt. if they're in the right environment where they can get a near-instantaneous LLM response in the same interface (e.g., ChatGPT, Claude, the OpenAI playground), they just want to see what the LLM can do
there's a lot of literature on LLM sensemaking from the HCI community here (our own "who validates the validators" paper is one of many), but I still think LLM sensemaking is woefully unexplored, especially with respect to where it sits in the MLOps lifecycle
not only do people want to just see what the LLM can do, but they also don't fully know what they are supposed to say in the prompt, or what answer they want (they won't know it until they see it).
I think of a prompt as a form you need to fill out to submit your task to the LLM, and the fields of this form are unknown and dynamic (i.e., task-specific). a prompt writing tool can make these fields more known
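roughly what I mean by the form framing, as a sketch (the task and field names here are made up for illustration):

```python
# A rough sketch of the "prompt as a form" framing (field names invented
# for illustration): the fields are task-specific and usually unknown to
# the prompt writer at first.
form_fields = {
    "task": "summarize this customer support ticket",
    "definitions": None,      # which terms need defining? unknown up front
    "output_format": None,    # headers, JSON keys, length? unknown up front
    "edge_cases": None,       # what to do with empty or off-topic tickets?
}

# A prompt-writing tool can surface the unfilled fields and nudge the human
# (or an LLM) to fill them in.
missing = [k for k, v in form_fields.items() if v is None]
print("fields still to fill:", missing)
```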
second, humans derive a lot of the prompt content based on their sensemaking. if they observe a weird output, they edit their prompt (usually by adding another instruction). many of these edits are clarifications/definitions
for example (see our DocETL paper), if you are an investigative journalist and want an LLM to find all instances of police misconduct in a report/document, you have to define misconduct. there are many types of misconduct, each of which may require its own definition
the LLM-generated prompt writer is GREAT here to relieve blank page syndrome for defining important terms in prompts. if I ask Claude to generate a prompt for "find all instances of police misconduct in this document", it makes an attempt to start definitions, which I can then refine
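to make that concrete, an LLM-drafted prompt might look something like this (illustrative; not the exact prompt Claude produced):

```python
# Illustrative draft of an LLM-generated prompt with definitions the human
# can then refine (not the exact prompt Claude produced).
draft_prompt = """
Find all instances of police misconduct in the document below.

For this task, "police misconduct" includes:
- Excessive force: force beyond what department policy permits for the situation.
- Dishonesty: false statements in reports, testimony, or to investigators.
- Unlawful search or seizure: searches without a warrant, consent, or exigency.

For each instance, return the type of misconduct, a supporting quote, and the page number.

Document:
{document}
"""
```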
third, with DocETL (our AI-powered data processing tool), some of our users don't have lots of programming experience, and as a result, I've sent them starter pipelines they can use to process their data.
surprisingly, many of them can move forward without me: tweaking pipeline prompts, editing them, adding new operations, etc. I think an AI assistant could do my job of drafting the initial pipeline
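a starter pipeline is basically just a couple of prompts wired together, roughly like this sketch (keys are illustrative, not exact DocETL syntax):

```python
# Rough sketch of what a "starter pipeline" hands a new user (keys are
# illustrative, not exact DocETL syntax): a couple of prompts they can
# tweak without writing any orchestration code themselves.
starter_pipeline = {
    "operations": [
        {
            "name": "extract_complaints",
            "type": "map",     # one LLM call per document
            "prompt": "List every complaint mentioned in this ticket: {{ input.text }}",
        },
        {
            "name": "summarize_themes",
            "type": "reduce",  # one LLM call over all extracted complaints
            "prompt": "Group these complaints into themes and summarize each: {{ inputs }}",
        },
    ],
}
```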
overall I think prompting is going to be a collaborative effort between humans and LLMs in the future. humans alone are limited; LLM-written prompts alone are limited (you need to have _some_ human input and feedback to solve the human's fuzzy or underspecified task).
Our (first) DocETL preprint is now on arXiv! "DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing" It has been almost 2 years in the making, so I am very happy we hit this milestone :-) arxiv.org/abs/2410.12189
DocETL is a framework for LLM-powered unstructured data processing and analysis. The big new idea in this paper is to automatically rewrite user-specified pipelines into a sequence of finer-grained and more accurate operators.
I'll mention two big contributions in this paper. First, we present a rich suite of operators, with three entirely new operators to deal with decomposing complex documents: the split, gather, and resolve operators.
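conceptually, here's what the three operators do (a sketch of the ideas only, not DocETL's actual implementation or API):

```python
# Conceptual sketch of the split, gather, and resolve operators
# (illustrative only; not DocETL's implementation or API).

def split(doc: str, chunk_size: int) -> list[str]:
    """Break a long document into chunks that fit in an LLM's context."""
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def gather(chunks: list[str], window: int = 1) -> list[str]:
    """Attach neighboring chunks as peripheral context, so each chunk can
    be processed with enough surrounding information."""
    gathered = []
    for i, chunk in enumerate(chunks):
        before = "".join(chunks[max(0, i - window):i])
        after = "".join(chunks[i + 1:i + 1 + window])
        gathered.append(
            f"[context before]\n{before}\n[chunk]\n{chunk}\n[context after]\n{after}"
        )
    return gathered

def resolve(entities: list[str], same_entity) -> list[str]:
    """Canonicalize near-duplicate entities (e.g., 'Officer J. Smith' vs
    'John Smith') before downstream aggregation; same_entity would
    typically be an LLM-backed comparison."""
    canonical: list[str] = []
    for e in entities:
        if not any(same_entity(e, c) for c in canonical):
            canonical.append(e)
    return canonical
```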
DocETL is our agentic system for LLM-powered data processing pipelines. Time for this week’s technical deep dive on _gleaning_, our automated technique to improve accuracy by iteratively refining outputs 🧠🔍 (using LLM-as-judge!)
2/ LLMs often don't return perfect results on the first try. Consider extracting insights from user logs with an LLM. An LLM might miss important behaviors or include extraneous information. These issues could lead to misguided product decisions or wasted engineering efforts.
3/ DocETL's gleaning feature uses the power of LLMs themselves to validate and refine their own outputs, creating a self-improving loop that significantly boosts output quality.
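the core loop is simple; here's a minimal sketch (llm() is a placeholder for whatever client you use, and the prompts are illustrative):

```python
# Minimal sketch of gleaning: generate, have an LLM-as-judge critique, and
# refine until the judge is satisfied or we hit a round limit.

def llm(prompt: str) -> str:
    # placeholder: swap in a real LLM client call here
    return "<llm response>"

def glean(task_prompt: str, document: str, max_rounds: int = 3) -> str:
    output = llm(f"{task_prompt}\n\nDocument:\n{document}")
    for _ in range(max_rounds):
        # LLM-as-judge: ask for problems with the current output
        critique = llm(
            "You are validating an LLM output.\n"
            f"Task: {task_prompt}\nOutput: {output}\n"
            "List anything missing or extraneous, or reply DONE."
        )
        if "DONE" in critique:
            break
        # refine the output using the validator's feedback
        output = llm(
            f"Task: {task_prompt}\nDocument:\n{document}\n"
            f"Previous output: {output}\nValidator feedback: {critique}\n"
            "Produce an improved output that addresses the feedback."
        )
    return output
```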
LLMs have made exciting progress on hard tasks! But they still struggle to analyze complex, unstructured documents (including today's Gemini 1.5 Pro 002).
2/ Let's illustrate DocETL with an example task: analyzing presidential debates over the last 40 years to see what topics candidates discussed, & how the viewpoints of Democrats and Republicans evolved. The combined debate transcripts span ~740k words, exceeding context limits of most LLMs.
3/ But even for Gemini 1.5 Pro (2M token context limit), when given the entire dataset at once, it only reports on the evolution of 5 themes across all the debates! And, the reports get progressively worse as the output goes on. docetl.com/#demo-gemini-o…
recently been studying prompt engineering through a human-centered (developer-centered) lens. here are some fun tips i’ve learned that don’t involve acronyms or complex words
if you don’t exactly specify the structure you want the response to take on, down to the headers or parentheses or valid attributes, the response structure may vary between LLM calls, which makes it hard to parse reliably in production
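e.g., spell the structure out explicitly (the schema and wording below are just illustrative, and {ticket} is a placeholder):

```python
# Sketch of pinning down the response structure explicitly. The schema is
# illustrative; the point is to spell out every key and allowed value you
# expect rather than leaving structure to the model.
prompt = """
Summarize the support ticket below.

Respond with exactly this structure and nothing else:
{
  "summary": "<one sentence>",
  "severity": "<one of: low, medium, high>",
  "follow_up_needed": <true or false>
}

Ticket:
{ticket}
"""
```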
play around with the simplest prompt you can think of & run it a bunch of times on different inputs to build intuition for how LLMs “behave” for your task. then start adding instructions to your prompt in the form of rules, e.g., “do not do X”
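something like this sketch (llm() and the inputs are placeholders):

```python
# Sketch of the "start simple, run it a bunch" loop: see how the model
# behaves across inputs before piling on rules.

def llm(prompt: str) -> str:
    # placeholder: swap in a real LLM client call here
    return "<llm response>"

inputs = ["ticket 1 text ...", "ticket 2 text ...", "ticket 3 text ..."]
simple_prompt = "Summarize this support ticket: {text}"

for text in inputs:
    print(llm(simple_prompt.format(text=text)))

# after eyeballing the outputs, add rules one at a time, e.g.:
# "Do not include customer names." / "Do not speculate about root cause."
```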
thinking about how, in the last year, > 5 ML engineers have told me, unprompted, that they want to do less ML & more software engineering. not because it’s more lucrative to build ML platforms & devtools, but because models can be too unpredictable & make for a stressful job
imo the biggest disconnect between ML-related research & production is that researchers aren’t aware of the human-centric efforts required to sustain ML performance. It feels great to prototype a good model, but on-calls battling unexpected failures chip away at this success
imagine that your career & promos are not about demonstrating good performance for a fixed dataset, but about how quickly on average you are able to respond to every issue some stakeholder has with some prediction. it is just not a sustainable career IMO
Been working on LLMs in production lately. Here is an initial thoughtdump on LLMOps trends I’ve observed, compared/contrasted with their MLOps counterparts (no, this thread was not written by ChatGPT)
1) Experimentation is tangibly more expensive (and slower) in LLMOps. These APIs are not cheap, nor is it really feasible to experiment w/ smaller/cheaper models and expect behaviors to stay consistent when calling bigger models
1.5) we know from MLOps research that high experimentation velocity is crucial for putting and keeping pipelines in prod. A fast way is to collect a few examples, load up a notebook, and try out a heck of a lot of different prompts, which calls for prompt versioning & management systems
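even something very simple goes a long way here; a tiny sketch of prompt versioning (names invented):

```python
# Tiny sketch of prompt versioning (names invented): keep every prompt you
# try, keyed by a content hash, so notebook experiments are reproducible
# and you can diff what changed between runs.
import hashlib
import json
import time

PROMPT_LOG = "prompt_versions.jsonl"

def register_prompt(name: str, template: str) -> str:
    version = hashlib.sha256(template.encode()).hexdigest()[:8]
    with open(PROMPT_LOG, "a") as f:
        f.write(json.dumps({
            "name": name,
            "version": version,
            "template": template,
            "registered_at": time.time(),
        }) + "\n")
    return version

v = register_prompt("summarize_ticket", "Summarize this support ticket: {text}")
print("prompt version:", v)
```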