Omar Khattab
Asst professor @MIT EECS & CSAIL (@nlp_mit). Author of https://t.co/VgyLxl0oa1 and https://t.co/ZZaSzaRaZ7 (@DSPyOSS). Prev: CS PhD @StanfordNLP. Research @Databricks.
Jun 15 4 tweets 2 min read
# On why I dislike the name "multi-agent" for LLM systems.

When you talk about these systems as "multi-agent", you evoke free-form social, communication, and goal-alignment problems that don't even need to exist.

You're just architecting a single functional system, not a society.

If you're doing this properly, you're defining *structured* contracts between the modules, and you control (or delegate!) information flow. You ensure each module has access to all information/tools it may need, and ideally nothing else.
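
To make "structured contracts" concrete, here's a minimal sketch in the DSPy style. The task, field names, and module layout are illustrative, and you'd still configure an LM and retriever before running it:

```python
import dspy

# A structured contract per module: explicit inputs/outputs, not a free-form
# conversation between "agents". (Toy task & field names are illustrative.)
class GenerateQuery(dspy.Signature):
    """Write a search query that would help answer the question."""
    question = dspy.InputField()
    query = dspy.OutputField()

class AnswerFromPassages(dspy.Signature):
    """Answer the question using only the given passages."""
    question = dspy.InputField()
    passages = dspy.InputField()
    answer = dspy.OutputField()

class ResearchPipeline(dspy.Module):
    def __init__(self, k=5):
        super().__init__()
        self.gen_query = dspy.Predict(GenerateQuery)
        self.retrieve = dspy.Retrieve(k=k)
        self.answer = dspy.ChainOfThought(AnswerFromPassages)

    def forward(self, question):
        # Information flow is explicit and controlled: each module sees
        # exactly what it needs, and ideally nothing else.
        query = self.gen_query(question=question).query
        passages = self.retrieve(query).passages
        return self.answer(question=question, passages=passages)

# Before running: dspy.settings.configure(lm=..., rm=...) with your LM/retriever.
```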

It's a new type of software, of course, and we need a lot of empirical insights and good tooling! I know that, I've worked on nothing but these types of systems for close to 6 years.

But what's involved is good system architecture and all the decisions *that* entails, with highly structured module contracts, which are fuzzy in *just* the right places. Most of the time, it need not become some kind of social coordination between employees/agents who have conflicting goals.

Now, here's the BIG catch. A lot of the hard-coded architecture decisions here are (necessarily) myopic, because they depend on ephemeral things like "how long the context length of today's model is" or "how good the model is at dividing tasks".

This is why we need to think about programming languages and query languages that decouple our fundamental intent (and information flow) from the lower-level tricks that make them work.

This is not new: Modularity in conventional programming languages is an illusion that helps the *programmer* be productive. It's how we can reason about and maintain big systems.

Under the hood, any compiler worth its salt is breaking down so much of this modularity illusion — with function inlining, dead code elimination, fusing instructions, vectorizing SIMD work and parallelizing other tasks, etc.

Hope this helps! tl;dr It’s programming, not policy or management.
May 29 4 tweets 4 min read
Sigh, it's a bit of a mess.

Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place.

First, I have long advised from this account to simply avoid the benchmarks that the LLM trainers/providers work with. They're vastly more contaminated than you think. They're contaminated not only in terms of the task distribution, and not only in terms of train vs test, but actually also in terms of *the prompts that work*.

Second, let me elaborate on this prompts part. When a team trains an LLM to be good at math, they aren't just "making the LLM good at math". They're usually working with the same prompt template they'll use for evaluation. Because LLMs are statistical models, not logical computers, learning to be good at math in one prompt template does NOT mean that the model is good at math in "every reasonable prompt".

Not at all. It's still super sensitive. You may think that prompt sensitivity is a thing from 2022, but no, it's alive and well and has never been more severe. Good LLM post-training can alleviate this a bit, but it actually sometimes comes at the cost of model steerability. The less sensitive the model is, the harder it is to teach it nuanced tasks! It clings to one "mode" of understanding.
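
If you want to see this sensitivity for yourself, a quick-and-dirty harness is enough. The sketch below is illustrative, not a rigorous eval: `generate` stands in for whatever model call you use, and the two templates are just examples of "reasonable" phrasings of the same task:

```python
import re

# Two "reasonable" templates for the same math task. A model aggressively
# post-trained on one of them can score very differently under the other.
TEMPLATES = {
    "boxed": "Solve the problem. Put the final answer in \\boxed{{}}.\n\nProblem: {q}",
    "plain": "Question: {q}\nThink step by step, then end with 'Answer: <number>'.",
}

def extract_answer(text):
    # Placeholder parser: take the last number-looking token in the output.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def accuracy(generate, problems, template):
    # `generate(prompt) -> str` stands in for your model call.
    # `problems` is a list of (question, gold_answer_string) pairs.
    correct = sum(extract_answer(generate(template.format(q=q))) == gold
                  for q, gold in problems)
    return correct / len(problems)

def prompt_sensitivity(generate, problems):
    # A large gap between these two numbers is exactly the sensitivity above.
    return {name: accuracy(generate, problems, t) for name, t in TEMPLATES.items()}
```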

Third, because of all this, every single result on the regular math/coding benchmarks with recent models that are *aggressively* mid- and post-trained to be good at those exact same benchmarks is essentially useless to me. I have been saying this publicly (and privately to my students) for ages.

Why the community, as a whole, chooses to continue publishing results that were essentially meaningless BEFORE THE PROJECT EVEN STARTED is beyond me. Simply do not work on areas that have this many confounders. It's not possible to do science like that, especially with a not-fully-open model. (BTW even an open model is closed. LLMs are trained, not programmed; they're too emergent.)

Fourth, what's the takeaway? The takeaway is that:

(1) RL on Qwen for math helps for spurious reasons because the model already knows this stuff, and just needs nudges to align with the downstream evaluation.

(2) But any effect of (1) above will be hugely exaggerated if there's even a slight mismatch between your (potentially EXTREMELY reasonable) prompt in your evals and the prompt used by the Qwen team. Is this your fault? IMO *no*, our community's only mistaken decision was sadly working on over-saturated meaningless math/coding benchmarks.

(Btw the meaninglessness is always with respect to a specific model. The same benchmark could be extremely meaningful if you pick up, idk, Llama 2 or something.)

(3) Is this the fault of the Qwen team? Well, idk. It's not like it's their job to make their model convenient-for-researchers-who-want-to-study-post-training.

(4) We have a mini-paradigm crisis here. A lot of y'all say things like "turns out this RL run was just aligning the model with the output format". Is this a bad thing or a good thing? It depends entirely on your frame. If the goal of the system is to be a programmatic component, then parsing and reliable presentation are actual goals. If the goal of the system is user-facing and "to be good at math", then yes, this is entirely a hack.
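
Here is what that distinction looks like as a metric, in a tiny illustrative sketch (the \boxed{} format is just an example of the kind of contract a programmatic component might be held to):

```python
import re

def parse_boxed(text):
    # The contract: the component must emit its final answer as \boxed{...}.
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def user_facing_metric(pred_text, gold):
    # "Is the model good at math?": credit for the right number appearing anywhere.
    return gold in re.findall(r"-?\d+(?:\.\d+)?", pred_text)

def component_metric(pred_text, gold):
    # "Is this a reliable programmatic component?": the output must parse
    # under the contract AND the parsed value must be correct.
    parsed = parse_boxed(pred_text)
    return parsed is not None and parsed == gold
```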

Which one is your goal? My guess is that most people haven't really thought about it.

So what's the verdict? First, don't work on saturated stuff, as a general rule. Just don't even try; too many confounders. Second, RL has shown its value for "downstream alignment": getting the plumbing to work and the pieces to align together for some downstream system configuration/reward.

For small models, "capability" gains so far seem to be entirely a function of mid-training, not the RL you're doing. For big models, we actually have no clear open evidence of anything just yet. All the noise you've been hearing for the last 6 months is vacuous under any scrutiny.

Sorry for the messy post (I've written all this in some 7 minutes somehow?) but hope this helps. Idk if a super quick-n-dirty rant like this is ever going to be useful for anyone. But if it can help 2-3 researchers "get" what's going on, I've succeeded. Good luck, all.
Aug 19, 2024 4 tweets 2 min read
Some personal news: I'm thrilled to have joined @Databricks @DbrxMosaicAI as a Research Scientist last month, before I start as MIT faculty in July 2025!

Expect increased investment into the open-source DSPy community, new research, & strong emphasis on production concerns 🧵.

There couldn't have been a better fit for me to postdoc at prior to MIT, given @Databricks':

- Track record of OSS impact like @ApacheSpark, @MLflow,
- Central role in Enterprise apps,
- Joining w/ @DbrxMosaicAI research,
- All the cool internal stuff they've been doing w/ DSPy!
Aug 19, 2024 5 tweets 2 min read
🧵What's next in DSPy 2.5? And DSPy 3.0?

I'm excited to share an early sketch of the DSPy Roadmap, a document we'll expand and maintain as more DSPy releases ramp up.

The goal is to communicate our objectives, milestones, & efforts and to solicit input—and help!—from everyone.

Roadmap:

To make LMs useful, we must shift from ad-hoc prompting to systematic programming. DSPy builds the abstractions, design patterns, optimizers—and community!—toward this goal.

We've come a long way since the core research started in Feb 2022: github.com/stanfordnlp/ds…
Jun 18, 2024 7 tweets 3 min read
🚨Announcing the largest study focused on *how* to optimize the prompts within LM programs, a key DSPy challenge.

Should we use LMs to… Craft instructions? Self-generate examples? Handle credit assignment? Specify a Bayesian model?

By @kristahopsalong* @michaelryan207* & team 🧵

📰:

We formally define the optimization problem and introduce several strategies, e.g. allowing the optimizer to:
1⃣ Browse the program & data
2⃣ Learn a mini-batch surrogate model to find promising combinations
3⃣ Meta-optimize how LMs write instructions! arxiv.org/abs/2406.11695
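
For concreteness, running an optimizer from this line of work in DSPy looks roughly like the sketch below. The program, metric, and data are placeholders, and exact class names and keyword arguments vary across DSPy versions:

```python
import dspy
from dspy.teleprompt import MIPROv2

# A tiny single-stage program; in practice this is a multi-stage pipeline
# whose instructions and demonstrations get optimized jointly.
class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.answer(question=question)

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# Placeholder data; real use needs a meaningful trainset.
trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]

# Requires dspy.settings.configure(lm=...) beforehand.
optimizer = MIPROv2(metric=exact_match)
optimized_qa = optimizer.compile(QA(), trainset=trainset)
```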
Jan 10, 2024 9 tweets 3 min read
The LM dev stack will soon evolve dramatically.

To start, @hwchase17 & I tackle interoperability: Now LangChain users can compile LCEL chains with DSPy optimizers⚡️

To illustrate, we compile LCEL for a long-form RAG app, delivering 10-20% higher output quality.

Let's dive in⤵️

The DSPy vision is to push for a sane **stack** of lower- to higher-level LM frameworks, learning from the DNN abstraction space.

Think PyTorch <> HF Transformers, but for LM apps.

This collab w @hwchase17 is the beginning of a lot in this space, and we're looking for feedback!
Dec 21, 2023 5 tweets 2 min read
There's an important missing perspective in the "GPT-4 is still unmatched" conversation:

It's a process (of good engineering at scale), not some secret sauce.

To understand, let's go back to the 2000s/2010s, when the gap between "open" IR and closed Google Search grew very large. 🧵

Note: I have no inside OpenAI info & I'm uninterested in individual LMs. Expressive power lies in the *program* wielding the LM.

As commercial search matured, the IR field went from a very active and hot area of research circa 2000 to a much less active one 2004ish through 2018⤵️
Dec 20, 2023 14 tweets 5 min read
A🧵on beating the hardware lottery for retrieval: the internals of the late interaction stack.

ColBERT introduced a quirky multi-vector retrieval architecture. It does wonders for quality.

But how can it search 100M docs in 0.1 sec on CPU? Or store 1 billion embeddings in 20GB?

This is thread #2 (out of three) on late interaction.

We'll discuss algorithms & infrastructure for efficient retrieval, ones never presented together before.
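
Before the full machinery, here's the core efficiency idea as a toy numpy sketch: cluster document token embeddings into centroids, keep an inverted list from centroids to documents, and only score documents whose tokens live near the query tokens' nearest centroids. This is a simplification in the spirit of the real stack, not its implementation:

```python
import numpy as np

def build_index(doc_token_embs, doc_ids, num_centroids=1024, seed=0):
    """doc_token_embs: (n_tokens, dim) L2-normalized; doc_ids: (n_tokens,) doc id per token."""
    rng = np.random.default_rng(seed)
    # Toy stand-in for k-means: sample token embeddings as centroids.
    centroids = doc_token_embs[rng.choice(len(doc_token_embs), num_centroids, replace=False)]
    assignments = np.argmax(doc_token_embs @ centroids.T, axis=1)
    # Inverted lists: centroid -> the set of documents with a token assigned to it.
    postings = [set() for _ in range(num_centroids)]
    for centroid, doc in zip(assignments, doc_ids):
        postings[centroid].add(int(doc))
    return centroids, postings

def candidate_docs(query_token_embs, centroids, postings, nprobe=4):
    # For each query token, probe its nearest centroids and union the posting lists.
    sims = query_token_embs @ centroids.T          # (q_tokens, num_centroids)
    nearest = np.argsort(-sims, axis=1)[:, :nprobe]
    cands = set()
    for row in nearest:
        for c in row:
            cands |= postings[c]
    return cands  # only these documents get full late-interaction scoring
```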

In thread #1, I covered the motivation, scoring function, and early results of ColBERT:

Dec 18, 2023 13 tweets 4 min read
Progress on dense retrievers is saturating.

The best retrievers in 2024 will apply new forms of late interaction, i.e. scalable attention-like scoring for multi-vector embeddings.

A🧵on late interaction, how it works efficiently, and why/where it's been shown to improve quality.

Say you have 1M documents. With infinite GPUs, what would your retriever look like?

Maybe a cross-encoder? Finetune a large LM to take <query,doc> pairs. Run it 1M times to get a score for all docs.

Expressive! Given a query, the LM can pay attention to every detail in the doc!
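
Late interaction is the practical middle ground this thread builds toward: per-token embeddings scored with a cheap MaxSim sum. A minimal numpy sketch of that scoring function (illustrative; real systems add compression and candidate pruning on top):

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late interaction score.

    query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim),
    each row L2-normalized. Score = sum over query tokens of the max
    similarity to any document token.
    """
    sims = query_embs @ doc_embs.T      # (Q, D) token-to-token similarities
    return sims.max(axis=1).sum()       # MaxSim per query token, then sum

def rank(query_embs, all_doc_embs):
    # Score every candidate document and return indices sorted best-first.
    scores = np.array([maxsim_score(query_embs, d) for d in all_doc_embs])
    return np.argsort(-scores), scores
```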
Dec 16, 2023 9 tweets 2 min read
Actionable, extended analogies are underrated.

Here's one: LMs are stochastic *devices* for pattern processing. Like CPUs or GPUs, but Language Processing Units (LPUs).

Assembly? Higher-Level Languages? Compilers?

Understanding this analogy resolves several major questions. 🧵

[A] Prompt Engineering

If LPUs are devices, then prompting is coding in assembly.

(1) The LPU understands low-level prompts directly.

(2) They're error-prone and hard to hand-optimize.

(3) Different LPUs have different assembly (ISAs). Prompts NOT compatible across LMs!
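
To see the gap in code: the raw string below is "assembly" (my own illustrative prompt, tied to one LPU's quirks), while the signature is the higher-level form a framework can retarget and optimize per LPU:

```python
import dspy

# "Assembly": a hand-written prompt string, tied to one LPU's quirks,
# output format, and phrasing. (This exact string is just an illustration.)
RAW_PROMPT = (
    "You are a helpful assistant. Read the passage and answer the question.\n"
    "Passage: {passage}\nQuestion: {question}\nAnswer:"
)

# "Higher-level language": declare the task; let the framework render and
# optimize the low-level prompt for whichever LPU you configure.
class PassageQA(dspy.Signature):
    """Answer the question using the passage."""
    passage = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.Predict(PassageQA)
# Swapping the underlying LM (via dspy.settings.configure) doesn't require
# rewriting RAW_PROMPT-style strings by hand.
```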
Oct 11, 2023 11 tweets 4 min read
A cool thread yesterday used GPT4 ($50), a 500-word ReAct prompt, and ~400 lines of code to finetune Llama2-7B to get 26% HotPotQA EM.

Let's use 30 lines of DSPy—without any hand-written prompts or any calls to OpenAI ($0)—to teach a 9x smaller T5 (770M) model to get 39% EM!

🧵

If you want to do this in a notebook with step-by-step code:



Otherwise, follow along w/ slightly simplified code!

First, we set the default retriever as ColBERT and the default LM as Llama2-13b. We'll use Llama (not GPT4!) to teach T5-Large (770M). github.com/stanfordnlp/ds…
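
The rough shape of the recipe, heavily condensed and not the thread's exact 30 lines (optimizer arguments vary by DSPy version, and the final T5 fine-tuning step is only indicated in comments):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Multi-hop-ish QA program: write a query, retrieve, answer from context.
class MultiHopQA(dspy.Module):
    def __init__(self, k=3):
        super().__init__()
        self.gen_query = dspy.ChainOfThought("question -> search_query")
        self.retrieve = dspy.Retrieve(k=k)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        query = self.gen_query(question=question).search_query
        context = self.retrieve(query).passages
        return self.answer(context=context, question=question)

def answer_em(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# Placeholder trainset; the thread uses HotPotQA training questions.
trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]

# With Llama as the configured LM, bootstrap demonstrations by running the
# program on training questions and keeping traces the metric accepts.
optimizer = BootstrapFewShot(metric=answer_em)
compiled = optimizer.compile(MultiHopQA(), trainset=trainset)
# The thread then distills the bootstrapped traces into T5-Large (770M),
# i.e., fine-tunes the small model on data the program generated itself.
```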
Jan 24, 2023 8 tweets 4 min read
Introducing Demonstrate–Search–Predict (𝗗𝗦𝗣), a framework for composing search and LMs w/ up to 120% gains over GPT-3.5.

No more prompt engineering.❌

Describe a high-level strategy as imperative code and let 𝗗𝗦𝗣 deal with prompts and queries.🧵

arxiv.org/abs/2212.14024

Instead of crafting a prompt for the LM, you write a short 𝗗𝗦𝗣 program that assigns small tasks to the LM and a retrieval model (RM) in deliberate, powerful pipelines.

Simple 𝗗𝗦𝗣 programs outperform GPT-3.5, retrieve-&-read, and self-ask by up to 𝟭𝟮𝟬%, 𝟰𝟬%, and 𝟮𝟵𝟬%, respectively.
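
Schematically, and purely as illustrative pseudo-API rather than the actual 𝗗𝗦𝗣 library, a Demonstrate–Search–Predict program reads like this:

```python
from typing import Callable, List

def dsp_pipeline(
    question: str,
    generate: Callable[[str], str],             # stand-in for the LM
    retrieve: Callable[[str, int], List[str]],  # stand-in for the retrieval model (RM)
    demos: List[str],                           # Demonstrate: bootstrapped worked examples
) -> str:
    demo_block = "\n\n".join(demos)

    # Search: ask the LM to write a query, then retrieve supporting passages.
    query = generate(f"{demo_block}\n\nWrite a search query for: {question}")
    passages = retrieve(query, 5)

    # Predict: answer grounded in the retrieved passages.
    context = "\n".join(passages)
    return generate(f"{demo_block}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:")
```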