If it's not obvious, these 2 lines do the work of 60 lines, but better. They come with a lot of implicit functionality for reliability, portability, and readiness for optimization.
I don't think the 2-liner is special. I think the 60-liners are dumb.
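To be concrete, here's a minimal sketch of the kind of 2-liner I mean, assuming a DSPy-style setup (the model name is an illustrative placeholder):

```python
import dspy

# Declare the intent; prompt construction, output parsing, retries, and
# portability across models all live below this abstraction.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model
qa = dspy.ChainOfThought("question -> answer")
```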
# On why I dislike the name "multi-agent" for LLM systems.
When you talk about these systems as "multi-agent", you evoke free-form social, communication, and goal-alignment problems that don't even need to exist.
You're just architecting a single functional system, not a society.
If you're doing this properly, you're defining *structured* contracts between the modules, and you control (or delegate!) information flow. You ensure each module has access to all information/tools it may need, and ideally nothing else.
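As a sketch of what such a contract can look like in DSPy (the class and field names are illustrative), each module's signature declares exactly what it consumes and produces:

```python
import dspy

class GenerateQuery(dspy.Signature):
    """Write a search query that would help answer the question."""
    question: str = dspy.InputField()
    query: str = dspy.OutputField()

class AnswerFromContext(dspy.Signature):
    """Answer the question using only the retrieved context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
```

The "communication" between modules is just this typed data flow, which you control end to end.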
It's a new type of software, of course, and we need a lot of empirical insights and good tooling! I know that, I've worked on nothing but these types of systems for close to 6 years.
But what's involved is good system architecture and all the decisions *that* entails, with highly structured module contracts, which are fuzzy in *just* the right places. Most of the time, it need not become some kind of social coordination between employees/agents who have conflicting goals.
Now, here's the BIG catch. A lot of the hard-coded architecture decisions here are (necessarily) myopic, because they depend on ephemeral things like "how long today's model's context window is" or "how good the model is at dividing tasks".
This is why we need to think about programming languages and query languages that decouple our fundamental intent (and information flow) from the lower-level tricks that make them work.
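A rough sketch of that decoupling, in DSPy terms (model names are placeholders): the program pins down the intent and information flow, while everything underneath is free to change.

```python
import dspy

# The program states *what* should happen: intent + information flow.
qa = dspy.ChainOfThought("question -> answer")

# The lower-level choices are swappable without touching the program:
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))         # today's model...
# dspy.configure(lm=dspy.LM("some/longer-context-model"))  # ...or tomorrow's

# Prompts and demonstrations can be produced by an optimizer rather than
# hard-coded, so they can track whatever today's model happens to need.
```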
This is not new: Modularity in conventional programming languages is an illusion that helps the *programmer* be productive. It's how we can reason about and maintain big systems.
Under the hood, any compiler worth its salt breaks down much of this modularity illusion: function inlining, dead code elimination, instruction fusion, vectorizing work into SIMD and parallelizing other tasks, etc.
Hope this helps!
tl;dr It’s programming, not policy or management.
(I do actually think there’s a place for policy/management-style abstractions. It’s just not appropriate here and not what any of these systems are doing in the first place.)
Let me just give you the full nuance in one stream of consciousness, since I think we'll otherwise continue to get partial interpretations that confuse everyone. All the little things I post need to be put together in one place.
First, I have long advised from this account to simply avoid the benchmarks that the LLM trainers/providers work with. They're vastly more contaminated than you think. They're contaminated not only in terms of the task distribution, and not only in terms of train vs test, but actually also in terms of *the prompts that work*.
Second, let me elaborate on this prompts part. When a team trains an LLM to be good at math, they aren't just "making the LLM good at math". They're usually working with the same prompt template they'll use for evaluation. Because LLMs are statistical models, not logical computers, learning to be good at math in one prompt template does NOT mean that the model is good at math in "every reasonable prompt".
Not at all. It's still super sensitive. You may think prompt sensitivity is a thing from 2022, but no, it's alive and well and has never been more severe. Good LLM post-training can alleviate this a bit, but it sometimes comes at the cost of model steerability: the less sensitive the model is, the harder it is to teach it nuanced tasks! It clings to one "mode" of understanding.
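To make that concrete, here's a hypothetical sketch (the templates and the `llm` callable are placeholders, not any team's actual setup):

```python
# Two equally "reasonable" templates for the same math questions.
TRAIN_TEMPLATE = "Question: {q}\nAnswer:"             # matches training/eval
OTHER_TEMPLATE = "Solve the following problem.\n{q}"  # unseen but sensible

def accuracy(llm, template, dataset):
    correct = 0
    for question, gold in dataset:
        prediction = llm(template.format(q=question)).strip()
        correct += int(prediction == gold)
    return correct / len(dataset)

# Because the model is a statistical object, accuracy under TRAIN_TEMPLATE
# can far exceed accuracy under OTHER_TEMPLATE with no change in "math
# ability". That gap is the prompt sensitivity described above.
```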
Third, because of all this, every single result on the regular math/coding benchmarks from recent models that are *aggressively* mid- and post-trained to be good at those exact benchmarks is essentially useless to me. I have been saying this publicly (and privately to my students) for ages.
Why the community, as a whole, chooses to keep publishing results that were essentially meaningless BEFORE THE PROJECT EVEN STARTED is beyond me. Simply do not work on areas that have this many confounders. It's not possible to do science like that, especially with a not-fully-open model. (BTW, even an open model is closed: LLMs are trained, not programmed; they're too emergent.)
Fourth, what's the takeaway? The takeaway is that:
(1) RL on Qwen for math helps for spurious reasons because the model already knows this stuff, and just needs nudges to align with the downstream evaluation.
(2) But any effect of (1) above will be hugely exaggerated if there's even a slight mismatch between your (potentially EXTREMELY reasonable) prompt in your evals and the prompt used by the Qwen team. Is this your fault? IMO *no*, our community's only mistaken decision was sadly working on over-saturated meaningless math/coding benchmarks.
(Btw, the meaninglessness is always with respect to a specific model. The same benchmark could be extremely meaningful if you pick up, idk, Llama 2 or something.)
(3) Is this the fault of the Qwen team? Well, idk. It's not like it's their job to make their model convenient-for-researchers-who-want-to-study-post-training.
(4) We have a mini-paradigm crisis here. A lot of y'all say things like "turns out this RL run was just aligning the model with the output format". Is this a bad thing or a good thing? It depends entirely on your frame. If the goal of the system is to be a programmatic component, then parsing and reliable presentation is an actual goal (see the sketch after this list). If the goal of the system is user-facing and "to be good at math", then yes, this is entirely a hack.
Which one is your goal? My guess is that most people haven't really thought about it.
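Here's a minimal sketch of the programmatic-component frame, where parsing is part of the contract (the "Answer: <number>" convention is hypothetical):

```python
import re

def extract_answer(completion: str) -> str | None:
    """Parse a final numeric answer of the form 'Answer: <number>'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

# If the caller is a program, a model that reliably emits "Answer: 42" is
# measurably more useful, even if it learned zero new math along the way.
```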
So what's the verdict? First, don't work on saturated stuff, as a general rule. Just don't even try; too many confounders. Second, RL has shown its value for "downstream alignment": getting the plumbing to work and the pieces to fit together for some downstream system configuration/reward.
For small models, "capability" gains so far seem to be entirely a function of mid-training, not the RL you're doing. For big models, we actually have no clear open evidence of anything just yet. All the noise you've been hearing for the last 6 months is vacuous under any scrutiny.
Sorry for the messy post (I've written all this in some 7 mins somehow?) but hope this helps.
Idk if a super quick-n-dirty rant like this is ever going to be useful for anyone. But if it can help 2-3 researchers "get" what's going on, I've succeeded. Good luck, all.
I really, really don't want to turn this science post into an "ad" for my research, but for context: this is why I spend a stupid amount of time on ABSTRACTIONS for LLMs.
It helps us eliminate a lot of confounders when we standardize a lot of this stuff! A bitter lesson to hold.
Some personal news: I'm thrilled to have joined @Databricks @DbrxMosaicAI as a Research Scientist last month, before I start as MIT faculty in July 2025!
Expect increased investment into the open-source DSPy community, new research, & strong emphasis on production concerns 🧵.
There couldn't have been a better fit for me to postdoc at prior to MIT, given @Databricks':
- Track record of OSS impact like @ApacheSpark, @MLflow,
- Central role in Enterprise apps,
- Joining w/ @DbrxMosaicAI research,
- All the cool internal stuff they've been doing w/ DSPy!
How will the DSPy OSS team look?
Exactly the same — with help from the Databricks OSS team, working with the incredible core team from Anyscale, Normal, Dashworks, Zenbase, Weaviate as well as Stanford, Berkeley, CMU, Ghent, IIT-B, Waterloo, & soon MIT.
I'm excited to share an early sketch of the DSPy Roadmap, a document we'll expand and maintain as more DSPy releases ramp up.
The goal is to communicate our objectives, milestones, & efforts and to solicit input—and help!—from everyone.
Roadmap:
To make LMs useful, we must shift from ad-hoc prompting to systematic programming. DSPy builds the abstractions, design patterns, optimizers—and community!—toward this goal.
But we still have plenty to work on! Upcoming DSPy releases will:
1. Polish the core functionality.
2. Develop more accurate, lower-cost optimizers.
3. Build end-to-end tutorials from DSPy’s ML workflow to deployment.
4. Shift towards more interactive optimizers & tracking.
🚨Announcing the largest study focused on *how* to optimize the prompts within LM programs, a key DSPy challenge.
Should we use LMs to… Craft instructions? Self-generate examples? Handle credit assignment? Specify a Bayesian model?
By @kristahopsalong* @michaelryan207* & team 🧵
📰:
We formally define the optimization problem and introduce several strategies, e.g. allowing the optimizer to:
1⃣ Browse the program & data
2⃣ Learn a mini-batch surrogate model to find promising combinations
3⃣ Meta-optimize how LMs write instructions! arxiv.org/abs/2406.11695
We compose 6 strategies into DSPy optimizers. We evaluate them on LangProBe, the first LM Program Benchmark, designed around hypotheses drawn from what we see DSPy users do in the wild.
We derive 5 key lessons from this, with one major outcome: a new best DSPy optimizer, MIPROv2!
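If you want to try it, here's a minimal usage sketch (the model, metric, and tiny trainset are illustrative placeholders, not the paper's setup):

```python
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

program = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples in practice
]

def exact_match(example, prediction, trace=None):
    # Toy metric: exact string match on the answer field.
    return example.answer == prediction.answer

optimizer = MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```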