If it's not obvious, these 2 lines do the work of 60 lines, but better. They come with a lot of implicit functionality for reliability, portability, and readiness for optimization.
I don't think the 2-liner is special. I think the 60-liners are dumb.
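To be concrete, here's a minimal sketch of the kind of 2-liner I mean, assuming a DSPy-style setup (the model name is an illustrative placeholder):

```python
import dspy

# Declare the intent; prompt construction, output parsing, retries, and
# portability across models all live below this abstraction.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model
qa = dspy.ChainOfThought("question -> answer")
```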
# On why I dislike the name "multi-agent" for LLM systems.
When you talk about these systems as "multi-agent", you evoke free-form social, communication, and goal-alignment problems that don't even need to exist.
You're just architecting a single functional system, not a society.
If you're doing this properly, you're defining *structured* contracts between the modules, and you control (or delegate!) information flow. You ensure each module has access to all information/tools it may need, and ideally nothing else.
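As a sketch of what such a contract can look like in DSPy (the class and field names are illustrative), each module's signature declares exactly what it consumes and produces:

```python
import dspy

class GenerateQuery(dspy.Signature):
    """Write a search query that would help answer the question."""
    question: str = dspy.InputField()
    query: str = dspy.OutputField()

class AnswerFromContext(dspy.Signature):
    """Answer the question using only the retrieved context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
```

The "communication" between modules is just this typed data flow, which you control end to end.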
It's a new type of software, of course, and we need a lot of empirical insights and good tooling! I know that, I've worked on nothing but these types of systems for close to 6 years.
But what's involved is good system architecture and all the decisions *that* entails, with highly structured module contracts, which are fuzzy in *just* the right places. Most of the time, it need not become some kind of social coordination between employees/agents who have conflicting goals.
Now, here's the BIG catch. A lot of the hard-coded architecture decisions here are (necessarily) myopic, because they depend on ephemeral things like "how long today's model's context window is" or "how good the model is at dividing tasks".
This is why we need to think about programming languages and query languages that decouple our fundamental intent (and information flow) from the lower-level tricks that make them work.
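A rough sketch of that decoupling, in DSPy terms (model names are placeholders): the program pins down the intent and information flow, while everything underneath is free to change.

```python
import dspy

# The program states *what* should happen: intent + information flow.
qa = dspy.ChainOfThought("question -> answer")

# The lower-level choices are swappable without touching the program:
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))         # today's model...
# dspy.configure(lm=dspy.LM("some/longer-context-model"))  # ...or tomorrow's

# Prompts and demonstrations can be produced by an optimizer rather than
# hard-coded, so they can track whatever today's model happens to need.
```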
This is not new: Modularity in conventional programming languages is an illusion that helps the *programmer* be productive. It's how we can reason about and maintain big systems.
Under the hood, any compiler worth its salt breaks down much of this modularity illusion: function inlining, dead code elimination, instruction fusion, vectorizing work into SIMD and parallelizing other tasks, etc.
Hope this helps!
tl;dr It’s programming, not policy or management.
(I do actually think there’s a place for policy/management-style abstractions. It’s just not appropriate here and not what any of these systems are doing in the first place.)
Let me just give you the full nuance in one stream of consciousness, since I think we'll otherwise continue to get partial interpretations that confuse everyone. All the little things I post need to be put together in one place.
First, I have long advised from this account to simply avoid the benchmarks that the LLM trainers/providers work with. They're vastly more contaminated than you think. They're contaminated not only in terms of the task distribution, and not only in terms of train vs test, but actually also in terms of *the prompts that work*.
Second, let me elaborate on this prompts part. When a team trains an LLM to be good at math, they aren't just "making the LLM good at math". They're usually working with the same prompt template they'll use for evaluation. Because LLMs are statistical models, not logical computers, learning to be good at math in one prompt template does NOT mean that the model is good at math in "every reasonable prompt".
Not at all. It's still super sensitive. You may think prompt sensitivity is a thing from 2022, but no, it's alive and well and has never been more severe. Good LLM post-training can alleviate this a bit, but it sometimes comes at the cost of model steerability: the less sensitive the model is, the harder it is to teach it nuanced tasks! It clings to one "mode" of understanding.
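To make that concrete, here's a hypothetical sketch (the templates and the `llm` callable are placeholders, not any team's actual setup):

```python
# Two equally "reasonable" templates for the same math questions.
TRAIN_TEMPLATE = "Question: {q}\nAnswer:"             # matches training/eval
OTHER_TEMPLATE = "Solve the following problem.\n{q}"  # unseen but sensible

def accuracy(llm, template, dataset):
    correct = 0
    for question, gold in dataset:
        prediction = llm(template.format(q=question)).strip()
        correct += int(prediction == gold)
    return correct / len(dataset)

# Because the model is a statistical object, accuracy under TRAIN_TEMPLATE
# can far exceed accuracy under OTHER_TEMPLATE with no change in "math
# ability". That gap is the prompt sensitivity described above.
```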
Third, because of all this, every single result on the regular math/coding benchmarks from recent models that are *aggressively* mid- and post-trained to be good at those exact benchmarks is essentially useless to me. I have been saying this publicly (and privately to my students) for ages.
Why the community, as a whole, chooses to keep publishing results that were essentially meaningless BEFORE THE PROJECT EVEN STARTED is beyond me. Simply do not work on areas that have this many confounders. It's not possible to do science like that, especially with a not-fully-open model. (BTW, even an open model is closed: LLMs are trained, not programmed; they're too emergent.)
Fourth, what's the takeaway? The takeaway is that:
(1) RL on Qwen for math helps for spurious reasons because the model already knows this stuff, and just needs nudges to align with the downstream evaluation.
(2) But any effect of (1) above will be hugely exaggerated if there's even a slight mismatch between your (potentially EXTREMELY reasonable) prompt in your evals and the prompt used by the Qwen team. Is this your fault? IMO *no*, our community's only mistaken decision was sadly working on over-saturated meaningless math/coding benchmarks.
(Btw, the meaninglessness is always with respect to a specific model. The same benchmark could be extremely meaningful if you pick up, idk, Llama 2 or something.)
(3) Is this the fault of the Qwen team? Well, idk. It's not like it's their job to make their model convenient-for-researchers-who-want-to-study-post-training.
(4) We have a mini-paradigm crisis here. A lot of y'all say things like "turns out this RL run was just aligning the model with the output format". Is this a bad thing or a good thing? It depends entirely on your frame. If the goal of the system is to be a programmatic component, then parsing and reliable presentation is an actual goal (see the sketch after this list). If the goal of the system is user-facing and "to be good at math", then yes, this is entirely a hack.
Which one is your goal? My guess is that most people haven't really thought about it.
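Here's a minimal sketch of the programmatic-component frame, where parsing is part of the contract (the "Answer: <number>" convention is hypothetical):

```python
import re

def extract_answer(completion: str) -> str | None:
    """Parse a final numeric answer of the form 'Answer: <number>'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

# If the caller is a program, a model that reliably emits "Answer: 42" is
# measurably more useful, even if it learned zero new math along the way.
```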
So what's the verdict? First, don't work on saturated stuff, as a general rule. Just don't even try; too many confounders. Second, RL has shown its value for "downstream alignment": getting the plumbing to work and the pieces to fit together for some downstream system configuration/reward.
For small models, "capability" gains so far seem to be entirely a function of mid-training, not the RL you're doing. For big models, we actually have no clear open evidence of anything just yet. All the noise you've been hearing for the last 6 months is vacuous under any scrutiny.
Sorry for the messy post (I've written all this in some 7 mins somehow?) but hope this helps.
Idk if a super quick-n-dirty rant like this is ever going to be useful for anyone. But if it can help 2-3 researchers "get" what's going on, I've succeeded. Good luck, all.
I really, really don't want to turn this science post into an "ad" for my research, but for context: this is why I spend a stupid amount of time on ABSTRACTIONS for LLMs.
It helps us eliminate a lot of confounders when we standardize a lot of this stuff! A bitter lesson to hold.
Some personal news: I'm thrilled to have joined @Databricks @DbrxMosaicAI as a Research Scientist last month, before I start as MIT faculty in July 2025!
Expect increased investment into the open-source DSPy community, new research, & strong emphasis on production concerns 🧵.
There couldn't have been a better fit for me to postdoc at prior to MIT, given @Databricks':
- Track record of OSS impact like @ApacheSpark, @MLflow,
- Central role in Enterprise apps,
- Joining w/ @DbrxMosaicAI research,
- All the cool internal stuff they've been doing w/ DSPy!
How will the DSPy OSS team look?
Exactly the same — with help from the Databricks OSS team, working with the incredible core team from Anyscale, Normal, Dashworks, Zenbase, Weaviate as well as Stanford, Berkeley, CMU, Ghent, IIT-B, Waterloo, & soon MIT.
I'm excited to share an early sketch of the DSPy Roadmap, a document we'll expand and maintain as more DSPy releases ramp up.
The goal is to communicate our objectives, milestones, & efforts and to solicit input—and help!—from everyone.
Roadmap:
To make LMs useful, we must shift from ad-hoc prompting to systematic programming. DSPy builds the abstractions, design patterns, optimizers—and community!—toward this goal.
But we still have plenty to work on! Upcoming DSPy releases will:
1. Polish the core functionality.
2. Develop more accurate, lower-cost optimizers.
3. Build end-to-end tutorials from DSPy’s ML workflow to deployment.
4. Shift towards more interactive optimizers & tracking.
🚨Announcing the largest study focused on *how* to optimize the prompts within LM programs, a key DSPy challenge.
Should we use LMs to… Craft instructions? Self-generate examples? Handle credit assignment? Specify a Bayesian model?
By @kristahopsalong* @michaelryan207* & team 🧵
📰:
We formally define the optimization problem and introduce several strategies, e.g. allowing the optimizer to:
1⃣ Browse the program & data
2⃣ Learn a mini-batch surrogate model to find promising combinations
3⃣ Meta-optimize how LMs write instructions! arxiv.org/abs/2406.11695
We compose 6 strategies into DSPy optimizers. We evaluate them on LangProBe, the first LM Program Benchmark, designed around hypotheses drawn from what we see DSPy users do in the wild.
We derive 5 key lessons from this, with one major outcome: a new best DSPy optimizer, MIPROv2!
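If you want to try it, here's a minimal usage sketch (the model, metric, and tiny trainset are illustrative placeholders, not the paper's setup):

```python
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

program = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples in practice
]

def exact_match(example, prediction, trace=None):
    # Toy metric: exact string match on the answer field.
    return example.answer == prediction.answer

optimizer = MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```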