How should we maximize the planning ability of #LLM while reducing the computation cost? 🚀 Introducing SwiftSage, an agent inspired by “fast & slow thinking”, which solves complex interactive tasks much better than prior agents (e.g., DRRN, SayCan, ReAct, and Reflexion). [1/n]
💡 Let’s compare SwiftSage w/ prior agents: SayCan reranks actions w/ affordance; ReAct adds subgoal planning; Reflexion adds self-reflection. However, these methods can be expensive yet brittle, and their error-prone actions/plans are hard to execute & ground in the env. [2/n]
🌠 A closer look at the 2 parts of SwiftSage: The Swift is a small LM (770M) for fast thinking; it learns the target env through imitation learning. The Sage prompts LLMs for slow thinking in two stages, plan & ground, producing an action buffer for interacting w/ the env. [3/n]
✨ SwiftSage’s features: 1⃣️ Use imitation learning to train a small LM for fast thinking. 2⃣️ Only prompt LLMs when needed (e.g., no reward after 5 steps). 3⃣️ Separate planning and grounding of subgoals when prompting LLMs. 4⃣️ Get multiple actions (~5) per LLM call. See the sketch of the switching loop below. [4/n]
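To make the fast & slow switching concrete, here is a minimal sketch of the control loop in Python. All names (env, swift_lm, sage, plan/ground) are hypothetical, not the actual SwiftSage API; the trigger condition follows the "no reward after 5 steps" rule above.

```python
from collections import deque

def run_episode(env, swift_lm, sage, max_steps=100, patience=5):
    """Hypothetical SwiftSage-style control loop (illustrative, not the official API).

    swift_lm: small LM trained with imitation learning (fast thinking).
    sage:     wrapper that prompts an LLM in two stages, plan -> ground,
              returning a short buffer of grounded actions (slow thinking).
    """
    obs, reward = env.reset(), 0.0
    action_buffer = deque()          # actions produced by the slow Sage module
    steps_without_reward = 0

    for _ in range(max_steps):
        if action_buffer:
            action = action_buffer.popleft()      # drain Sage's plan first
        elif steps_without_reward >= patience:
            # Slow thinking: plan subgoals, then ground them into ~5 env actions.
            plan = sage.plan(obs)
            action_buffer.extend(sage.ground(plan, obs))
            action = action_buffer.popleft()
            steps_without_reward = 0
        else:
            action = swift_lm.predict(obs)        # fast thinking, cheap

        obs, reward, done = env.step(action)
        steps_without_reward = 0 if reward > 0 else steps_without_reward + 1
        if done:
            break
```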
🏆 We evaluate on ScienceWorld, a text-based environment with 30 task types, 10 locations, 200+ objects, and 25 actions. Its tasks can be super complex and long-horizon, and they require exception handling. SwiftSage scores 2x higher and costs much less than prior agents! [5/n]
If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper — we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits.
🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs) structured as logic grid puzzles. Designed for precise control over complexity and generalization, it serves as an evaluation framework for testing LLMs on non-monotonic reasoning. Also, its complexity metrics—search space size and number of Z3 conflicts—enable the study of scaling behavior in reasoning models.
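For intuition on the search-space metric: assuming each of the M attributes assigns its N distinct values one-to-one to the N houses (a permutation per attribute), an N×M puzzle has (N!)^M candidate solutions before any clue is applied. A quick sketch:

```python
from math import factorial

def search_space_size(n_houses: int, m_attributes: int) -> int:
    # Each attribute is a permutation of its N values over the N houses,
    # so the unconstrained space is (N!)^M.
    return factorial(n_houses) ** m_attributes

print(search_space_size(3, 3))  # 216
print(search_space_size(6, 6))  # 139,314,069,504,000,000 (grows explosively)
```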
Some key findings below 👇
1️⃣ Scaling model size alone won’t break the curse of complexity. Larger models improve only on very easy problems; once difficulty crosses a threshold, even a 405B model’s accuracy drops to nearly zero.
2️⃣ Scaling test-time compute (longer CoTs) is the most promising approach. The gap between regular LLMs and reasoning-optimized ones is huge. But even the best reasoning models degrade sharply as problem complexity increases.
3️⃣ Strong reasoning models scale their CoT token usage with the number of Z3 conflicts remarkably well; for o1-full the correlation is almost linear. Since each Z3 conflict in a CSP corresponds to some form of backtracking, this helps explain why models like o1/R1 do so well on non-monotonic reasoning problems.
4️⃣ Repeated sampling with reward models (RMs) or self-verification prompting? Not effective. We tried and tested both; neither worked well.
We hope 🦓 ZebraLogic can facilitate future work on RL for reasoning LLMs and benefit awesome open research!
Links to the paper, dataset, and leaderboard are in the thread. We also share example details and data-creation insights there.
🧵 We show that synthetic logical puzzles are ideal for studying reasoning LLMs like o1 & R1. [1/n]
Each ZebraLogic example is a logic grid puzzle with a background scenario and a set of clues that define constraints for assigning values to N houses across M attributes. A reasoning model is tasked with producing a final solution table, correctly filling in all value assignments. Smaller grids or straightforward clues make the problem easier, but complexity grows rapidly with larger grids or trickier clue structures.
To generate these tasks, we use a variety of attributes, values, and language templates. Each puzzle is confirmed to have a unique solution, and its difficulty is gauged by running a Z3 solver multiple times, measuring the average number of conflicts it resolves per problem. These conflicts force reasoning models—like o1 or R1—to engage in backtracking, self-verification, error correction, process of elimination, and other branching strategies, mimicking the "Wait, let me check again" moments of human reasoning.
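To make the difficulty metric concrete, here is a minimal z3py sketch: a toy 3-house, 2-attribute puzzle encoded as integer constraints, with the solver's "conflicts" statistic read back after solving. The toy puzzle and helper names are illustrative, not the actual data-generation pipeline; real ZebraLogic instances are graded by averaging conflicts over repeated solver runs.

```python
from z3 import Solver, Int, Distinct, And, sat

def count_conflicts(constraints):
    """Solve once and return z3's 'conflicts' statistic (0 if none reported)."""
    s = Solver()
    s.add(constraints)
    assert s.check() == sat
    stats = s.statistics()
    # z3 only reports 'conflicts' when it actually had to backtrack.
    return {k: stats.get_key_value(k) for k in stats.keys()}.get("conflicts", 0)

# Toy 3x2 grid: colors and pets are each assigned to houses 1..3, one per house.
red, green, blue = Int("red"), Int("green"), Int("blue")
cat, dog, fish = Int("cat"), Int("dog"), Int("fish")

constraints = [
    And(1 <= v, v <= 3) for v in (red, green, blue, cat, dog, fish)
] + [
    Distinct(red, green, blue),  # each color in a different house
    Distinct(cat, dog, fish),    # each pet in a different house
    red == cat + 1,              # clue: the red house is right of the cat owner
    green != 2,                  # clue: the green house is not in the middle
    dog == 3,                    # clue: the dog lives in the last house
]

# Tiny puzzles may be solved by pure propagation (0 conflicts).
print(count_conflicts(constraints))
```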
Though synthetic, these puzzles mirror constraint-satisfaction patterns found in everyday reasoning tasks such as scheduling, resource allocation, or travel planning. We’re also planning to create new & harder tasks so that the evaluation stays alive and useful. Please stay tuned!
[2/n]
On ZebraLogic leaderboard, @deepseek_ai 's R1 is quite close to o1-full—performing slightly better on simpler tasks but falling behind on the extremely difficult ones. The other non-reasoning LLMs lag significantly in this case. However, their rankings on small-to-medium-sized tasks still provide valuable insight into their potential for future RL optimization for reasoning.
We've been re-evaluating LLMs with a unified setup, controlling factors such as prompting, sampling, and output parsing. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. The two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]
In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted response. We are actively adding new tasks; contributions are welcome! [2/n]
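For a sense of what this looks like in practice, here is a minimal sketch of a ZeroEval-style zero-shot prompt and the strict JSON parsing step. The template wording and field names are illustrative assumptions, not ZeroEval's exact ones.

```python
import json

PROMPT_TEMPLATE = """Answer the following question. Think step by step, then
respond with ONLY a JSON object of the form:
{{"reasoning": "<your step-by-step reasoning>", "answer": "<final answer>"}}

Question: {question}"""

def parse_response(text: str):
    """Extract the JSON object from a model response; return None if malformed."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        obj = json.loads(text[start : end + 1])
        return obj if {"reasoning", "answer"} <= obj.keys() else None
    except json.JSONDecodeError:
        return None

prompt = PROMPT_TEMPLATE.format(question="What is 12 * 7?")
```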
The forgetting issue can be evident for methods like DPO and SimPO. For example, SimPO (on top of Llama3-8b-instruct) significantly hurts both MMLU-Redux and GSM performance. Mitigating the alignment tax is still a hard problem! [3/n]
Introducing the beta version of 𝙵𝚎𝚍𝙽𝙻𝙿, an open-source research platform for federated learning in NLP. Thanks to the awesome @huggingface and FedML, we integrate Transformer models and many popular FL methods (FedAvg, FedOpt, etc.). 🥳 Code: github.com/FedML-AI/FedNLP [1/4]
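As a refresher on the core aggregation step: FedAvg averages client model parameters weighted by local dataset size. A minimal NumPy sketch, deliberately simplified relative to any real FedNLP/FedML implementation:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Data-size-weighted average of per-client parameters (one array per layer)."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]
```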
The FedNLP platform supports various task formulations (e.g., classification, sequence tagging, reading comprehension, seq2seq) for realistic NLP applications. We implement many non-IID partitioning strategies (w.r.t. label, quantity, and feature) that are common in FL. [2/4]
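Label-based non-IID splits are commonly implemented by sampling per-client label proportions from a Dirichlet prior. Here is a sketch of that standard recipe (not necessarily FedNLP's exact code):

```python
import numpy as np

def dirichlet_label_partition(labels, n_clients, alpha=0.5, seed=0):
    """Split example indices across clients with Dirichlet(alpha) label skew.

    Smaller alpha -> more skewed (more non-IID) label distributions per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c that each client receives.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices
```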
Our experiments reveal a large gap between learning on decentralized vs. centralized datasets, opening exciting future research on FL methods suited to NLP tasks and beyond: personalization, robustness, safety, fairness, and so on! [3/4]