Research @allen_ai. LLM evaluation, synthetic data for alignment and agents, etc. Previously: @GoogleAI & @MetaAI FAIR, @nlp_usc
Feb 5 • 4 tweets • 4 min read
If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper — we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits.
🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs) structured as logic grid puzzles. Designed for precise control over complexity and generalization, it serves as an evaluation framework for testing LLMs on non-monotonic reasoning. Also, its complexity metrics—search space size and number of Z3 conflicts—enable the study of scaling behavior in reasoning models.
Some key findings below 👇
1️⃣ Scaling model size alone won’t break the curse of complexity. Larger models only improve on very easy problems; once difficulty crosses a threshold, even a 405B model’s accuracy drops to nearly zero.
2️⃣ Scaling test-time compute (longer CoTs) is the most promising approach. The gap between regular LLMs and reasoning-optimized ones is huge. But even the best reasoning models degrade sharply as problem complexity increases.
3️⃣ Strong reasoning models scale their CoT token usage with the number of Z3 conflicts; we found an almost linear correlation for o1-full. Since each Z3 conflict in a CSP corresponds to some form of backtracking, this helps explain why models like o1/R1 do so well on non-monotonic reasoning problems.
4️⃣ Repeated sampling with reward models, or self-verification prompting? Not effective. Tried, tested, didn’t work well.
We hope 🦓 ZebraLogic can facilitate future work on RL for reasoning LLMs and benefit awesome open research!
Links to the paper, dataset, and leaderboard are in the thread. We also share example details and data creation insights there.
🧵 We show that synthetic logical puzzles are ideal for studying reasoning LLMs like o1 & R1.
[1/n]
Each ZebraLogic example is a logic grid puzzle with a background scenario and a set of clues that define constraints for assigning values to N houses across M attributes. A reasoning model is tasked with producing a final solution table, correctly filling in all value assignments. Smaller grids or straightforward clues make the problem easier, but complexity grows rapidly with larger grids or trickier clue structures.
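A quick illustration of that growth (a sketch, not code from the paper): each attribute is a permutation of N values over the N houses, so an N-house, M-attribute puzzle has (N!)^M candidate solution tables before any clue is applied.

```python
# Back-of-the-envelope search-space size for an N-house, M-attribute grid puzzle:
# each attribute is a permutation of N values, giving (N!)^M candidate tables.
from math import factorial

def search_space_size(n_houses: int, n_attributes: int) -> int:
    return factorial(n_houses) ** n_attributes

for n, m in [(2, 3), (4, 4), (6, 6)]:
    print(f"{n}x{m} grid: {search_space_size(n, m):,} candidate tables")
```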
To generate these tasks, we use a variety of attributes, values, and language templates. Each puzzle is confirmed to have a unique solution, and its difficulty is gauged by running a Z3 solver multiple times, measuring the average number of conflicts it resolves per problem. These conflicts force reasoning models—like o1 or R1—to engage in backtracking, self-verification, error correction, process of elimination, and other branching strategies, mimicking the "Wait, let me check again" moments of human reasoning.
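For intuition, here is a minimal sketch (not the actual ZebraLogic generator) that encodes a tiny grid puzzle as a CSP in Z3 and reads the solver's conflict count, the same statistic used as a complexity metric above:

```python
# A toy 3-house puzzle encoded as a CSP with the z3-solver Python package.
from z3 import Int, Solver, Distinct, sat

alice, bob, carol = Int("alice"), Int("bob"), Int("carol")

s = Solver()
s.add(Distinct(alice, bob, carol))   # each person lives in a different house
for p in (alice, bob, carol):
    s.add(1 <= p, p <= 3)            # houses are numbered 1..3
s.add(alice != 1)                    # clue: Alice is not in house 1
s.add(bob == carol + 1)              # clue: Bob is directly right of Carol

if s.check() == sat:
    print("solution:", s.model())
    # Read the solver's conflict count (0 for trivial puzzles); harder puzzles
    # force more conflicts, i.e., more backtracking.
    stats = s.statistics()
    conflicts = {stats[i][0]: stats[i][1] for i in range(len(stats))}.get("conflicts", 0)
    print("Z3 conflicts:", conflicts)
```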
Though synthetic, these puzzles mirror constraint-satisfaction patterns found in everyday reasoning tasks such as scheduling, resource allocation, or travel planning. Also, we're planning to create new & harder tasks such that we can keep the evaluation alive and useful. Please stay tuned!
[2/n]
Jul 18, 2024 • 4 tweets • 3 min read
We've been re-evaluating LLMs with a unified setup by controlling factors such as prompting, sampling, output parsing, etc. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. Two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]
Github: github.com/yuchenlin/Zero…
In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted output. We are actively adding new tasks. Contributions are welcome! [2/n]
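Roughly, the prompt-and-parse loop looks like the sketch below (an illustration of the setup described above, not ZeroEval's actual code; the prompt wording and parser are assumptions):

```python
import json
import re

# Hypothetical zero-shot prompt asking for reasoning + answer in one JSON object.
PROMPT_TEMPLATE = (
    "Answer the question below. Respond with a JSON object with two keys:\n"
    '"reasoning" (your step-by-step reasoning) and "answer" (the final answer only).\n\n'
    "Question: {question}"
)

def parse_json_output(text: str) -> dict:
    """Extract and decode the first JSON object in a model response."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Example with a hypothetical model response:
response = '{"reasoning": "48 / 6 = 8", "answer": "8"}'
print(parse_json_output(response)["answer"])  # -> 8
```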
How should we maximize the planning ability of #LLM while reducing the computation cost? 🚀 Introducing SwiftSage, an agent inspired by “fast & slow thinking”, which solves complex interactive tasks much better than prior agents (e.g., DRRN, SayCan, ReAct, and Reflexion). [1/n]
💡 Let’s compare SwiftSage w/ prior agents: SayCan reranks actions w/ affordance; ReAct has subgoal planning; Reflexion adds self-reflection. However, these methods can be expensive yet brittle, and their error-prone actions/plans are hard to execute and ground in the environment. [2/n]
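Conceptually, the fast/slow loop looks something like this rough sketch (not SwiftSage's released code; env, fast_policy, and slow_planner are hypothetical interfaces): a cheap policy model acts by default, and the expensive LLM planner is only invoked when the agent gets stuck.

```python
def run_episode(env, fast_policy, slow_planner, max_steps=50, patience=3):
    """Fast-and-slow agent loop: cheap model acts, big LLM plans only when stuck."""
    obs, score, stalled = env.reset(), 0.0, 0
    plan = []  # buffered multi-step plan proposed by the slow (LLM) planner
    for _ in range(max_steps):
        if stalled >= patience and not plan:
            # Slow thinking: call the costly LLM planner for a multi-step plan.
            plan = slow_planner.plan(obs)
            stalled = 0
        # Fast thinking: follow the plan if one exists, else act with the small model.
        action = plan.pop(0) if plan else fast_policy.act(obs)
        obs, reward, done = env.step(action)
        stalled = stalled + 1 if reward <= 0 else 0
        score += reward
        if done:
            break
    return score
```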
Apr 18, 2021 • 4 tweets • 3 min read
Introducing the beta version of 𝙵𝚎𝚍𝙽𝙻𝙿, an open-source research platform for federated learning in NLP. Thanks to the awesome @huggingface and FedML, we integrate Transformer models and many popular FL methods (FedAvg, FedOpt, etc.). 🥳 Code: github.com/FedML-AI/FedNLP [1/4]
The FedNLP platform supports various task formulations (e.g., classification, seq tagging, reading comprehension, seq2seq, etc.) for realistic NLP applications. We implement many non-IID partitioning strategies (w.r.t. label, quantity, and feature) that are common for FL. [2/4]
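For example, a common way to induce label-based non-IID splits is Dirichlet sampling over classes; a minimal sketch (not FedNLP's exact implementation; alpha and the function interface are assumptions):

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients=10, alpha=0.5, seed=0):
    """Split example indices across clients with label skew controlled by alpha."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Fraction of class c that goes to each client; smaller alpha = more skew.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, shard in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Example: 1,000 examples with 4 labels split across 5 clients.
sizes = [len(p) for p in dirichlet_label_partition(np.random.randint(0, 4, 1000), num_clients=5)]
print(sizes)
```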