Bill Yuchen Lin 🤖
May 30, 2023 · 6 tweets · 5 min read
How should we maximize the planning ability of #LLMs while reducing the computation cost? 🚀 Introducing SwiftSage, an agent inspired by “fast & slow thinking”, which solves complex interactive tasks much better than prior agents (e.g., DRRN, SayCan, ReAct, and Reflexion). [1/n]
💡 Let’s compare SwiftSage w/ prior agents: SayCan reranks actions w/ affordance; ReAct has subgoal planning; Reflexion adds self-reflection. However, these methods can be expensive yet brittle, and it’s hard to execute & ground their error-prone actions/plans in the environment. [2/n]
🌠 A closer look at the 2 parts of SwiftSage: The Swift is a small LM (770M) for fast thinking; it learns the target environment via imitation learning. The Sage prompts LLMs for slow thinking in two stages, plan & ground, and produces an action buffer for interacting w/ the environment. [3/n]
✨ SwiftSage’s features (a sketch of the switching logic follows below):
1⃣️ Use imitation learning to train a small LM for fast thinking.
2⃣️ Only prompt LLMs when needed (e.g., no reward after 5 steps).
3⃣️ Separate planning and grounding subgoals when prompting LLMs.
4⃣️ Get multiple actions (~5) per LLM call. [4/n]
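A minimal sketch of the fast/slow switching loop, assuming hypothetical `swift_lm`, `sage_plan`, and `sage_ground` callables and a gym-style environment; the 5-step no-reward trigger mirrors feature 2⃣️ above (illustrative, not the authors' code):

```python
# Illustrative SwiftSage-style loop: fast thinking by default, slow thinking
# (two-stage LLM prompting) only when the agent appears to be stuck.

def run_episode(env, swift_lm, sage_plan, sage_ground, max_steps=100):
    obs = env.reset()
    action_buffer = []          # actions queued by the slow (Sage) module
    steps_without_reward = 0

    for _ in range(max_steps):
        if action_buffer:                        # Sage already planned ahead
            action = action_buffer.pop(0)
        elif steps_without_reward >= 5:          # trigger slow thinking
            subgoals = sage_plan(obs)            # stage 1: plan subgoals
            action_buffer = sage_ground(obs, subgoals)  # stage 2: ground to ~5 actions
            action = action_buffer.pop(0)
            steps_without_reward = 0
        else:                                    # default: fast thinking (small LM)
            action = swift_lm(obs)

        obs, reward, done = env.step(action)
        steps_without_reward = 0 if reward > 0 else steps_without_reward + 1
        if done:
            break
```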
🏆 We use ScienceWorld for evaluation. It’s a text-based environment with 30 task types, 10 locations, 200+ objects, and 25 actions. The tasks can be super complex and long-horizon, and they require exception handling. SwiftSage is 2x better and costs much less than the other agents! [5/n]
🔥 SwiftSage suggests that the small+large LM paradigm is super promising for complex tasks! Work done w/ @allen_ai & @nlp_usc folks: @YejinChoinka @xiangrenNLP @chandra_bhagav @rajammanabrolu @faeze_brh et al.
🔗 Website: yuchenlin.xyz/swiftsage/
🔗 Paper: arxiv.org/abs/2305.17390

More from @billyuchenlin

Feb 5, 2025
If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper — we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits.

🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs) structured as logic grid puzzles. Designed for precise control over complexity and generalization, it serves as an evaluation framework for testing LLMs on non-monotonic reasoning. Also, its complexity metrics—search space size and number of Z3 conflicts—enable the study of scaling behavior in reasoning models.
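As a back-of-the-envelope reading of the search-space metric (my interpretation, not a quote from the paper): with N houses and M attributes, each attribute assigns its N values as a permutation across houses, so there are (N!)^M candidate solution tables.

```python
# Search-space size for an N-houses x M-attributes logic grid puzzle,
# assuming each attribute's values form a permutation over the houses.
from math import factorial

def search_space_size(n_houses, m_attributes):
    return factorial(n_houses) ** m_attributes

print(search_space_size(4, 5))  # 24**5 = 7,962,624 candidate tables
```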

Some key findings below 👇

1️⃣ Scaling model size alone won’t break the curse of complexity. Larger models only improve on very easy problems; once difficulty crosses a threshold, even a 405B model’s accuracy drops to nearly zero.

2️⃣ Scaling test-time compute (longer CoTs) is the most promising approach. The gap between regular LLMs and reasoning-optimized ones is huge. But even the best reasoning models degrade sharply as problem complexity increases.

3️⃣ Great reasoning models tend to scale CoT tokens with Z3 conflicts very well; we found an almost linear correlation for o1-full. Each Z3 conflict in a CSP requires some form of backtracking in the reasoning, which explains why models like o1/R1 do so well on non-monotonic reasoning problems.

4️⃣ Repeated sampling with reward models, or self-verification prompting? Not effective. Tried, tested, didn’t work well.

We hope 🦓 ZebraLogic can facilitate future work on RL for reasoning LLMs and benefit awesome open research!

Links to the paper, dataset, and leaderboard are in the thread. We also share example details and data-creation insights there. 🧵

[1/n]
Each ZebraLogic example is a logic grid puzzle with a background scenario and a set of clues that define constraints for assigning values to N houses across M attributes. A reasoning model is tasked with producing a final solution table, correctly filling in all value assignments. Smaller grids or straightforward clues make the problem easier, but complexity grows rapidly with larger grids or trickier clue structures.

To generate these tasks, we use a variety of attributes, values, and language templates. Each puzzle is confirmed to have a unique solution, and its difficulty is gauged by running a Z3 solver multiple times, measuring the average number of conflicts it resolves per problem. These conflicts force reasoning models—like o1 or R1—to engage in backtracking, self-verification, error correction, process of elimination, and other branching strategies, mimicking the "Wait, let me check again" moments of human reasoning.
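A minimal sketch of this Z3-based difficulty measurement on a toy 2×2 puzzle (illustrative: the real pipeline averages conflicts over multiple runs, and Z3's statistics key names can vary across versions):

```python
# Tiny ZebraLogic-style puzzle encoded with z3py (pip install z3-solver).
from z3 import Int, Distinct, Solver, sat

# Each variable holds the house index (1..2) of a person or pet.
alice, bob = Int("alice"), Int("bob")
cat, dog = Int("cat"), Int("dog")

s = Solver()
for v in (alice, bob, cat, dog):
    s.add(1 <= v, v <= 2)
s.add(Distinct(alice, bob), Distinct(cat, dog))
s.add(alice == cat)   # clue: Alice owns the cat
s.add(bob == 1)       # clue: Bob lives in house 1

assert s.check() == sat
print(s.model())      # unique solution: alice=2, cat=2, bob=1, dog=1

# Difficulty proxy: conflicts hit during search. Trivial instances like
# this one may report zero; hard benchmark instances force many.
stats = s.statistics()
conflicts = {k: stats.get_key_value(k) for k in stats.keys()}.get("conflicts", 0)
print("conflicts:", conflicts)
```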

Though synthetic, these puzzles mirror constraint-satisfaction patterns found in everyday reasoning tasks such as scheduling, resource allocation, or travel planning. We’re also planning to create new & harder tasks so that the evaluation stays alive and useful. Please stay tuned!

[2/n]
On the ZebraLogic leaderboard, @deepseek_ai 's R1 is quite close to o1-full—performing slightly better on simpler tasks but falling behind on the extremely difficult ones. Non-reasoning LLMs lag significantly here. However, their rankings on small-to-medium-sized tasks still provide valuable insight into their potential for future RL optimization for reasoning.

The latest leaderboard is here on @huggingface: hf.co/spaces/WildEva…
Jul 18, 2024
We've been re-evaluating LLMs with a unified setup, controlling factors such as prompting, sampling, and output parsing. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. Two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]

Github: github.com/yuchenlin/Zero…
In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted response. We are actively adding new tasks; contributions are welcome! [2/n]
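A minimal sketch of this zero-shot JSON-output pattern (illustrative; the actual prompts and parsers live in the ZeroEval repo linked below):

```python
# Sketch: instruct the model to emit {"reasoning": ..., "answer": ...},
# then extract and parse that JSON object from the raw response.
import json
import re

PROMPT_TEMPLATE = """Answer the following question. Think step by step, then
respond with ONLY a JSON object of the form:
{{"reasoning": "<your step-by-step reasoning>", "answer": "<final answer>"}}

Question: {question}"""

def parse_response(text):
    """Extract the first JSON object from a model response; None on failure."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None  # recorded as a parsing failure
    try:
        obj = json.loads(match.group(0))
        return obj.get("reasoning"), obj.get("answer")
    except json.JSONDecodeError:
        return None

prompt = PROMPT_TEMPLATE.format(question="What is 12 * 12?")
print(parse_response('{"reasoning": "12*12 = 144", "answer": "144"}'))
```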

Add your task to ZeroEval: ⬇️ github.com/yuchenlin/Zero…
The forgetting issue can be evident for methods like DPO and SimPO. For example, SimPO (on top of Llama3-8b-instruct) can significantly hurt both MMLU-Redux and GSM performance. Mitigating the alignment tax is still a hard problem! [3/n]

More results: github.com/yuchenlin/Zero…
Apr 18, 2021
Introducing the beta version of 𝙵𝚎𝚍𝙽𝙻𝙿, an open-source research platform for federated learning in NLP. Thanks to the awesome @huggingface and FedML, we integrate Transformer models and many popular FL methods (FedAvg, FedOpt, etc.). 🥳 Code: github.com/FedML-AI/FedNLP [1/4]
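A minimal sketch of FedAvg-style aggregation, the core of one supported method (illustrative; assumes PyTorch state dicts, not the FedNLP implementation):

```python
# Sketch of FedAvg: a weighted average of client model weights,
# proportional to each client's local dataset size.
import torch

def fed_avg(client_state_dicts, client_num_samples):
    """Aggregate per-client state dicts into one global state dict."""
    total = float(sum(client_num_samples))
    global_state = {}
    for key in client_state_dicts[0]:
        global_state[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(client_state_dicts, client_num_samples)
        )
    return global_state
```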
The FedNLP platform supports various task formulations (e.g., classification, sequence tagging, reading comprehension, seq2seq) for realistic NLP applications. We implement many non-IID partitioning strategies (w.r.t. label, quantity, and feature) that are common in FL. [2/4]
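One common label-skew strategy is Dirichlet-based partitioning; a minimal sketch below (illustrative, not necessarily the exact FedNLP code; `alpha` controls how non-IID the split is, with smaller values giving more skew):

```python
# Sketch: split example indices across clients with label-skewed
# proportions drawn from a Dirichlet prior.
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Per-class client proportions ~ Dirichlet(alpha, ..., alpha).
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Smaller alpha -> each client sees a narrower label distribution.
parts = dirichlet_label_partition([0, 1] * 50, num_clients=4, alpha=0.1)
```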
Our experiments reveal that there exists a large gap between learning on decentralized and centralized datasets, opening exciting future research aimed at developing FL methods suited to NLP tasks and beyond: personalization, robustness, safety, fairness, and so on! [3/4]
