How should we maximize the planning ability of #LLM while reducing the computation cost? 🚀 Introducing SwiftSage, an agent inspired by “fast & slow thinking”, which solves complex interactive tasks much better than prior agents (e.g., DRRN, SayCan, ReAct, and Reflexion). [1/n]
💡 Let’s compare SwiftSage w/ prior agents: SayCan reranks actions w/ affordance; ReAct adds subgoal planning; Reflexion adds self-reflection. However, these methods can be expensive yet brittle, and their error-prone actions/plans are hard to execute & ground in the env. [2/n]
🌠 A closer look at the 2 parts of SwiftSage: The Swift is a small LM (770M) for fast thinking; it learns the target env through imitation learning. The Sage prompts LLMs for slow thinking in two stages, plan & ground, producing an action buffer for interacting w/ the env. [3/n]
✨ SwiftSage’s features: 1⃣️ Use imitation learning to train a small LM for fast thinking. 2⃣️ Only prompt LLMs when needed (e.g., no reward after 5 steps). 3⃣️ Separate planning and grounding of subgoals when prompting LLMs. 4⃣️ Get multiple actions (~5) per LLM call. See the sketch of the switching loop below. [4/n]
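To make the fast & slow switching concrete, here is a minimal sketch of the control loop in Python. All names (env, swift_lm, sage, plan/ground) are hypothetical, not the actual SwiftSage API; the trigger condition follows the "no reward after 5 steps" rule above.

```python
from collections import deque

def run_episode(env, swift_lm, sage, max_steps=100, patience=5):
    """Hypothetical SwiftSage-style control loop (illustrative, not the official API).

    swift_lm: small LM trained with imitation learning (fast thinking).
    sage:     wrapper that prompts an LLM in two stages, plan -> ground,
              returning a short buffer of grounded actions (slow thinking).
    """
    obs, reward = env.reset(), 0.0
    action_buffer = deque()          # actions produced by the slow Sage module
    steps_without_reward = 0

    for _ in range(max_steps):
        if action_buffer:
            action = action_buffer.popleft()      # drain Sage's plan first
        elif steps_without_reward >= patience:
            # Slow thinking: plan subgoals, then ground them into ~5 env actions.
            plan = sage.plan(obs)
            action_buffer.extend(sage.ground(plan, obs))
            action = action_buffer.popleft()
            steps_without_reward = 0
        else:
            action = swift_lm.predict(obs)        # fast thinking, cheap

        obs, reward, done = env.step(action)
        steps_without_reward = 0 if reward > 0 else steps_without_reward + 1
        if done:
            break
```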
🏆 We evaluate on ScienceWorld, a text-based environment with 30 task types, 10 locations, 200+ objects, and 25 actions. Its tasks can be super complex and long-horizon, and they require exception handling. SwiftSage scores 2x higher and costs much less than prior agents! [5/n]
If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper — we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits.
🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs) structured as logic grid puzzles. Designed for precise control over complexity and generalization, it serves as an evaluation framework for testing LLMs on non-monotonic reasoning. Also, its complexity metrics—search space size and number of Z3 conflicts—enable the study of scaling behavior in reasoning models.
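For intuition on the search-space metric: assuming each of the M attributes assigns its N distinct values one-to-one to the N houses (a permutation per attribute), an N×M puzzle has (N!)^M candidate solutions before any clue is applied. A quick sketch:

```python
from math import factorial

def search_space_size(n_houses: int, m_attributes: int) -> int:
    # Each attribute is a permutation of its N values over the N houses,
    # so the unconstrained space is (N!)^M.
    return factorial(n_houses) ** m_attributes

print(search_space_size(3, 3))  # 216
print(search_space_size(6, 6))  # 139,314,069,504,000,000 (grows explosively)
```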
Some key findings below 👇
1️⃣ Scaling model size alone won’t break the curse of complexity. Larger models improve only on very easy problems; once difficulty crosses a threshold, even a 405B model’s accuracy drops to nearly zero.
2️⃣ Scaling test-time compute (longer CoTs) is the most promising approach. The gap between regular LLMs and reasoning-optimized ones is huge. But even the best reasoning models degrade sharply as problem complexity increases.
3️⃣ Strong reasoning models scale their CoT token usage with the number of Z3 conflicts remarkably well; for o1-full the correlation is almost linear. Since each Z3 conflict in a CSP corresponds to some form of backtracking, this helps explain why models like o1/R1 do so well on non-monotonic reasoning problems.
4️⃣ Repeated sampling with reward models (RMs) or self-verification prompting? Not effective. We tried and tested both; neither worked well.
We hope 🦓 ZebraLogic can facilitate future work on RL for reasoning LLMs and benefit awesome open research!
Links to the paper, dataset, and leaderboard are in the thread. We also share example details and data-creation insights there.
🧵 We show that synthetic logical puzzles are ideal for studying reasoning LLMs like o1 & R1. [1/n]
Each ZebraLogic example is a logic grid puzzle with a background scenario and a set of clues that define constraints for assigning values to N houses across M attributes. A reasoning model is tasked with producing a final solution table, correctly filling in all value assignments. Smaller grids or straightforward clues make the problem easier, but complexity grows rapidly with larger grids or trickier clue structures.
To generate these tasks, we use a variety of attributes, values, and language templates. Each puzzle is confirmed to have a unique solution, and its difficulty is gauged by running a Z3 solver multiple times, measuring the average number of conflicts it resolves per problem. These conflicts force reasoning models—like o1 or R1—to engage in backtracking, self-verification, error correction, process of elimination, and other branching strategies, mimicking the "Wait, let me check again" moments of human reasoning.
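To make the difficulty metric concrete, here is a minimal z3py sketch: a toy 3-house, 2-attribute puzzle encoded as integer constraints, with the solver's "conflicts" statistic read back after solving. The toy puzzle and helper names are illustrative, not the actual data-generation pipeline; real ZebraLogic instances are graded by averaging conflicts over repeated solver runs.

```python
from z3 import Solver, Int, Distinct, And, sat

def count_conflicts(constraints):
    """Solve once and return z3's 'conflicts' statistic (0 if none reported)."""
    s = Solver()
    s.add(constraints)
    assert s.check() == sat
    stats = s.statistics()
    # z3 only reports 'conflicts' when it actually had to backtrack.
    return {k: stats.get_key_value(k) for k in stats.keys()}.get("conflicts", 0)

# Toy 3x2 grid: colors and pets are each assigned to houses 1..3, one per house.
red, green, blue = Int("red"), Int("green"), Int("blue")
cat, dog, fish = Int("cat"), Int("dog"), Int("fish")

constraints = [
    And(1 <= v, v <= 3) for v in (red, green, blue, cat, dog, fish)
] + [
    Distinct(red, green, blue),  # each color in a different house
    Distinct(cat, dog, fish),    # each pet in a different house
    red == cat + 1,              # clue: the red house is right of the cat owner
    green != 2,                  # clue: the green house is not in the middle
    dog == 3,                    # clue: the dog lives in the last house
]

# Tiny puzzles may be solved by pure propagation (0 conflicts).
print(count_conflicts(constraints))
```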
Though synthetic, these puzzles mirror constraint-satisfaction patterns found in everyday reasoning tasks such as scheduling, resource allocation, or travel planning. We’re also planning to create new & harder tasks so that the evaluation stays alive and useful. Please stay tuned!
[2/n]
On ZebraLogic leaderboard, @deepseek_ai 's R1 is quite close to o1-full—performing slightly better on simpler tasks but falling behind on the extremely difficult ones. The other non-reasoning LLMs lag significantly in this case. However, their rankings on small-to-medium-sized tasks still provide valuable insight into their potential for future RL optimization for reasoning.
We've been re-evaluating LLMs with a unified setup, controlling factors such as prompting, sampling, and output parsing. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. The two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]
In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted response. We are actively adding new tasks; contributions are welcome! [2/n]
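For a sense of what this looks like in practice, here is a minimal sketch of a ZeroEval-style zero-shot prompt and the strict JSON parsing step. The template wording and field names are illustrative assumptions, not ZeroEval's exact ones.

```python
import json

PROMPT_TEMPLATE = """Answer the following question. Think step by step, then
respond with ONLY a JSON object of the form:
{{"reasoning": "<your step-by-step reasoning>", "answer": "<final answer>"}}

Question: {question}"""

def parse_response(text: str):
    """Extract the JSON object from a model response; return None if malformed."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        obj = json.loads(text[start : end + 1])
        return obj if {"reasoning", "answer"} <= obj.keys() else None
    except json.JSONDecodeError:
        return None

prompt = PROMPT_TEMPLATE.format(question="What is 12 * 7?")
```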
The forgetting issue can be evident for methods like DPO and SimPO. For example, SimPO (on top of Llama3-8b-instruct) significantly hurts both MMLU-Redux and GSM performance. Mitigating the alignment tax is still a hard problem! [3/n]
Introducing the beta version of 𝙵𝚎𝚍𝙽𝙻𝙿, an open-source research platform for federated learning in NLP. Thanks to the awesome @huggingface and FedML, we integrate Transformer models and many popular FL methods (FedAvg, FedOpt, etc.). 🥳 Code: github.com/FedML-AI/FedNLP [1/4]
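As a refresher on the core aggregation step: FedAvg averages client model parameters weighted by local dataset size. A minimal NumPy sketch, deliberately simplified relative to any real FedNLP/FedML implementation:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Data-size-weighted average of per-client parameters (one array per layer)."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]
```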
The FedNLP platform supports various task formulations (e.g., classification, sequence tagging, reading comprehension, seq2seq) for realistic NLP applications. We implement many non-IID partitioning strategies (w.r.t. label, quantity, and feature) that are common in FL. [2/4]
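Label-based non-IID splits are commonly implemented by sampling per-client label proportions from a Dirichlet prior. Here is a sketch of that standard recipe (not necessarily FedNLP's exact code):

```python
import numpy as np

def dirichlet_label_partition(labels, n_clients, alpha=0.5, seed=0):
    """Split example indices across clients with Dirichlet(alpha) label skew.

    Smaller alpha -> more skewed (more non-IID) label distributions per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c that each client receives.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices
```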
Our experiments reveal a large gap between learning on decentralized vs. centralized datasets, opening exciting future research on FL methods suited to NLP tasks and beyond: personalization, robustness, safety, fairness, and so on! [3/4]