We've been re-evaluating LLMs with a unified setup that controls factors such as prompting, sampling, and output parsing. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. The two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]
In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its answer in JSON format. We are actively adding new tasks. Contributions are welcome! [2/n]
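A rough sketch of what zero-shot prompting with a JSON-formatted reasoning + answer output could look like. The prompt wording, the key names `reasoning` and `answer`, and the helpers `build_prompt` and `parse_output` are assumptions for illustration, not ZeroEval's actual code:

```python
import json

# Hypothetical template: the thread does not show ZeroEval's real prompt wording.
PROMPT_TEMPLATE = (
    "Answer the question below.\n"
    "Respond with a JSON object that has two keys: "
    '"reasoning" (your step-by-step thinking) and "answer" (the final answer).\n\n'
    "Question: {question}\n"
)


def build_prompt(question: str) -> str:
    """Build a zero-shot prompt (no in-context examples)."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_output(raw: str):
    """Pull the JSON object out of a model response.

    Returns a dict with 'reasoning' and 'answer', or None if nothing parseable
    is found. Models often wrap JSON in extra text, so we take the span from
    the first '{' to the last '}'.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "answer" not in obj:
        return None
    return {
        "reasoning": str(obj.get("reasoning", "")),
        "answer": str(obj["answer"]).strip(),
    }
```

Using the same parser for every model is one way to keep output parsing from becoming a hidden variable in the comparison.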
The forgetting issue can be evident for methods like DPO and SimPO. For example, SimPO (on top of Llama3-8b-instruct) can significantly hurt both MMLU-Redux and GSM performance. Mitigating the alignment tax is still a hard problem! [3/n]
We find that gemma-2-27b-it on @vllm_project may have some problems. Its performance is much worse than with @togethercompute's API or vanilla @huggingface inference. The 9B model on vLLM is okay, though. [4/n]
How should we maximize the planning ability of #LLM while reducing the computation cost? 🚀 Introducing SwiftSage, an agent inspired by “fast & slow thinking”, which solves complex interactive tasks much better than prior agents (e.g., DRRN, SayCan, ReAct, and Reflexion). [1/n]
💡 Let’s compare SwiftSage w/ prior agents: SayCan reranks actions w/ affordance; ReAct has subgoal planning; Reflexion adds self-reflection. However, these methods can be expensive and yet brittle. It’s also hard to execute & ground their error-prone actions/plans in env. [2/n]
🌠 A closer look at the 2 parts of SwiftSage: The Swift is a small LM (770M) for fast thinking. It’s super familiar with the target env thanks to imitation learning. The Sage prompts LLMs for slow thinking in two stages, plan & ground, to get an action buffer for interacting w/ the env. [3/n]
Introducing the beta version of 𝙵𝚎𝚍𝙽𝙻𝙿, an open-source research platform for federated learning in NLP. Thanks to the awesome @huggingface and FedML, we integrate Transformer models and many popular FL methods (FedAvg, FedOpt, etc.). 🥳 Code: github.com/FedML-AI/FedNLP [1/4]
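For readers new to FL, here is a minimal pure-Python sketch of the FedAvg aggregation step: the server averages client parameters, weighted by each client's number of training examples. The function name `fedavg` and the dict-of-lists parameter layout are illustrative assumptions and are not FedNLP's or FedML's API:

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client parameters (the FedAvg server update).

    client_weights: list of dicts mapping a parameter name to a flat list of floats.
    client_sizes:   number of local training examples per client, same order.
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two toy clients; the second holds 3x as much data, so it counts 3x as much.
global_weights = fedavg(
    [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}],
    [1, 3],
)
```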
The FedNLP platform supports various task formulations (e.g., classification, seq tagging, reading comprehension, seq2seq) for realistic NLP applications. We implement many non-IID partitioning strategies (w.r.t. label, quantity, feature) that are common for FL. [2/4]
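To make "non-IID w.r.t. label" concrete, here is a sketch of one common approach: split each label's examples across clients using proportions drawn from a Dirichlet distribution. The thread does not say which scheme FedNLP uses, so the Dirichlet choice, the function name `dirichlet_label_partition`, and the `alpha` parameter are assumptions:

```python
import random
from collections import defaultdict


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign example indices to clients with label skew.

    For each label, client shares are drawn from Dirichlet(alpha, ..., alpha).
    Smaller alpha gives more skewed (more non-IID) clients.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_label):
        idxs = by_label[y]
        rng.shuffle(idxs)
        # Normalized Gamma(alpha, 1) draws form a Dirichlet sample.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0  # guard against an all-zero draw
        cumulative, start = 0.0, 0
        for c, g in enumerate(gammas):
            cumulative += g / total
            # The last client takes whatever remains, so every index is assigned.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy data: 200 examples over 4 labels, split across 5 clients.
toy_labels = [i % 4 for i in range(200)]
parts = dirichlet_label_partition(toy_labels, num_clients=5, alpha=0.5)
```

Quantity skew could be sketched the same way by drawing one set of proportions over the whole dataset instead of per label.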
Our experiments reveal that there exists a large gap between learning on decentralized and centralized datasets --- opening exciting future research aimed at developing FL methods suited to NLP tasks and beyond: personalization, robustness, safety, fairness, and so on! [3/4]