Bill Yuchen Lin 🤖
Jul 18 · 4 tweets · 3 min read
We've been re-evaluating LLMs with a unified setup that controls for factors such as prompting, sampling, and output parsing. Introducing 🔥 ZeroEval: a simple unified framework for evaluating LLMs. The two initial tasks are MMLU-Redux and GSM. Btw, GPT-4o-mini @openai is super great. [1/n]

GitHub: github.com/yuchenlin/Zero…
In ZeroEval, we use zero-shot prompting and instruct the LM to output both its reasoning and its answer as JSON. We are actively adding new tasks. Contributions are welcome! [2/n]

Add your task to ZeroEval: ⬇️ github.com/yuchenlin/Zero…
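
To make the zero-shot, JSON-output idea concrete, here is a minimal Python sketch. It is not ZeroEval's actual code: the prompt wording, the key names `reasoning` and `answer`, and the helpers `build_prompt` and `parse_output` are all assumptions made for illustration.

```python
import json
import re

# Hypothetical prompt template; ZeroEval's real templates may differ.
PROMPT_TEMPLATE = (
    "Answer the following question.\n\n"
    "Question: {question}\n\n"
    "Respond with a single JSON object that has two keys: "
    '"reasoning" (your step-by-step reasoning) and "answer" (your final answer).'
)


def build_prompt(question: str) -> str:
    """Fill the zero-shot template (no in-context examples) with one question."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_output(raw: str):
    """Return the parsed JSON object from a model reply, or None if parsing fails.

    Takes the span from the first '{' to the last '}' so that extra text
    around the JSON (greetings, code fences) does not break parsing.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Require a dict with an "answer" key so downstream scoring stays simple.
    if not isinstance(parsed, dict) or "answer" not in parsed:
        return None
    return parsed
```

Returning `None` on malformed output keeps parse failures separate from wrong answers, so each model can be scored with the same parsing rule.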
Forgetting can be a clear issue for methods like DPO and SimPO. For example, SimPO (on top of Llama3-8b-instruct) can significantly hurt both MMLU-Redux and GSM performance. Mitigating the alignment tax is still a hard problem! [3/n]

More results: github.com/yuchenlin/Zero…
We find that gemma-2-27b-it on @vllm_project may have some problems. Its performance is much worse than with @togethercompute's API or vanilla @huggingface inference. The 9B model on vLLM is okay, though. [4/n]
