Latest Twitter Threads by @gneubig on Thread Reader App

Dec 19, 2024 • 13 tweets • 6 min read

How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?

In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.

Why is this benchmark important?

Right now it is unclear how effective AI is at helping with real-world work. We hear extreme statements like:

> AI is overhyped, minimally helpful, and doesn’t generalize to new tasks
> AGI will automate all human work in the next few years

Dec 19, 2023 • 12 tweets • 9 min read

Google’s Gemini recently made waves as a major competitor to OpenAI’s GPT. Exciting! But we wondered:

How good is Gemini really?

At CMU, we performed an impartial, in-depth, and reproducible study comparing Gemini, GPT, and Mixtral.

Paper:
🧵 arxiv.org/abs/2312.11444

We compared accuracy across 6 different varieties of tasks:
* Knowledge-based QA (MMLU)
* Reasoning (BIG-Bench Hard)
* Math (GSM8k, SVAMP, ASDIV, MAWPS)
* Code Gen (HumanEval, ODEX)
* Translation (FLORES)
* Web Instruction Following (WebArena)

May 18, 2023 • 14 tweets • 9 min read

There are so many chatbots nowadays, it’s hard to keep up!

To help out, we made an open source tool for automatic comparison of chatbots, and created a report on LLaMa, Alpaca, Vicuna, ChatGPT, Cohere, etc.!

Report: github.com/zeno-ml/zeno-b…
Browser: zeno-ml-chatbot-report.hf.space

🧵⬇️

Our new tool, “Zeno Build” (github.com/zeno-ml/zeno-b…), aims to make it easier to build and evaluate systems using LMs, and includes:

* Interfaces to various open-source and API-based models
* Automatic evaluation of the responses
* Visualization and fine-grained analysis

Dec 15, 2022 • 7 tweets • 3 min read

CMU Advanced NLP is done for 2022! Check the videos on YouTube 😃

I also rehauled our assignments to reflect important skills in NLP for 2022: github.com/neubig/nlp-fro…
If you're teaching/learning NLP see the 🧵 and doc for more!

https://twitter.com/gneubig/status/1569319366558105602

Basically, there have been *huge* changes in NLP due to advances BERT and GPT-3. And the skills needed to be a good NLP researcher or engineer have changed too! I've re-designed our assignments to reflect this.

Mar 3, 2022 • 5 tweets • 3 min read

Retrieval-based models are increasingly important in NLP/QA. But an important factor in modeling text is knowing *where* it came from. Our #ICLR2022 paper proposes retrieval-based LMs considers the "structural locality" of texts to improve retrieval: arxiv.org/abs/2110.02870 🧵↓

We demonstrate this on two example datasets: Wikipedia articles and Java code. We leveraging the article and project structure respectively to define different "locality" levels between two documents.

Oct 14, 2021 • 7 tweets • 3 min read

We've been on a multi-year effort to take steps towards understanding how well NLP/language tech serves people on a *global* scale. Here's a first report: arxiv.org/abs/2110.06733

We perform meta-analysis of performance across 7 tasks, and devise "global utility" metrics. 1/7

The idea is that language tech should serve every person in the world, not just English native speakers. Based on this, we come up with metrics for language-weighted and population-weighted performance that explicitly consider how many people or languages may benefit 2/7

Mar 9, 2020 • 6 tweets • 3 min read

Super-excited about our new #ICASSP2020 paper on "Universal Phone Recognition with a Multilingual Allophone System" arxiv.org/abs/2002.11800

We create a multi-lingual ASR model that can do zero-shot phone recognition in up to 2,186 languages! How? A little linguistics :) 1/5 In our speech there are phonemes (sounds that can support lexical contrasts in a *particular* language) and their corresponding phones (the sounds that are actually spoken, which are language *independent*). Most multilingual ASR models conflate these two concepts. 2/5

Share this page!

Enter URL or ID to Unroll