Graham Neubig
Associate professor at CMU, studying natural language processing and machine learning. CEO, Inspired Cognition (https://t.co/sj5PilHMK9).
Dec 19 13 tweets 6 min read
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?

In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.

Why is this benchmark important?

Right now it is unclear how effective AI is at helping with real-world work. We hear extreme statements like:

> AI is overhyped, minimally helpful, and doesn’t generalize to new tasks
> AGI will automate all human work in the next few years
Dec 19, 2023 12 tweets 9 min read
Google’s Gemini recently made waves as a major competitor to OpenAI’s GPT. Exciting! But we wondered:

How good is Gemini really?

At CMU, we performed an impartial, in-depth, and reproducible study comparing Gemini, GPT, and Mixtral.

Paper: arxiv.org/abs/2312.11444 🧵

We compared accuracy across 6 different varieties of tasks:
* Knowledge-based QA (MMLU)
* Reasoning (BIG-Bench Hard)
* Math (GSM8k, SVAMP, ASDIV, MAWPS)
* Code Gen (HumanEval, ODEX)
* Translation (FLORES)
* Web Instruction Following (WebArena)
May 18, 2023 14 tweets 9 min read
There are so many chatbots nowadays, it’s hard to keep up!

To help out, we made an open-source tool for automatic comparison of chatbots, and created a report on LLaMA, Alpaca, Vicuna, ChatGPT, Cohere, etc.!

Report: github.com/zeno-ml/zeno-b…
Browser: zeno-ml-chatbot-report.hf.space

🧵⬇️

Our new tool, “Zeno Build” (github.com/zeno-ml/zeno-b…), aims to make it easier to build and evaluate systems using LMs, and includes:

* Interfaces to various open-source and API-based models
* Automatic evaluation of the responses
* Visualization and fine-grained analysis
Dec 15, 2022 7 tweets 3 min read
CMU Advanced NLP is done for 2022! Check the videos on YouTube 😃

I also overhauled our assignments to reflect important skills in NLP for 2022: github.com/neubig/nlp-fro…
If you're teaching/learning NLP see the 🧵 and doc for more!

Basically, there have been *huge* changes in NLP due to advances like BERT and GPT-3. And the skills needed to be a good NLP researcher or engineer have changed too! I've re-designed our assignments to reflect this.
Mar 3, 2022 5 tweets 3 min read
Retrieval-based models are increasingly important in NLP/QA. But an important factor in modeling text is knowing *where* it came from. Our #ICLR2022 paper proposes retrieval-based LMs that consider the "structural locality" of texts to improve retrieval: arxiv.org/abs/2110.02870 🧵↓

We demonstrate this on two example datasets: Wikipedia articles and Java code. We leverage the article and project structure, respectively, to define different "locality" levels between two documents.
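
A minimal sketch of what a "locality" level between two source files could look like, for the Java code case. The three levels used here (different project, same project, same directory) and the function name `locality_level` are assumptions for illustration, not the paper's exact definition.

```python
from pathlib import PurePosixPath


def locality_level(path_a, path_b):
    """Return a coarse locality level for two file paths.

    Assumed levels (hypothetical, for illustration):
      0 = different projects
      1 = same project, different directory
      2 = same project, same directory
    The first path component is treated as the project name.
    """
    a, b = PurePosixPath(path_a), PurePosixPath(path_b)
    if a.parts[0] != b.parts[0]:
        return 0
    if a.parent != b.parent:
        return 1
    return 2


# Hypothetical paths, not taken from the paper's dataset.
far = locality_level("projA/src/util/Strings.java", "projB/src/util/Lists.java")
near = locality_level("projA/src/util/Strings.java", "projA/src/io/Reader.java")
nearest = locality_level("projA/src/util/Strings.java", "projA/src/util/Lists.java")
```

A retrieval-based LM could then treat retrieved neighbors differently depending on this level, for example by favoring neighbors from the same directory.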
Oct 14, 2021 7 tweets 3 min read
We've been on a multi-year effort to take steps towards understanding how well NLP/language tech serves people on a *global* scale. Here's a first report: arxiv.org/abs/2110.06733

We perform a meta-analysis of performance across 7 tasks, and devise "global utility" metrics. 1/7

The idea is that language tech should serve every person in the world, not just native English speakers. Based on this, we come up with metrics for language-weighted and population-weighted performance that explicitly consider how many people or languages may benefit. 2/7
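
To make the two weightings concrete, here is a rough sketch. It assumes each language has a normalized performance score in [0, 1] and a speaker count, and it uses a single exponent `tau` to move between the two averages. The function name, the `tau` parameter, and all numbers are assumptions for illustration, not values or code from the paper.

```python
def demand_weighted_utility(utility, speakers, tau):
    """Average per-language utility, weighting each language by speakers**tau.

    tau = 1 gives a population-weighted average (big languages dominate);
    tau = 0 weights every language equally (a language-weighted average).
    """
    weights = {lang: speakers[lang] ** tau for lang in utility}
    total = sum(weights.values())
    return sum(weights[lang] * utility[lang] for lang in utility) / total


# Hypothetical languages and numbers, for illustration only.
utility = {"A": 0.9, "B": 0.5, "C": 0.1}   # normalized task performance
speakers = {"A": 800, "B": 150, "C": 50}   # speakers, in millions

population_weighted = demand_weighted_utility(utility, speakers, tau=1)
language_weighted = demand_weighted_utility(utility, speakers, tau=0)
```

In this toy setup the population-weighted score is much higher than the language-weighted one, because the best-served language also has the most speakers.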
Mar 9, 2020 6 tweets 3 min read
Super-excited about our new #ICASSP2020 paper on "Universal Phone Recognition with a Multilingual Allophone System" arxiv.org/abs/2002.11800

We create a multi-lingual ASR model that can do zero-shot phone recognition in up to 2,186 languages! How? A little linguistics :) 1/5

In our speech there are phonemes (sounds that can support lexical contrasts in a *particular* language) and their corresponding phones (the sounds that are actually spoken, which are language *independent*). Most multilingual ASR models conflate these two concepts. 2/5
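
One way to picture the phone/phoneme distinction is as a per-language mapping on top of a shared phone inventory. The sketch below assumes a model that scores language-independent phones, and scores each language-specific phoneme by the best-scoring phone among its allophones. The max-over-allophones rule, the function name, and the scores are assumptions for illustration, not the paper's implementation.

```python
def phoneme_scores(phone_scores, allophones):
    """Score each phoneme by the best-scoring phone among its allophones."""
    return {
        phoneme: max(phone_scores[phone] for phone in phones)
        for phoneme, phones in allophones.items()
    }


# Hypothetical scores over a shared, language-independent phone inventory.
phone_scores = {"p": 0.2, "pʰ": 0.7, "b": 0.1}

# In English, plain [p] and aspirated [pʰ] are allophones of one phoneme /p/.
english = {"/p/": ["p", "pʰ"], "/b/": ["b"]}

# A hypothetical language where aspiration is contrastive keeps them apart.
contrastive = {"/p/": ["p"], "/pʰ/": ["pʰ"], "/b/": ["b"]}

english_scores = phoneme_scores(phone_scores, english)
contrastive_scores = phoneme_scores(phone_scores, contrastive)
```

The same phone scores give different phoneme scores in the two languages, which is why conflating phones and phonemes loses information.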