Post

@blasi_lang

More from @gneubig

Graham Neubig

@gneubig

Dec 19, 2024

How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?

In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.

Why is this benchmark important?

Right now it is unclear how effective AI is at helping with real-world work. We hear extreme statements like:

> AI is overhyped, minimally helpful, and doesn’t generalize to new tasks
> AGI will automate all human work in the next few years

This question is hard to answer, but it has implications for:
- Companies: to understand where to incorporate AI in workflows
- Workers: to get a grounded sense of what AI can and cannot do
- Policymakers: to understand effects of AI on the labor market

How can we begin on it?

Read 13 tweets

Graham Neubig

@gneubig

Dec 19, 2023

Google’s Gemini recently made waves as a major competitor to OpenAI’s GPT. Exciting! But we wondered:

How good is Gemini really?

At CMU, we performed an impartial, in-depth, and reproducible study comparing Gemini, GPT, and Mixtral.

Paper:
🧵 arxiv.org/abs/2312.11444

We compared accuracy across 6 different varieties of tasks:
* Knowledge-based QA (MMLU)
* Reasoning (BIG-Bench Hard)
* Math (GSM8k, SVAMP, ASDIV, MAWPS)
* Code Gen (HumanEval, ODEX)
* Translation (FLORES)
* Web Instruction Following (WebArena)

We tried to control for all variables, using the same prompts, generation params, and evals for all models for fairness. We used:
* @LiteLLM to query models in a uniform way
* @try_zeno to do comprehensive in-depth analysis
All code/data available here: github.com/neulab/gemini-…

Read 12 tweets

Graham Neubig

@gneubig

May 18, 2023

There are so many chatbots nowadays, it’s hard to keep up!

To help out, we made an open source tool for automatic comparison of chatbots, and created a report on LLaMa, Alpaca, Vicuna, ChatGPT, Cohere, etc.!

Report: github.com/zeno-ml/zeno-b…
Browser: zeno-ml-chatbot-report.hf.space

🧵⬇️

Our new tool, “Zeno Build” (github.com/zeno-ml/zeno-b…), aims to make it easier to build and evaluate systems using LMs, and includes:

* Interfaces to various open-source and API-based models
* Automatic evaluation of the responses
* Visualization and fine-grained analysis

@OpenAI

To compare chatbots, we put the following models head-to-head:

@OpenAI GPT-2
@MetaAI LLaMa
@stanfordnlp Alpaca
@lmsysorg Vicuna
@MosaicML MPT
@OpenAI gpt-3.5-turbo
@CohereAI command-xlarge

Read 14 tweets

Graham Neubig

@gneubig

Dec 15, 2022

https://twitter.com/gneubig/status/1569319366558105602

CMU Advanced NLP is done for 2022! Check the videos on YouTube 😃

I also rehauled our assignments to reflect important skills in NLP for 2022: github.com/neubig/nlp-fro…
If you're teaching/learning NLP see the 🧵 and doc for more!

https://twitter.com/gneubig/status/1569319366558105602

Basically, there have been *huge* changes in NLP due to advances BERT and GPT-3. And the skills needed to be a good NLP researcher or engineer have changed too! I've re-designed our assignments to reflect this.

Assignment 1 is now "Build your own BERT", which is a more traditional implementation assignment, building implementation skills and understanding of transformers and the pre-train and fine-tune paradigm.

Read 7 tweets

Graham Neubig

@gneubig

Mar 3, 2022

Retrieval-based models are increasingly important in NLP/QA. But an important factor in modeling text is knowing *where* it came from. Our #ICLR2022 paper proposes retrieval-based LMs considers the "structural locality" of texts to improve retrieval: arxiv.org/abs/2110.02870 🧵↓

We demonstrate this on two example datasets: Wikipedia articles and Java code. We leveraging the article and project structure respectively to define different "locality" levels between two documents.

Our analysis shows that the distance between embeddings, used widely in retrieval tasks, is *not* capturing this locality directly, so further improvements are needed. We do this by learning a function to adjust the distance metric for each locality level in KNN language models.

Read 5 tweets

Graham Neubig

@gneubig

Mar 9, 2020

Super-excited about our new #ICASSP2020 paper on "Universal Phone Recognition with a Multilingual Allophone System" arxiv.org/abs/2002.11800

We create a multi-lingual ASR model that can do zero-shot phone recognition in up to 2,186 languages! How? A little linguistics :) 1/5

In our speech there are phonemes (sounds that can support lexical contrasts in a *particular* language) and their corresponding phones (the sounds that are actually spoken, which are language *independent*). Most multilingual ASR models conflate these two concepts. 2/5

We create a model that first recognizes to language-independent phones, and then converts these phones to language-specific phonemes. This makes our underlying representations of phones more universal and generalizable across languages. 3/5

Read 6 tweets

Share this page!

Enter URL or ID to Unroll

Graham Neubig

Try unrolling a thread yourself!

More from @gneubig

Graham Neubig

Graham Neubig

Graham Neubig

Graham Neubig

Graham Neubig

Graham Neubig

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!