Graham Neubig
Oct 14, 2021 · 7 tweets · 3 min read
We've been on a multi-year effort to take steps towards understanding how well NLP/language tech serves people on a *global* scale. Here's a first report: arxiv.org/abs/2110.06733

We perform a meta-analysis of performance across 7 tasks and devise "global utility" metrics. 1/7
The idea is that language tech should serve every person in the world, not just native English speakers. Based on this, we come up with metrics for language-weighted and population-weighted performance that explicitly consider how many people or languages may benefit. 2/7
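To make the two weightings concrete, here is a minimal sketch that assumes each metric is a weighted average of per-language utility scores. The function name `global_utility`, the exponent `tau`, and all the numbers are illustrative assumptions, not taken from the thread; see the paper for the exact definitions.

```python
def global_utility(utility, speakers, tau):
    """Weighted average of per-language utility scores.

    tau = 1 weights each language by its speaker count (population-weighted).
    tau = 0 gives every language equal weight (language-weighted).
    The exponent is an assumption made for this sketch.
    """
    weights = {lang: speakers[lang] ** tau for lang in utility}
    total = sum(weights.values())
    return sum(weights[lang] * utility[lang] for lang in utility) / total


# Hypothetical utilities in [0, 1] and speaker counts (in millions).
utility = {"eng": 0.9, "swa": 0.5, "que": 0.1}
speakers = {"eng": 1000, "swa": 100, "que": 10}

population_weighted = global_utility(utility, speakers, tau=1)
language_weighted = global_utility(utility, speakers, tau=0)

print(f"population-weighted: {population_weighted:.3f}")
print(f"language-weighted:   {language_weighted:.3f}")
```

With these made-up numbers the population-weighted score is dominated by the largest language, while the language-weighted score drops because the two smaller languages count as much as the large one.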
We then collect performance metrics for seven different tasks, and calculate how well each task serves every language or every population. See some of the breakdowns in the attached figure. 3/7
This allows us to approximate how well a technology is serving potential users throughout the world. It also allows us to identify "pain points," languages that seem to be most underserved, based on our priorities with respect to equity of language or population coverage. 4/7
We also discuss some potential reasons behind current inequities, such as the economic or academic incentives that may cause technology for a particular language to be more or less researched. 5/7
This is a tremendously difficult problem, and the current paper just scratched the surface (with many simplifying assumptions). Nonetheless we (@blasi_lang, @anas_ant, and me) hope this can start a dialog and focus attention/effort on improving technologies globally. 6/7
The overall project has just started and we would definitely love feedback and/or contributions from the broader community! 7/7


More from @gneubig

Dec 19
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?

In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
Why is this benchmark important?

Right now it is unclear how effective AI is at helping with real-world work. We hear extreme statements like:

> AI is overhyped, minimally helpful, and doesn’t generalize to new tasks
> AGI will automate all human work in the next few years
This question is hard to answer, but it has implications for:
- Companies: to understand where to incorporate AI in workflows
- Workers: to get a grounded sense of what AI can and cannot do
- Policymakers: to understand effects of AI on the labor market

How can we begin on it?
Read 13 tweets
Dec 19, 2023
Google’s Gemini recently made waves as a major competitor to OpenAI’s GPT. Exciting! But we wondered:

How good is Gemini really?

At CMU, we performed an impartial, in-depth, and reproducible study comparing Gemini, GPT, and Mixtral.

Paper:
🧵 arxiv.org/abs/2312.11444
We compared accuracy across 6 different varieties of tasks:
* Knowledge-based QA (MMLU)
* Reasoning (BIG-Bench Hard)
* Math (GSM8k, SVAMP, ASDIV, MAWPS)
* Code Gen (HumanEval, ODEX)
* Translation (FLORES)
* Web Instruction Following (WebArena)
We tried to control for all variables, using the same prompts, generation params, and evals for all models for fairness. We used:
* @LiteLLM to query models in a uniform way
* @try_zeno to do comprehensive in-depth analysis
All code/data available here: github.com/neulab/gemini-…

Read 12 tweets
May 18, 2023
There are so many chatbots nowadays, it’s hard to keep up!

To help out, we made an open source tool for automatic comparison of chatbots, and created a report on LLaMa, Alpaca, Vicuna, ChatGPT, Cohere, etc.!

Report: github.com/zeno-ml/zeno-b…
Browser: zeno-ml-chatbot-report.hf.space

🧵⬇️
Our new tool, “Zeno Build” (github.com/zeno-ml/zeno-b…), aims to make it easier to build and evaluate systems using LMs, and includes:

* Interfaces to various open-source and API-based models
* Automatic evaluation of the responses
* Visualization and fine-grained analysis
To compare chatbots, we put the following models head-to-head:

@OpenAI GPT-2
@MetaAI LLaMa
@stanfordnlp Alpaca
@lmsysorg Vicuna
@MosaicML MPT
@OpenAI gpt-3.5-turbo
@CohereAI command-xlarge
Read 14 tweets
Dec 15, 2022
CMU Advanced NLP is done for 2022! Check the videos on YouTube 😃

I also overhauled our assignments to reflect important skills in NLP for 2022: github.com/neubig/nlp-fro…
If you're teaching/learning NLP see the 🧵 and doc for more!
Basically, there have been *huge* changes in NLP due to advances such as BERT and GPT-3. And the skills needed to be a good NLP researcher or engineer have changed too! I've re-designed our assignments to reflect this.
Assignment 1 is now "Build your own BERT", which is a more traditional implementation assignment, building implementation skills and understanding of transformers and the pre-train and fine-tune paradigm.
Read 7 tweets
Mar 3, 2022
Retrieval-based models are increasingly important in NLP/QA. But an important factor in modeling text is knowing *where* it came from. Our #ICLR2022 paper proposes retrieval-based LMs that consider the "structural locality" of texts to improve retrieval: arxiv.org/abs/2110.02870 🧵↓
We demonstrate this on two example datasets: Wikipedia articles and Java code. We leverage the article and project structure, respectively, to define different "locality" levels between two documents.
Our analysis shows that the distance between embeddings, used widely in retrieval tasks, is *not* capturing this locality directly, so further improvements are needed. We do this by learning a function to adjust the distance metric for each locality level in KNN language models.
Read 5 tweets
Mar 9, 2020
Super-excited about our new #ICASSP2020 paper on "Universal Phone Recognition with a Multilingual Allophone System" arxiv.org/abs/2002.11800

We create a multi-lingual ASR model that can do zero-shot phone recognition in up to 2,186 languages! How? A little linguistics :) 1/5
In our speech there are phonemes (sounds that can support lexical contrasts in a *particular* language) and their corresponding phones (the sounds that are actually spoken, which are language *independent*). Most multilingual ASR models conflate these two concepts. 2/5
We create a model that first recognizes language-independent phones, and then converts these phones to language-specific phonemes. This makes our underlying representations of phones more universal and generalizable across languages. 3/5
Read 6 tweets
