We suffered through curating and analysing thousands of benchmarks -- to better understand the (mis)measurement of AI! 📏🤖🔬
We cover all of #NLProc and #ComputerVision.
Now live at @NatureComms! nature.com/articles/s4146…
1/
Benchmarks are crucial to measuring and steering AI progress.
Their sheer number has become astounding.
Each has unique patterns of activity, improvement and eventual stagnation/saturation. Together they form the intricate story of global progress in AI. 🌐
2/
We found that a sizable portion of benchmarks has reached saturation ("can't get better than this") or stagnation ("could get better, but we don't know how / nobody tries"). But there are still plenty of dynamic benchmarks as well!
3/
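To make the saturated/stagnant/dynamic distinction concrete, here is a minimal sketch of how such a classification *could* work on a benchmark's state-of-the-art trajectory. The thresholds and the function itself are illustrative assumptions, not the criteria actually used in the paper:

```python
# Hypothetical classifier for a benchmark's lifecycle state.
# Thresholds (near_ceiling, stale_years) are made-up illustrations,
# NOT the paper's actual operationalisation.

def classify_benchmark(sota, current_year, ceiling=100.0,
                       near_ceiling=0.99, stale_years=3):
    """sota: list of (year, best_score) pairs, scores in [0, ceiling]."""
    sota = sorted(sota)
    best = max(score for _, score in sota)
    if best >= near_ceiling * ceiling:
        return "saturated"   # "can't get better than this"
    # Find the last year the state of the art actually improved.
    last_improvement, running_best = sota[0]
    for year, score in sota[1:]:
        if score > running_best:
            running_best = score
            last_improvement = year
    if current_year - last_improvement >= stale_years:
        return "stagnant"    # "could get better, but nobody does"
    return "dynamic"

print(classify_benchmark([(2018, 72.0), (2019, 80.5), (2020, 84.2)], 2024))
# → "stagnant": the SOTA last improved in 2020
```

The design point: saturation is a property of the score level (near the ceiling), while stagnation is a property of the score *dynamics* (no recent improvement), which is why the two need separate checks.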
How do activity and improvement develop over time and across domains? We mapped all benchmarks into an #RDF #KnowledgeGraph / ontology and devised a novel, highly condensed visualisation method.
Darker green means steeper progress. More chaotic than expected!
4/
... And the lifecycle maps show the birth, life and death of benchmarks. 🌈☠️
5/
Speaking of the life of a benchmark: It's hard. Most benchmark datasets are unpopular. 🥲
6/
How to become more popular (as a benchmark dataset)? Traits correlated with popularity:
- be versatile (cover more tasks, have more sub-benchmarks)
- have a dedicated leaderboard
- be created by people from top institutions
7/
The biggest obstacle and limitation for our work is data availability.
This analysis was only possible thanks to data from the fabulous @paperswithcode. As a community, we should do more to incentivize depositing results there! Lots of potential added value.
8/
Shoutouts to co-authors: @nomisto_ @DrAdriBarbosa @JanMBrauner Kathrin Blagec