@emilymbender.bsky.social
Prof, Linguistics, UW // Faculty Director, CLMS // she/her // @emilymbender@dair-community.social & bsky // rep by @ianbonaparte

Nov 2, 2017, 7 tweets

I'm currently reading a paper that is claiming lg independence based on a train/test split of the 37 UD 1.2 treebanks.

The paper claims that those treebanks represent 37 languages. They don't. It's 33.

Worse -- one of the doubly represented languages is put once in train (fi) and once in test (fi_ftb).
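A quick sanity check for this kind of overlap can be sketched as below. It assumes treebank codes follow the pattern of a language code optionally followed by `_` and a treebank suffix (as in `fi` vs. `fi_ftb`). The helper names and the toy split are hypothetical illustrations, not the split used in the paper.

```python
def language_of(treebank_code: str) -> str:
    """Return the language part of a treebank code, e.g. 'fi_ftb' -> 'fi'.

    Assumes the language code is everything before the first underscore.
    """
    return treebank_code.split("_", 1)[0]


def shared_languages(train, test):
    """Return the set of languages that occur in both train and test."""
    train_langs = {language_of(code) for code in train}
    test_langs = {language_of(code) for code in test}
    return train_langs & test_langs


# Toy split for illustration only (NOT the actual split from the paper).
train = ["fi", "en", "de"]
test = ["fi_ftb", "cs"]

# Counting treebanks is not the same as counting languages.
n_treebanks = len(train + test)
n_languages = len({language_of(code) for code in train + test})
print(n_treebanks, "treebanks,", n_languages, "languages")

# Any language listed here is present on both sides of the split.
print("languages in both train and test:", shared_languages(train, test))
```

If the second check returns anything other than an empty set, the test treebanks are not all unseen languages.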

Worse -- they claim to have gotten this split from Wang & @adveisner 2016, published in TACL.

Making datasets available to the world is good; making fundamentally flawed datasets available to the world is bad.

And shame on the TACL reviewers for not catching this!

In sum: If you don't understand the data (in #NLProc or any other field), you are not in a position to make valid claims.
