Martin Profile picture
7 Apr, 17 tweets, 7 min read
@PaulBarba_ held a first @joinClubhouse room with @MuazmaZahid in the Data and AI club. The topic was "How to Curate a Good Dataset for NLP?"

There were a lot of interesting questions asked, and at the end of the call many people asked for follow-up notes...
Below is a thread of the room - topic, intro, why this topic, and the 4 main tips. I hope this adds value and that we can do this for more calls in the future. The call itself conveyed a lot more, but I tried to highlight the important bits!
Flagging this for the folks who followed the call: @AiTechDoc @EgboDaniel1 @BabaKirito @gboye_baba @zerotousers @LahijaniAli @talks @yalda2009 @trojkast @JMontoro3 @sroussey - content follows in the thread below
1) Who’s Paul?

Chief Scientist at @Lexalytics, an #NLP company solving text problems for businesses. Paul’s been in the field for ~14 years.
2) Why talk on #Clubhouse in the Data and AI group?

There have been a lot of painfully won lessons over the years dealing with text and learning just how tricky #NLP is. So it's a fun topic to jump into - best practices, tips, and anecdotes, all about curating datasets.
3A) Why this topic?

Probably the single most important decision we make as #NLP practitioners. Things like the algorithm, the deep network, or which language model to use are pretty easy to change at the end of the day - machine time is fairly cheap 99% of the time.
3B) But human time is inherently expensive, and having done annotation work, it's fairly thankless. So when the data doesn't pan out - when it's not useful - it's a shameful waste of human time. So we decided to talk about best practices, especially for supervised learning.
4A) Tip #1

The most important lesson taken away from everything in AI is to start small - as small as you can. Try to train a model as soon as you possibly can. Even 50 examples...
4B) If you want a lot of classes in your NER model, start with one category, get it marked up, and see how it's doing. The earlier you catch an issue, the easier it is to fix. Going bit by bit, even if you are moving fast, is so helpful.
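To make the "start small" tip concrete, here's a minimal sketch (not from the talk) of training a crude keyword-count baseline on a handful of labeled examples just to surface labeling issues early. The sample texts, labels, and function names are all illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_keyword_model(examples):
    """Count word frequencies per label from a tiny labeled sample.

    `examples` is a list of (text, label) pairs -- a stand-in for
    your first ~50 annotations.
    """
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Score each label by summed word counts; highest score wins."""
    scores = {
        label: sum(c[w] for w in text.lower().split())
        for label, c in counts.items()
    }
    return max(scores, key=scores.get)

# Train on a deliberately tiny sample; if even this baseline looks
# confused about your labels, fix the annotation scheme now.
sample = [
    ("battery life is great", "positive"),
    ("screen cracked after a week", "negative"),
    ("great camera and great screen", "positive"),
    ("battery died, terrible", "negative"),
]
model = train_keyword_model(sample)
print(predict(model, "great battery"))  # prints "positive"
```

The point isn't the model quality - it's that a first end-to-end pass over 50 examples exposes annotation problems before you've paid for thousands more labels.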
5A) Tip #2

One of the most important things to ask as you're generating your own datasets: is this really representative of the real world? An anecdote shared here was a data sample for news monitoring categorization that was collected on one single day...
5B) Even though the data volume was huge, not realizing it all came from a single day skewed the trained model substantially. Another example is social media data: Tweets often don't capitalize proper nouns, which trips up NER. You only realize this once you get into the weeds...
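One way to catch a single-day skew like the one described above is to summarize the date coverage of a sample before training. This is a hedged sketch; the field layout and the 0.5 "suspicious share" threshold are my assumptions, not anything from the talk:

```python
from collections import Counter
from datetime import date

def date_coverage(docs):
    """Summarise how many distinct days a document sample covers.

    `docs` is a list of (text, published_date) pairs -- illustrative
    names for whatever metadata your collection pipeline records.
    """
    per_day = Counter(d for _, d in docs)
    return {
        "n_docs": len(docs),
        "n_days": len(per_day),
        # Share of the sample taken by the single busiest day.
        "max_day_share": max(per_day.values()) / len(docs),
    }

docs = [
    ("story a", date(2021, 4, 6)),
    ("story b", date(2021, 4, 6)),
    ("story c", date(2021, 4, 6)),
    ("story d", date(2021, 4, 7)),
]
stats = date_coverage(docs)
if stats["max_day_share"] > 0.5:
    print(f"warning: {stats['max_day_share']:.0%} of docs from one day")
```

A five-line sanity check like this would have flagged the "huge volume, one day" sample immediately.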
6) Tip #3

Track everything. It's so easy when you're recording data to record more information, and as time goes by, whatever you didn't capture is lost forever. E.g. timestamps on data annotations and who tagged which document. This allows for remediation when there are issues.
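A minimal sketch of the "track everything" tip - one record per labeling decision, carrying a timestamp and an annotator id so bad batches can be traced later. The schema and field names are illustrative, not an actual annotation format from the talk:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AnnotationRecord:
    """One row per labeling decision; all field names are illustrative."""
    doc_id: str
    label: str
    annotator: str
    annotated_at: str  # ISO-8601 UTC timestamp, survives export/import

def record(doc_id, label, annotator):
    """Stamp the decision with who made it and when."""
    return AnnotationRecord(
        doc_id=doc_id,
        label=label,
        annotator=annotator,
        annotated_at=datetime.now(timezone.utc).isoformat(),
    )

row = record("doc-42", "sports", "annotator_3")
print(json.dumps(asdict(row)))  # one JSON line per decision
```

If an annotator turns out to have misunderstood the guidelines, these two extra fields let you find and relabel exactly their documents instead of redoing the whole set.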
7A) Tip #4

Assuming you’ve got some sort of budget, set aside resources for the future. Labeled data is the gold standard. There are lots of ways around it - bootstrapping, co-opting data, etc...
7B) The gold standard is up to date examples from your target distribution, examples from what you actually want to be doing well on. The world changes, especially in text. Text is such a hard problem because it’s not static...
7C) We come up with new ways of saying things, text reflects the world around us, and businesses and products change. E.g. in the early 2000s smartphones were barely a thing, and phone reviews talked about completely different things. You want to understand this Δ.
7D) Rather than thinking your #ML product will be fixed forever: there will be weird linguistic changes, there will be things you can’t foresee, but you want to be able to react to them. The way we approach this is bucketing data into train / validate / test sets.
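Since the concern here is text drifting over time, one common way to implement the train / validate / test bucketing is a chronological split, so the test set is always the most recent slice. A minimal sketch - the 70/15/15 proportions are illustrative defaults, not a recommendation from the talk:

```python
def time_split(docs, train=0.7, validate=0.15):
    """Split chronologically: oldest data trains, newest data tests.

    `docs` is a list of (timestamp, text) pairs; sorting by timestamp
    means the model is always evaluated on data newer than anything
    it trained on, which mimics deployment against a changing world.
    """
    docs = sorted(docs)  # oldest first
    n = len(docs)
    a = int(n * train)
    b = int(n * (train + validate))
    return docs[:a], docs[a:b], docs[b:]

# Toy corpus with integer timestamps standing in for dates.
docs = [(i, f"doc {i}") for i in range(20)]
tr, va, te = time_split(docs)
print(len(tr), len(va), len(te))  # prints: 14 3 3
```

A random split would hide drift entirely; a time-ordered one makes the "world changes" problem show up in your test metrics, where you can react to it.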
So these were just the 4 main tips from Paul's talk - there were 20+ live questions we went over. Let me know if you found this useful or have any questions. Looking forward to the next Clubhouse room 👀
