Delip Rao (@deliprao), 15 tweets
Stopwords are sometimes called “non-content words”. That notion holds only in certain situations, e.g., topic classification. In many other situations, the stopwords *are* the most informative content, e.g., authorship attribution. But something else is going on here. #nlproc
First, let’s understand why someone might want to eliminate stopwords. The origins of this practice lie in information retrieval (IR), where it became common to drop high-frequency words (stopwords) while building inverted indices.
Why? The word “the”, for example, might appear in every document; likewise “a”, “an”, and so on. As a consequence, the inverted index blows up in size, and not just the construction cost but also the retrieval cost goes up. The simple solution from the 70s: just drop the high-frequency words.
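To make that concrete, here is a minimal sketch with a toy corpus and a toy stop list (not how production IR systems are implemented):

```python
# Toy sketch of an inverted index; real IR systems are far more involved.
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "to", "is", "on"}  # toy stop list

def build_inverted_index(docs, drop_stopwords=False):
    index = defaultdict(set)  # term -> set of doc ids (the postings)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if drop_stopwords and term in STOPWORDS:
                continue
            index[term].add(doc_id)
    return index

docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a cat is a cat",
]
full = build_inverted_index(docs)
pruned = build_inverted_index(docs, drop_stopwords=True)
print(sorted(full["the"]))                   # [0, 1]: "the" posts to most docs
print(sum(len(p) for p in full.values()))    # 12 postings in total
print(sum(len(p) for p in pruned.values()))  # 7 postings after pruning
```

Even on three tiny documents, the high-frequency words account for a big chunk of the postings; at web scale, that gap is what the 70s trick was buying back.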
Aside: the term “stop word” comes from “stop list”, also an IR term. Google it on your own.
This worked in IR because high-frequency words are not content words, and the topical focus of the IR task makes stopword elimination a reasonable option. While this was true historically, it’s no longer the case even in IR (we will come back to that).
But now let’s get back to the land of #nlproc. Historically, many IR researchers did NLP and vice versa (e.g., Susan Dumais, Ken Church). So the practice of eliminating stopwords came to NLP very early on.
For the early NLP models, eliminating stopwords made sense. It was a way to handle the curse of dimensionality. If you read the early works, it becomes apparent that these researchers knew what they were doing; they were not simply eliminating stopwords as a ritual.
So if you were doing bag-of-words classification and your task happened to be topic classification, then eliminating stopwords might help, as in the sketch below.
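A hedged sketch of that setup, with made-up documents and labels, using scikit-learn’s built-in English stop list:

```python
# Toy bag-of-words topic classifier; documents and labels are invented
# purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "the match went to extra time",
    "the senate passed the bill",
    "a striker scored the goal",
    "the committee voted on the law",
]
labels = ["sports", "politics", "sports", "politics"]

# stop_words="english" makes the vectorizer drop scikit-learn's built-in
# English stop list before counting, shrinking the feature space.
clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["a striker scored in extra time"]))  # ['sports']
```

Now let’s get back to Chris’s tweet (OP).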
The reason Chris got an increase in performance has to do with transfer learning (used in the linked blog post).
The LM (a sequence model) used in ULMFiT was trained on an English corpus with stopwords intact. So by throwing away the stopwords you’re creating (or worsening) a covariate shift.
When the stopwords are retained, that covariate shift is avoided and the (RNN) model’s performance improves. So does that mean we should never eliminate stopwords? Not necessarily.
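To see the shift concretely, here’s a minimal sketch (the sentence is invented, and NLTK’s English stop list is just one common choice of filter, not necessarily the one the linked post used):

```python
# Minimal sketch of the covariate shift: the pretrained LM saw fluent
# English, but stopword-stripped inputs look like a different language.
# Assumes NLTK's stop list data is available (nltk.download("stopwords")).
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

pretraining_style = "the movie was not as good as the book"
finetuning_style = " ".join(
    w for w in pretraining_style.split() if w not in STOP
)
print(finetuning_style)  # "movie good book": no longer fluent English,
                         # and the negation ("not") is gone with it
```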

Homework: Think of a few situations where you might *want* to eliminate stopwords in 2018.
Now let’s circle back to IR. Eliminating stopwords to conserve memory and compute made sense historically, when both were expensive. Today, most web search engines don’t eliminate stopwords. Even in IR, eliminating stopwords can cause undesirable effects: consider searching for “to be or not to be”, a query made entirely of stopwords.
Unrelated: if you use Elastic or Lucene for a search product, you might have unintentionally enabled stopword elimination, because many default analyzers do that. It might be worth examining them.
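For example, here is a sketch of how you might audit that, assuming a local Elasticsearch at localhost:9200 (the _analyze endpoint and the built-in "english" analyzer are standard Elasticsearch features; note the plain "standard" analyzer does not drop stopwords, but the language analyzers do):

```python
# Sketch of auditing an analyzer for stopword removal; assumes a local
# Elasticsearch instance is running at localhost:9200.
import requests

resp = requests.get(
    "http://localhost:9200/_analyze",
    json={"analyzer": "english", "text": "to be or not to be"},
)
print([t["token"] for t in resp.json()["tokens"]])
# []: every word in the query is on the default English stop list
```

If that comes back empty for queries your users actually issue, it’s time to customize the analyzer.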
Unrelated: the linked post is not a particularly good example of experimentation, and beginners may pick up bad habits from it. (contd)
For example, it compares 1 epoch of a model trained from scratch with 1 epoch of a fine-tuned model. There are more issues, but we digress from our main topic of stopwords.