Delip Rao · Sep 6 · 25 tweets · 8 min read
Language Models have taken #NLProc by storm. Even if you don’t directly work in NLP, you have likely heard of, and possibly used, language models. But have you ever wondered who came up with the term “Language Model”? Recently I went on that quest, and I want to take you along with me. 🧶
I am teaching a graduate-level course on language models and transformers at @ucsc this Winter, and out of curiosity, I wanted to find out who coined the term “Language Model”.
First, I was a bit ashamed that I did not know this after all these years in NLP. Surely it must be in every NLP textbook, right? Wrong! I checked every NLP textbook I could get my hands on, and all of them define what an LM is without giving any provenance for the term.
Here's an example from the 3rd edition of @jurafsky and Martin. Other books do it similarly.
I also quickly skimmed some popular NLP courses, including ones from Johns Hopkins, Stanford, Berkeley, etc., and they don't cover it either. You might think this detail is unimportant and that its omission is therefore justified.
Normally, I would agree that there's no point in learning random historical facts for their own sake, unless the provenance helps us better understand what's being examined. So on we go with the quest! 🐈
Now, everyone knows Shannon worked out the entropy of the English language as a means to his end -- developing a theory of optimal communication.
His 1948 paper is very readable, but there is no mention of the term "Language Model" in it or in any of his follow-up works. people.math.harvard.edu/~ctm/home/text…
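(Aside, and entirely my own sketch rather than anything in Shannon's paper: the per-character entropy he was after can be crudely bounded from raw letter frequencies, which give roughly 4 bits per character; Shannon's prediction experiments with humans and longer context pushed the estimate down to around 1 bit.)

```python
# Toy sketch (mine, not Shannon's method): a crude unigram estimate of
# per-character entropy, H = -sum_c p(c) * log2 p(c). Letter frequencies
# alone overestimate the true entropy of English, which context brings
# down to roughly 1 bit per character in Shannon's experiments.
from collections import Counter
import math

def unigram_entropy_bits(text: str) -> float:
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(unigram_entropy_bits("the quick brown fox jumps over the lazy dog"))  # ~4.4 bits/char
```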
The next allusion to a language model came in 1958 from @ChomskyDotInfo in his famous paper on the "three models for the description of language", where he calls such models "finite state Markov processes".
Chomsky was too caught up in grammar to think much of these finite state Markov processes, and he never used the term "Language Model" for them either. chomsky.info/wp-content/upl…
I also knew @jhuclsp had held a famous 1995 workshop on adding syntax to language models ("LM95"). Going through the proceedings, it is clear the term "Language Model" was already in common use by then. So the term must have originated somewhere between 1958 and 1995.
After some digging, I finally landed on this 1976 paper: citeseerx.ist.psu.edu/viewdoc/downlo…
I feel quite confident that this is the first source: Fred Jelinek was likely the first person to use the term "language model" in the scientific literature. It appears in this paragraph, where the term shows up in italics:
The paper itself is a landmark. It was the first to describe the modern pre-deep-learning ASR engine, and the architecture described there is still used in countless deployments even today. Fred lays out the architecture as follows:
The “Linguistic Decoder” takes a phone sequence and returns scored sequences of words.
Later in the text, Fred clearly points out that the language model (used for the first time, in italics) is just one of the ways to do this linguistic decoding.
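To make that concrete, here is a toy sketch of my own (the candidate transcriptions, scores, and function names are all invented for illustration, not taken from the paper): the decoder combines an acoustic score P(A|W) with the language model's prior P(W), which factorizes over the word sequence via the chain rule.

```python
# Toy sketch of noisy-channel decoding (illustrative only; all names and
# numbers below are made up): pick the word sequence W maximizing
# log P(A|W) + log P(W), where P(W) comes from the language model.
import math

def lm_log_prob(words, bigram_logp, start="<s>"):
    """Chain rule with a bigram approximation: log P(W) = sum_i log P(w_i | w_{i-1})."""
    logp, prev = 0.0, start
    for w in words:
        logp += bigram_logp.get((prev, w), math.log(1e-6))  # crude floor for unseen bigrams
        prev = w
    return logp

def decode(candidates, acoustic_logp, bigram_logp):
    """The 'linguistic decoder': rescore candidate transcriptions, return the best."""
    return max(candidates, key=lambda W: acoustic_logp[W] + lm_log_prob(W, bigram_logp))

# Two hypothetical candidates for the same phone sequence.
candidates = [("recognize", "speech"), ("wreck", "a", "nice", "beach")]
acoustic_logp = {candidates[0]: -12.0, candidates[1]: -11.5}  # acoustics slightly prefer #2
bigram_logp = {("<s>", "recognize"): -4.0, ("recognize", "speech"): -2.0,
               ("<s>", "wreck"): -7.0, ("wreck", "a"): -3.0,
               ("a", "nice"): -3.5, ("nice", "beach"): -5.0}
print(decode(candidates, acoustic_logp, bigram_logp))  # the LM prior tips it to #1
```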
Footnote #17 is a dig at Chomsky, where Fred is basically saying, “don’t read too much into LMs. They just define the probability of a sequence. There’s nothing linguistic about it.”
This footnote is actually quite important. Today, with large language models sweeping NLP tasks across the board with their performance, it might be understandable to think LMs "understand" language.
@Smerity dug out this gem for me from 5 years ago (yes, I have friends like that) where I contradict that. Is he right?
If you are in the Fred Jelinek camp, language modeling per se is not an NLP problem. That LMs are used to solve NLP problems is a consequence of the language modeling objective, much like how combustion-engine science has nothing to do with flying, yet jet engines rely on it.
In this world of neural LMs, I would define Language Models as intelligent representations of language. I use “intelligence” here in a @SchmidhuberAI-style compression-is-intelligence manner. arxiv.org/abs/0812.4360
Pretext tasks, like the masked language modeling (MLM) task, capture various aspects of language, and reducing the cross-entropy loss is equivalent to maximizing compression, i.e., “intelligence”.
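A minimal sketch of that equivalence (my illustration, with made-up probabilities): the cross-entropy of a model on text, measured in bits per token, is the average code length an entropy coder driven by that model would need, so a lower loss literally means the text compresses into fewer bits.

```python
# Sketch (not from the thread): cross-entropy in bits per token equals the
# average code length an entropy coder driven by the model would spend,
# so minimizing the loss maximizes compression.
import math

def bits_per_token(token_probs):
    """token_probs: probabilities the model assigned to the tokens that actually occurred."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities from a weaker and a stronger model.
weak_model   = [0.05, 0.10, 0.02, 0.08]
strong_model = [0.30, 0.50, 0.10, 0.40]
print(bits_per_token(weak_model))    # ~4.2 bits/token -> longer compressed text
print(bits_per_token(strong_model))  # ~1.9 bits/token -> shorter compressed text
```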
So, to land this thread: knowing where terms come from matters. It lets us be precise when the surface meaning of a term admits multiple interpretations. Next time you want to cite something canonical for LMs, don’t forget (Jelinek, 1978)! 🧶🐈
PS: More AI folks should know about Fred Jelinek. He's a giant on whose shoulders many of us stand. web.archive.org/web/2016122819…
Correction: one of my earlier tweets refers to the paper in question as (Jelinek, 1978), but as evident from the BibTeX entry, it’s (Jelinek, 1976).
