Jan 25 · 16 tweets · 4 min read
Word and sentence embeddings are the bread and butter of language models. 📖

Here is a very simple introduction to embeddings. (Thread)
The quintessential task of natural language processing (NLP) is to understand human language.

However, there is a big disconnect — humans speak in words and sentences, but computers only understand and process numbers.
To solve this problem, we can use embeddings.

An assignment of words to lists of numbers is called a word embedding.

Think of it as an assignment of scores to the words.
Let’s take an example.

Below, we have 12 words located on a plane: Banana, Basketball, Bicycle, Building, Car, Castle, Cherry, House, Soccer, Strawberry, Tennis, Truck.

Now, the question is, out of A, B, C, and D, where would you locate the word “Apple” in this plane?
A reasonable answer is point C, because it would make sense to have the word “Apple” close to the words “Banana,” “Strawberry,” and “Cherry,” and far from the other words, such as “House,” “Car,” or “Tennis.”

This is precisely a word embedding.
In this example, a word's embedding is its pair of horizontal and vertical coordinates on the plane.

In this way, the word “Apple” is assigned the pair of numbers [5,5] (called a vector), and the word “Bicycle” the vector [5,1].
Here are some properties that a nice word embedding should have:

- Words that are similar should correspond to points that are close by.
- Words that are different should correspond to points that are far away.
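The two properties above can be checked numerically. Here is a minimal sketch in Python using made-up 2-D coordinates read off the example plane — the exact numbers are illustrative, not from a real model:

```python
import math

# Hypothetical 2-D embeddings, roughly matching the plane in the example.
embedding = {
    "Apple":      (5, 5),
    "Banana":     (6, 5),
    "Strawberry": (5, 4),
    "Bicycle":    (5, 1),
    "House":      (1, 2),
}

def distance(w1, w2):
    """Euclidean distance between two embedded words."""
    (x1, y1), (x2, y2) = embedding[w1], embedding[w2]
    return math.hypot(x2 - x1, y2 - y1)

# Similar words land close together...
print(distance("Apple", "Banana"))  # small
# ...and different words land far apart.
print(distance("Apple", "House"))   # large
```

With these coordinates, "Apple" sits 1 unit from "Banana" but 5 units from "House" — exactly the close/far behavior a good embedding should have.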
In this example, the word embedding has two dimensions, represented by the two axes.

A good word embedding, though, would contain many more dimensions.

The Cohere large embedding model, for example, has 4096 dimensions.
More importantly, the Cohere model produces not only word-level but also sentence-level embeddings.

Why is this important?

Take one example. Say we have word embeddings for these 4 words:

- No: [1,0,0,0]
- I: [0,2,0,0]
- Am: [-1,0,1,0]
- Good: [0,0,1,3]
Now, how would one represent, say, a whole sentence?

Well, here’s an idea. How about the sum of the scores of all the words?

Then “No, I am good!” corresponds to the vector [0,2,2,3].

However, “I am no good” will also correspond to the vector [0,2,2,3].
This is a problem: the computer represents these two sentences identically, even though they mean roughly opposite things!
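Here is the sum-of-scores idea in a few lines of Python, using the four word vectors above. The tokenization is a deliberately crude sketch:

```python
# Word embeddings from the example above.
vec = {
    "no":   [1, 0, 0, 0],
    "i":    [0, 2, 0, 0],
    "am":   [-1, 0, 1, 0],
    "good": [0, 0, 1, 3],
}

def sum_embedding(sentence):
    """Naive sentence vector: sum of the word vectors. Word order is lost."""
    words = sentence.lower().replace(",", "").replace("!", "").split()
    total = [0, 0, 0, 0]
    for w in words:
        total = [t + v for t, v in zip(total, vec[w])]
    return total

a = sum_embedding("No, I am good!")
b = sum_embedding("I am no good")
print(a, b)  # both [0, 2, 2, 3] — the sum cannot tell the sentences apart
```

Because addition is order-insensitive, any reshuffling of the same words collapses to the same vector — which is precisely the failure the thread describes.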
This is where sentence embeddings come into play.

A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers in a coherent way.
Similar sentences are assigned to similar vectors, and different sentences are assigned to different vectors.

Each of the coordinates of the vector identifies some (whether clear or obscure) property of the sentence.
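Similarity between sentence vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional sentence vectors — real ones would come from an embedding model — to show what "similar sentences get similar vectors" looks like in code:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 = very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical sentence embeddings (made-up numbers, not real model output).
s1 = [0.9, 0.1, -0.3]    # "No, I am good!"
s2 = [-0.8, 0.2, 0.4]    # "I am no good"
s3 = [0.85, 0.05, -0.25] # "I'm fine, thanks."

print(cosine_similarity(s1, s3))  # high: similar meaning
print(cosine_similarity(s1, s2))  # negative: opposite meaning
```

Unlike the naive sum, a proper sentence embedding assigns the two opposite sentences very different vectors, so their cosine similarity is low.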
The Cohere embedding does just this.

Below are nine sample queries from the Airline Travel Information Systems (ATIS) dataset.

And here is a heatmap of a compressed version of their embeddings (4096 dimensions compressed down to 10).
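As a rough illustration of that compression step, here is one simple way to go from 4096 numbers per query down to 10: a random projection. The thread doesn't say how the heatmap's compression was actually done, and the embedding values below are random stand-ins for real model output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the nine query embeddings: random numbers in place of
# real 4096-dim vectors from an embedding model.
embeddings = rng.normal(size=(9, 4096))

# A random projection maps each 4096-dim vector to a 10-dim one while
# roughly preserving distances between the rows.
projection = rng.normal(size=(4096, 10)) / np.sqrt(10)
compressed = embeddings @ projection

print(compressed.shape)  # (9, 10): one 10-number row per query, as in the heatmap
```

Each of the nine rows of `compressed` would then become one row of the heatmap.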
Look at the three highlighted sentences. They talk about similar things: queries about ground transportation.

Now, their embedding values are also similar to each other and distinct from the rest (note the color shading).

That is what a sentence embedding should do.