Jan 25 · 16 tweets · 4 min read
Word and sentence embeddings are the bread and butter of language models. 📖

Here is a very simple introduction to embeddings. (Thread)
The quintessential task of natural language processing (NLP) is to understand human language.

However, there is a big disconnect — humans speak in words and sentences, but computers only understand and process numbers.
To solve this problem, we can use embeddings.

An assignment of words to lists of numbers is called a word embedding.

Think of it as an assignment of scores to the words.
Let’s take an example.

Below, we have 12 words located on a plane: Banana, Basketball, Bicycle, Building, Car, Castle, Cherry, House, Soccer, Strawberry, Tennis, Truck.

Now, the question is, out of A, B, C, and D, where would you locate the word “Apple” in this plane?
A reasonable answer is point C, because it would make sense to have the word “Apple” close to the words “Banana,” “Strawberry,” and “Cherry,” and far from the other words, such as “House,” “Car,” or “Tennis.”

This is precisely a word embedding.
In this example, a word's embedding is its pair of horizontal and vertical coordinates on the plane.

In this way, the word “Apple” is assigned the pair of numbers [5,5] (called a vector), and the word “Bicycle” the vector [5,1].
Here are some properties that a nice word embedding should have:

- Words that are similar should correspond to points that are close by.
- Words that are different should correspond to points that are far away.
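The two properties above can be checked numerically. Here is a minimal sketch in Python using made-up 2-D coordinates read off the example plane — the exact numbers are illustrative, not from a real model:

```python
import math

# Hypothetical 2-D embeddings, roughly matching the plane in the example.
embedding = {
    "Apple":      (5, 5),
    "Banana":     (6, 5),
    "Strawberry": (5, 4),
    "Bicycle":    (5, 1),
    "House":      (1, 2),
}

def distance(w1, w2):
    """Euclidean distance between two embedded words."""
    (x1, y1), (x2, y2) = embedding[w1], embedding[w2]
    return math.hypot(x2 - x1, y2 - y1)

# Similar words land close together...
print(distance("Apple", "Banana"))  # small
# ...and different words land far apart.
print(distance("Apple", "House"))   # large
```

With these coordinates, "Apple" sits 1 unit from "Banana" but 5 units from "House" — exactly the close/far behavior a good embedding should have.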
In this example, the word embedding has two dimensions, represented by the two axes.

A good word embedding, though, would contain many more dimensions.

The Cohere large embedding model, for example, has 4096 dimensions.
More importantly, the Cohere model produces not only word-level but also sentence-level embeddings.

Why is this important?

Take one example. Say we have word embeddings for these 4 words:

- No: [1,0,0,0]
- I: [0,2,0,0]
- Am: [-1,0,1,0]
- Good: [0,0,1,3]
Now, how would one represent, say, a whole sentence?

Well, here’s an idea. How about the sum of the scores of all the words?

Then “No, I am good!” corresponds to the vector [0,2,2,3].

However, “I am no good” will also correspond to the vector [0,2,2,3].
This is a problem: the computer represents these two sentences identically, even though they mean roughly opposite things!
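Here is the sum-of-scores idea in a few lines of Python, using the four word vectors above. The tokenization is a deliberately crude sketch:

```python
# Word embeddings from the example above.
vec = {
    "no":   [1, 0, 0, 0],
    "i":    [0, 2, 0, 0],
    "am":   [-1, 0, 1, 0],
    "good": [0, 0, 1, 3],
}

def sum_embedding(sentence):
    """Naive sentence vector: sum of the word vectors. Word order is lost."""
    words = sentence.lower().replace(",", "").replace("!", "").split()
    total = [0, 0, 0, 0]
    for w in words:
        total = [t + v for t, v in zip(total, vec[w])]
    return total

a = sum_embedding("No, I am good!")
b = sum_embedding("I am no good")
print(a, b)  # both [0, 2, 2, 3] — the sum cannot tell the sentences apart
```

Because addition is order-insensitive, any reshuffling of the same words collapses to the same vector — which is precisely the failure the thread describes.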
This is where sentence embeddings come into play.

A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers in a coherent way.
Similar sentences are assigned to similar vectors, and different sentences are assigned to different vectors.

Each of the coordinates of the vector identifies some (whether clear or obscure) property of the sentence.
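Similarity between sentence vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional sentence vectors — real ones would come from an embedding model — to show what "similar sentences get similar vectors" looks like in code:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 = very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical sentence embeddings (made-up numbers, not real model output).
s1 = [0.9, 0.1, -0.3]    # "No, I am good!"
s2 = [-0.8, 0.2, 0.4]    # "I am no good"
s3 = [0.85, 0.05, -0.25] # "I'm fine, thanks."

print(cosine_similarity(s1, s3))  # high: similar meaning
print(cosine_similarity(s1, s2))  # negative: opposite meaning
```

Unlike the naive sum, a proper sentence embedding assigns the two opposite sentences very different vectors, so their cosine similarity is low.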
The Cohere embedding does just this.

Below are nine sample queries from the Airline Travel Information Systems (ATIS) dataset.

And here is a heatmap of a compressed version of their embeddings (4096 dimensions compressed down to 10).
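As a rough illustration of that compression step, here is one simple way to go from 4096 numbers per query down to 10: a random projection. The thread doesn't say how the heatmap's compression was actually done, and the embedding values below are random stand-ins for real model output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the nine query embeddings: random numbers in place of
# real 4096-dim vectors from an embedding model.
embeddings = rng.normal(size=(9, 4096))

# A random projection maps each 4096-dim vector to a 10-dim one while
# roughly preserving distances between the rows.
projection = rng.normal(size=(4096, 10)) / np.sqrt(10)
compressed = embeddings @ projection

print(compressed.shape)  # (9, 10): one 10-number row per query, as in the heatmap
```

Each of the nine rows of `compressed` would then become one row of the heatmap.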
Look at the three highlighted sentences. They talk about similar things: queries about ground transportation.

Now, their embedding values are also similar to each other and distinct from the rest (note the color shading).

That is what a sentence embedding should do.