Dan Hollick · Apr 13, 2023 · 26 tweets
How does a Large Language Model like ChatGPT actually work?

Well, they are both amazingly simple and exceedingly complex at the same time.

Hold on to your butts, this is a deep dive ↓
You can think of a model as calculating the probabilities of an output based on some input.

In language models, this means that given a sequence of words, they calculate the probabilities for the next word in the sequence.

Like a fancy autocomplete.
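To make that concrete, here's a minimal sketch of the idea. The vocabulary and scores are invented for illustration, not taken from a real model — the point is just that raw scores become a probability for each possible next word:

```python
import numpy as np

# Hypothetical scores (logits) a model might assign to a tiny, invented
# vocabulary after seeing a prompt like "The cat sat on the".
vocab = ["mat", "dog", "moon", "chair"]
logits = np.array([4.1, 1.2, 0.3, 2.5])

# Softmax turns raw scores into probabilities that sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")   # "mat" ends up with the highest probability

print("prediction:", vocab[int(np.argmax(probs))])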
To understand where these probabilities come from, we need to talk about something called a neural network.

This is a network-like structure where numbers are fed in one side and probabilities are spat out the other.

They are simpler than you might think.
Imagine we wanted to train a computer to solve the simple problem of recognising symbols on a 3x3 pixel display.

We would need a neural net like this:
- an input layer
- two hidden layers
- an output layer
Our input layer consists of 9 nodes called neurons - one for each pixel. Each neuron would hold a number from 1 (white) to -1 (black).

Our output layer consists of 4 neurons, one for each of the possible symbols. Their value will eventually be a probability between 0 and 1.
In between these, we have rows of neurons, called "hidden" layers. For our simple use case we only need two.

Each neuron is connected to the neurons in the adjacent layers by a weight, which can have a value between -1 and 1.
When a value is passed from an input neuron to the next layer, it's multiplied by the weight.

That neuron then simply adds up all the values it receives, squashes the total to between -1 and 1 and passes it to each neuron in the next layer.
The neurons in the final hidden layer do the same, but squash the value to between 0 and 1 before passing it to the output layer.

Each neuron in the output layer then holds a probability, and the highest number is the most probable result.
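Here's a rough numeric sketch of that forward pass. The squashing functions (tanh and a sigmoid), the hidden-layer widths and the final softmax are my assumptions to make it runnable — the thread doesn't pin them down:

```python
import numpy as np

rng = np.random.default_rng(0)

# 9 input pixels, values between -1 (black) and 1 (white)
pixels = np.array([ 1, -1,  1,
                   -1,  1, -1,
                    1, -1,  1], dtype=float)

# Randomly initialised weights: two hidden layers and an output layer of 4.
# The hidden-layer sizes here are illustrative.
w1 = rng.uniform(-1, 1, (9, 6))
w2 = rng.uniform(-1, 1, (6, 6))
w3 = rng.uniform(-1, 1, (6, 4))

# Each layer: weighted sum of its inputs, then squash.
h1 = np.tanh(pixels @ w1)           # squashed between -1 and 1
h2 = 1 / (1 + np.exp(-(h1 @ w2)))   # final hidden layer squashed between 0 and 1
out = h2 @ w3

# Turn the output into probabilities over the 4 symbols.
probs = np.exp(out) / np.sum(np.exp(out))
print(probs, "-> most probable symbol:", int(np.argmax(probs)))
```

With untrained, random weights the probabilities are meaningless — which is exactly the problem training solves.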
When we train this network, we feed it an image we know the answer to and calculate the difference between the answer and the probability the net calculated.

We then adjust the weights to get closer to the expected result.

But how do we know *how* to adjust the weights?
I won't go into detail, but we use clever mathematical techniques called gradient descent and backpropagation to figure out what value for each weight will give us the lowest error.

We keep repeating this process until we are satisfied with the model's accuracy.
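A toy version of that loop, shrunk down to a single weight so the gradient can be written by hand (real training does the same thing across millions of weights via backpropagation):

```python
# One weight, one example, squared error, hand-derived gradient.
x, target = 0.5, 1.0
w = 0.1                 # initial guess
learning_rate = 0.5

for step in range(20):
    prediction = w * x
    error = (prediction - target) ** 2
    gradient = 2 * (prediction - target) * x   # d(error)/d(w)
    w -= learning_rate * gradient              # nudge the weight downhill

print(f"learned weight: {w:.3f}, prediction: {w * x:.3f}")  # approaches 2.0 and 1.0
```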
This simple structure is called a feed-forward neural net - but it won't be enough to tackle the problem of natural language processing.

Instead, LLMs tend to use a structure called a transformer, and it has some key concepts that unlock a lot of potential.
First, let's talk about words.

Instead of each word being an input, we can break words down into tokens which can be words, subwords, characters or symbols.

Notice how they even include the space.
Much like our model represented each pixel value as a number between -1 and 1, these tokens also need to be represented as numbers.

We could just give each token a unique number and call it a day but there's another way we can represent them that adds more context.
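That simple "unique number per token" idea looks something like this — the tokens and ids below are invented for illustration, not from a real tokenizer (real ones, like byte-pair encoding, learn how to split the text):

```python
# Made-up token-to-id table. Note the leading spaces are part of the tokens,
# just like in the example above.
token_ids = {"My": 5183, " favourite": 7791, " colour": 4423, " is": 318, " red": 2266}

# The split is hard-coded here purely for illustration.
tokens = ["My", " favourite", " colour", " is", " red"]

ids = [token_ids[t] for t in tokens]
print(ids)   # [5183, 7791, 4423, 318, 2266]
```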
We can represent each token as a multi-dimensional vector that indicates its relationship to other tokens.

For simplicity, imagine a 2D plane on which we plot the location of words. We want words with similar meanings to be grouped closer together.

This is called an embedding.
Embeddings help create relationships between similar words but they also capture analogies.

For example, the offset between 'dog' and 'puppy' should be roughly the same as the offset between 'cat' and 'kitten'.

We can also create embeddings for whole sentences.
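Here's a minimal sketch with invented 2D vectors, just to show how "closeness" and analogies fall out of the geometry (real models use hundreds or thousands of dimensions):

```python
import numpy as np

# Tiny made-up 2D embeddings, purely illustrative.
emb = {
    "dog":    np.array([2.0, 3.0]),
    "puppy":  np.array([2.2, 2.1]),
    "cat":    np.array([4.0, 3.1]),
    "kitten": np.array([4.2, 2.2]),
}

# Similar words sit close together...
print(np.linalg.norm(emb["dog"] - emb["puppy"]))   # small distance

# ...and analogies show up as similar offsets:
# dog -> puppy points roughly the same way as cat -> kitten.
print(emb["puppy"] - emb["dog"])    # ~ [0.2, -0.9]
print(emb["kitten"] - emb["cat"])   # ~ [0.2, -0.9]
```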
The first part of the transformer is to encode our input words into these embeddings.

Those embeddings are then fed into the next process, called attention, which adds even more context to the embeddings.

Attention is massively important in natural language processing.
Embeddings struggle to capture words with multiple meanings.

Consider the two meanings of 'bank'. Humans derive the correct meaning based on the context of the sentence.

'Money' and 'River' are contextually important to the word bank in each of these sentences.
The process of attention looks back through the sentence for words that provide context to the word 'bank'.

It then re-weights the embedding so that the word 'bank' is semantically closer to 'river' or 'money'.
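A stripped-down sketch of that idea, in the style of scaled dot-product attention. The numbers are invented, and a real transformer first projects each embedding into separate learned query, key and value vectors — here the embeddings are used directly so the mechanism stays visible:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up 4D embeddings for "I deposited money at the bank".
words = ["I", "deposited", "money", "at", "the", "bank"]
emb = np.array([
    [0.1, 0.0, 0.2, 0.1],   # I
    [0.7, 0.1, 0.3, 0.0],   # deposited
    [0.9, 0.1, 0.4, 0.1],   # money
    [0.0, 0.1, 0.1, 0.2],   # at
    [0.1, 0.2, 0.0, 0.1],   # the
    [0.8, 0.2, 0.3, 0.1],   # bank
])

query = emb[words.index("bank")]
scores = emb @ query / np.sqrt(emb.shape[1])   # how relevant is each word to "bank"?
weights = softmax(scores)

# The context-aware embedding for "bank" is a weighted mix of all the
# embeddings - dominated here by "money" and "deposited".
bank_in_context = weights @ emb
print(dict(zip(words, weights.round(2))))
print(bank_in_context.round(2))
```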
This attention process happens many times over to capture the context of the sentence in multiple dimensions.

After all this, the contextualised embeddings are eventually passed into a neural net, like our simple one from earlier, that produces probabilities.
That is a massively simplified version of how an LLM like ChatGPT actually works. There is so much I've left out or skimmed over for the sake of brevity (this is the 20th tweet).

If I left something important out or got some detail wrong, let me know.
I've been researching this thread on/off for ~6 months. Here are some of the best resources:

Obviously this absolute unit of a piece by Stephen Wolfram which goes into much more detail than I have here:

writings.stephenwolfram.com/2023/02/what-i…
For something a bit more approachable, there's this @3blue1brown video series about how neural nets learn.

There are some lovely graphics here, even if you don't follow all the maths.

And @CohereAI's blog has some excellent pieces and videos explaining embeddings, attention and transformers.

txt.cohere.ai/what-is-attent…
txt.cohere.ai/sentence-word-…
There are tons more but that should keep you busy for a while.

If you managed to read all the way down here, you might be heartened to know none of this was written by AI.

I'm told it helps to remind you to retweet the original tweet.
You may now let go of your butts.
You can read the unrolled version here:

typefully.com/DanHollick/yA3…
