Akshay 🚀
Jun 9, 2023
LLMs are everywhere, but do you know how they generate text❓

Let's take the magic out of it and break things down to first principles!

Today I'll explain what conditional probability is and how it relates to LLMs!

A Thread 🧵👇
Before diving into LLMs, let's understand conditional probability.

We consider a population of 14 individuals:

- Some of them like Tennis 🎾
- Some like Football ⚽️
- A few like both 🎾 ⚽️
- And a few like neither

Here's how it looks 👇
So what is Conditional probability ⁉️

It's a measure of the probability of an event given that another event has occurred.

If the events are A and B, we denote this as P(A|B).

This reads as "probability of A given B"

Check this illustration 👇
For instance, if we're predicting whether it will rain today (event A), knowing that it's cloudy (event B) might impact our prediction.

As it's more likely to rain when it's cloudy, we'd say the conditional probability P(A|B) is high.

That's conditional probability for you! 🎉
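To make the arithmetic concrete, here's a tiny sketch using the 14-person population from above. The thread doesn't give the exact counts of who likes what, so the numbers below are assumptions chosen purely for illustration:

```python
# Conditional probability on the 14-person example.
# NOTE: the exact counts are illustrative assumptions, not from the thread.
total = 14
likes_tennis = 6        # 🎾 (assumed)
likes_football = 8      # ⚽️ (assumed)
likes_both = 3          # 🎾 ⚽️ (assumed)

# P(A|B) = P(A and B) / P(B)
p_football = likes_football / total
p_tennis_and_football = likes_both / total

p_tennis_given_football = p_tennis_and_football / p_football
print(f"P(tennis | football) = {p_tennis_given_football:.2f}")  # 3/8 = 0.38
```

In words: if we already know someone likes football, the chance they also like tennis is 3 out of 8, not 6 out of 14. Conditioning on B shrinks the world we reason about.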
Now, how does this apply to LLMs like GPT-4❓

These models are tasked with predicting the next word in a sequence.

This is a question of conditional probability: given the words that have come before, what is the most likely next word?
To predict the next word, the model calculates the conditional probability for each possible next word, given the previous words (context).

In the simplest (greedy) decoding strategy, the word with the highest conditional probability is chosen as the prediction; in practice the next word is often sampled from this distribution instead.
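Here's a rough sketch of that last step: the model outputs a score (logit) for every word in its vocabulary, and a softmax turns those scores into a conditional distribution. The vocabulary and numbers below are made up for illustration:

```python
import numpy as np

# Toy next-word prediction for the context "The cat sat on the ..."
vocab  = ["mat", "dog", "moon", "sofa"]
logits = np.array([4.1, 0.5, -1.2, 2.3])   # made-up model scores

# Softmax turns logits into P(next word | context)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"P({word!r} | context) = {p:.3f}")

# Greedy decoding: pick the highest-probability word
print("prediction:", vocab[int(np.argmax(probs))])
```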
The LLM learns a high-dimensional probability distribution over sequences of words.

And the parameters of this distribution are the trained weights!
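Concretely, "a distribution over sequences" factorizes via the chain rule of probability (a standard identity, independent of any particular model):

P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})

Each factor on the right is exactly the next-word conditional probability the model learns to predict.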

The training, or rather pre-training, is self-supervised: the labels (the next words) come straight from the raw text, with no human annotation needed.

I'll talk about the different training steps next time!

Check this 👇
Hopefully, this thread has demystified a bit of the magic behind LLMs and the concept of conditional probability.

If you want to learn more about building with LLMs, @LightningAI has some great resources on this!

Check this out👇
lightning.ai/pages/blog/
That's a wrap!

If you're interested in:

- Python 🐍
- Data Science 📈
- Machine Learning 🤖
- MLOps 🛠
- NLP 🗣
- Computer Vision 🎥
- LLMs 🧠

I'm sharing daily content over here, follow me → @akshay_pachaar if you haven't already!!

Cheers! 🥂


More from @akshay_pachaar

Sep 7
8 key skills to become a full-stack AI Engineer:
Production-grade AI systems demand deep understanding of how LLMs are engineered, deployed, and optimized.

Here are the 8 pillars that define serious LLM development:

Let's dive in! 🚀
1️⃣ Prompt engineering

Prompt engineering is far from dead!

The key is to craft structured prompts that reduce ambiguity and lead to consistent, predictable outputs.

Treat it as engineering, not copywriting! ⚙️

Here's something I published on JSON prompting:
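For a flavour of what that can look like, here's a minimal hypothetical sketch (not the published piece referenced above; the task and field names are made up):

```python
import json

# A JSON-style prompt: role, task, constraints and output schema are pinned
# down explicitly, leaving far less room for ambiguity than free-form text.
prompt = {
    "role": "You are a support-ticket triage assistant.",
    "task": "Classify the ticket and extract key fields.",
    "constraints": [
        "Respond with JSON only",
        "category must be one of: bug, billing, feature_request",
    ],
    "output_schema": {"category": "string", "summary": "string", "urgency": "low|medium|high"},
    "ticket": "The app crashes every time I upload a PDF larger than 10 MB.",
}

print(json.dumps(prompt, indent=2))  # send this string to the LLM of your choice
```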
Sep 6
K-Means has two major problems:

- The number of clusters must be known
- It doesn't handle outliers

Here’s an algorithm that addresses both issues:
Introducing DBSCAN, a density-based clustering algorithm.

Simply put, DBSCAN groups together points in a dataset that are close to each other based on their spatial density.

It's very easy to understand; just follow along ...👇
DBSCAN has two important parameters.

1️⃣ Epsilon (eps):

`eps` represents the maximum distance between two points for them to be considered part of the same neighbourhood.

Points within this distance of each other are considered to be neighbours.

Check this out 👇
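For reference, here's what this looks like in code with scikit-learn. The dataset and parameter values are illustrative; `min_samples`, the number of neighbours needed to form a dense region, is DBSCAN's other key parameter:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy dataset with two crescent-shaped clusters plus noise:
# a case where K-Means struggles but DBSCAN does well.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# eps: max distance for two points to count as neighbours
# min_samples: neighbours required for a point to be a "core" point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("points flagged as noise:", (labels == -1).sum())
```

Notice that neither the number of clusters nor the outliers had to be specified up front: both fall out of the density structure of the data.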
Sep 4
Let's build a reasoning LLM, from scratch (100% local):
Today, we're going to learn how to turn any model into a reasoning powerhouse.

We'll do so without any labeled data or human intervention, using reinforcement fine-tuning with GRPO!

Tech stack:

- @UnslothAI for efficient fine-tuning
- @HuggingFace TRL to apply GRPO

Let's go! 🚀
What is GRPO?

Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.

Here's a brief overview of GRPO before we jump into code:
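As a taste of the "deterministic reward" idea, here's a minimal sketch of a rule-based reward function for math answers. The function name and signature are illustrative, not TRL's exact GRPO interface:

```python
import re

# Deterministic reward: 1.0 if the last number in the completion matches the
# known final answer, 0.0 otherwise. No human preference labels needed.
def correctness_reward(completions, answers):
    rewards = []
    for completion, answer in zip(completions, answers):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        predicted = numbers[-1] if numbers else None
        rewards.append(1.0 if predicted == str(answer) else 0.0)
    return rewards

print(correctness_reward(
    ["The total is 3 + 4 = 7", "I think the answer is 9"],
    [7, 7],
))  # [1.0, 0.0]
```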
Sep 2
4 stages of training LLMs from scratch, clearly explained (with visuals):
Today, we are covering the 4 stages of building LLMs from scratch to make them applicable for real-world use cases.

We'll cover:
- Pre-training
- Instruction fine-tuning
- Preference fine-tuning
- Reasoning fine-tuning

The visual summarizes these techniques.

Let's dive in!
0️⃣ Randomly initialized LLM

At this point, the model knows nothing.

You ask it “What is an LLM?” and get gibberish like “try peter hand and hello 448Sn”.

It hasn’t seen any data yet and possesses just random weights.

Check this 👇
Aug 30
A new embedding model cuts vector DB costs by ~200x.

It also outperforms OpenAI and Cohere models.

Let's understand how you can use it in LLM apps (with code):
Today, we'll use the voyage-context-3 embedding model by @VoyageAI to do RAG over audio data.

We'll also use:
- @MongoDB Atlas Vector Search as vector DB
- @AssemblyAI for transcription
- @llama_index for orchestration
- gpt-oss as the LLM

Let's begin!
For context...

voyage-context-3 is a contextualized chunk embedding model that produces chunk embeddings with full document context.

This is unlike common chunk embedding models that embed chunks independently.

(We'll discuss the results later in the thread)

Check this👇
Aug 29
I have been training neural networks for 10 years now.

Here are 16 techniques I actively use to optimize model training:
Before we dive in, the following visual covers what we are discussing today.

Let's understand them in detail below!
These are some basic techniques:

1) Use efficient optimizers—AdamW, Adam, etc.

2) Utilize hardware accelerators (GPUs/TPUs).

3) Max out the batch size.

4) Use multi-GPU training through Model/Data/Pipeline/Tensor parallelism. Check the visual👇
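As a minimal illustration of the first three points, here's a sketch of a single training step in PyTorch: an efficient optimizer (AdamW), a hardware accelerator if one is available, and a large batch. The model and data are placeholders:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # use the GPU when present

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1024, 512, device=device)          # large batch of fake inputs
y = torch.randint(0, 10, (1024,), device=device)   # fake labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"one training step done, loss = {loss.item():.3f}")
```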