Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Akshay 🚀

@akshay_pachaar

Feb 5, 2025 • 13 tweets • 4 min read • Read on X

Scrolly

How LLMs work, clearly explained:

Before diving into LLMs, we must understand conditional probability.

Let's consider a population of 14 individuals:

- Some of them like Tennis 🎾
- Some like Football ⚽️
- A few like both 🎾 ⚽️
- And few like none

Here's how it looks 👇

So what is Conditional probability ⁉️

It's a measure of the probability of an event given that another event has occurred.

If the events are A and B, we denote this as P(A|B).

This reads as "probability of A given B"

Check this illustration 👇

For instance, if we're predicting whether it will rain today (event A), knowing that it's cloudy (event B) might impact our prediction.

As it's more likely to rain when it's cloudy, we'd say the conditional probability P(A|B) is high.

That's conditional probability for you! 🎉

Now, how does this apply to LLMs like GPT-4❓

These models are tasked with predicting the next word in a sequence.

This is a question of conditional probability: given the words that have come before, what is the most likely next word?

To predict the next word, the model calculates the conditional probability for each possible next word, given the previous words (context).

The word with the highest conditional probability is chosen as the prediction.

The LLM learns a high-dimensional probability distribution over sequences of words.

And the parameters of this distribution are the trained weights!

The training or rather pre-training** is supervised.

I'll talk about the different training steps next time!**

Check this 👇

But there a problem❗️

If we always pick the word with the highest probability, we end up with repetitive outputs, making LLMs almost useless and stifling their creativity.

This is where temperature comes into picture.

Check this before we understand more about it...👇

However a high temperate value produces gibberish

Let's understand what's going on...👇

So, the LLMs instead of selecting the best token (for simplicity let's think of tokens as words), they "sample" the prediction.

So even if “Token 1” has the highest score, it may not be chosen since we are sampling.

Now, temperature introduces the following tweak in the softmax function, which, in turn, influences the sampling process:

Let take a code example!

At low temperature, probabilities concentrate around the most likely token, resulting in nearly greedy generation.

At high temperature, probabilities become more uniform, producing highly random and stochastic outputs.

Check this out👇

That's a wrap!

Hopefully, this guide has demystified some of the magic behind LLMs.

And, if you enjoyed this breakdown:

Find me → @akshay_pachaar ✔️
For more insights and tutorials on AI and Machine Learning.

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @akshay_pachaar

Akshay 🚀

@akshay_pachaar

Jun 7

Google just dropped a new LLM!

You can run it locally on just 8GB RAM.

Let's fine-tune this on our own data (100% locally):

Google released Gemma 4 12B, a multimodal model that runs text, images, and audio on 8GB VRAM!

We'll fine-tune it to master chess and predict the exact next move.

Tech stack:
- @UnslothAI for efficient fine-tuning.
- @huggingface transformers to run it locally.

Let's go! 🚀

1️⃣ Load the model

We start by loading Gemma 4 12B and its tokenizer using Unsloth.

Check this 👇

Read 10 tweets

Akshay 🚀

@akshay_pachaar

Jun 3

You're in a Research Scientist interview at OpenAI.

The interviewer asks:

"How would you expand the context length of an LLM from 2K to 128K tokens?"

You: "I will fine-tune the model on longer docs with 128K context."

Interview over.

Here's what you missed:

Extending the context window isn't just about larger matrices.

In a traditional transformer, expanding tokens by 8x increases memory needs by 64x due to the quadratic complexity of attention. Refer to the image below!

So, how do we manage it?

continue...👇

1) Sparse Attention

It limits the attention computation to a subset of tokens by:

- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.

But this has a trade-off between computational complexity and performance.

Read 12 tweets

Akshay 🚀

@akshay_pachaar

Dec 18, 2025

Turn any Autoregressive LLM into a Diffusion LM.

dLLM is a Python library that unifies the training & evaluation of diffusion language models.

You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute.

100% open-source.

Here's why this matters:

Traditional autoregressive models generate text left-to-right, one token at a time. Diffusion models work differently - they refine the entire sequence iteratively, giving you better control over generation quality and more flexible editing capabilities.

dLLM GitHub:

(don't forget to star 🌟)github.com/ZHZisZZ/dllm

Read 4 tweets

Akshay 🚀

@akshay_pachaar

Dec 6, 2025

You're in a Research Scientist interview at Google.

Interviewer: We have a base LLM that's terrible at maths. How would you turn it into a maths & reasoning powerhouse?

You: I'll get some problems labeled and fine-tune the model.

Interview over.

Here's what you missed:

When outputs are verifiable, labels become optional.

Maths, code, and logic can be automatically checked and validated.

Let's use this fact to build a reasoning model without manual labelling.

We'll use:

- @UnslothAI for parameter-efficient finetuning.
- @HuggingFace TRL to apply GRPO.

Let's go! 🚀

What is GRPO?

Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.

Here's a brief overview of GRPO before we jump into code:

Read 11 tweets

Akshay 🚀

@akshay_pachaar

Dec 5, 2025

I have been training neural networks for 10 years now.

Here are 16 ways I actively use to optimize model training:

(detailed explanation ...🧵)

First, lets look at some basic techniques:

1) Use efficient optimizers—AdamW, Adam, etc.

2) Utilize hardware accelerators (GPUs/TPUs).

3) Max out the batch size.

4) Use multi-GPU training through Model/Data/Pipeline/Tensor parallelism.

Check the visual👇

5) Bayesian optimization for hyperparameter optimization:

This technique takes informed steps based on the results of the previous hyperparameter configs.

This way, the model converges to an optimal set of hyperparameters much faster.

Check these results 👇

Read 9 tweets

Akshay 🚀

@akshay_pachaar

Nov 23, 2025

You’re in an ML Engineer interview at Google.

Interviewer: We need to train an LLM across 1,000 GPUs. How would you make sure all GPUs share what they learn?

You: Use a central parameter server to aggregate and redistribute the weights.

Interview over.

Here’s what you missed:

One major run-time bottleneck in multi-GPU training happens during GPU synchronization.

For instance, in multi-GPU training via data parallelism:

- The same model is distributed to different GPUs.
- Each GPU processes a different subset of the whole dataset.

Check this 👇

This leads to different gradients across different devices.

So, before updating the model parameters on each GPU device, we must communicate the gradients to all other devices to sync them.

Let’s understand 2 common strategies next!

Read 14 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Akshay 🚀

Try unrolling a thread yourself!

More from @akshay_pachaar

Akshay 🚀

Akshay 🚀

Akshay 🚀

Akshay 🚀

Akshay 🚀

Akshay 🚀

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!