You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
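In a standard transformer, attention memory grows quadratically with sequence length, so 8x more tokens means 64x more memory for the attention scores. Going from 2K to 128K is a 64x jump in length, which works out to roughly 4,096x more attention memory. Refer to the image below!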
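To make the quadratic growth concrete, here's a back-of-the-envelope sketch. It assumes the full score matrix is materialized for one layer, with 32 heads and 2 bytes per value; those numbers and the function name are illustrative assumptions, not measurements of any real model.

```python
def attention_scores_bytes(seq_len, num_heads=32, bytes_per_value=2):
    """Bytes needed for one layer's full attention score matrices.

    Each head stores a seq_len x seq_len matrix, so the cost is quadratic
    in seq_len. num_heads and bytes_per_value are assumed values.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


base = attention_scores_bytes(2_048)
for n in (2_048, 16_384, 131_072):
    cost = attention_scores_bytes(n)
    # Ratio to the 2K baseline grows with the square of the length ratio.
    print(f"{n:>7} tokens: {cost / 2**30:10.2f} GiB ({cost // base}x the 2K cost)")
```

The 8x step (2K to 16K) lands on 64x, and the full 64x step (2K to 128K) lands on 4,096x.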
So, how do we manage it?
Let's continue...
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But there's a trade-off: the fewer tokens each position attends to, the cheaper the computation, and the more long-range context the model can miss.
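The local-attention variant can be sketched with a simple sliding-window mask. This is a minimal sketch in plain Python: the function names and the window sizes are assumptions for illustration, and a real implementation would apply the mask inside the attention kernel rather than build lists.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def count_attended_pairs(seq_len, window):
    """Number of (query, key) pairs kept by the mask, without building it."""
    return sum(
        min(seq_len - 1, i + window) - max(0, i - window) + 1
        for i in range(seq_len)
    )


# Small example: each row is a query, 'x' marks the keys it can see.
for row in local_attention_mask(seq_len=8, window=2):
    print("".join("x" if allowed else "." for allowed in row))

# At long context the saving is large: cost grows linearly with seq_len.
n, w = 131_072, 512
kept = count_attended_pairs(n, w)
print(f"local attention keeps {kept / (n * n):.4%} of the full {n} x {n} pairs")
```

The band in the printout is the saving, and also the limitation: anything outside it is invisible to that token.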
dLLM is a Python library that unifies the training & evaluation of diffusion language models.
You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute.
100% open-source.
Here's why this matters:
Traditional autoregressive models generate text left-to-right, one token at a time. Diffusion models work differently - they refine the entire sequence iteratively, giving you better control over generation quality and more flexible editing capabilities.
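Here's a toy sketch of the "refine the whole sequence" idea, in the style of masked diffusion decoding: start fully masked, then commit the most confident positions at each step. It does NOT use the dLLM API. `toy_model`, its hard-coded sentence, and its confidence scores are hypothetical stand-ins for a trained denoiser, and a real model would recompute its predictions at every step.

```python
MASK = "[MASK]"


def toy_model(tokens):
    """Hypothetical denoiser: returns {position: (token, confidence)} for masked slots.

    The target sentence and confidences are hard-coded purely for illustration.
    """
    target = ["the", "cat", "sat", "on", "the", "mat"]
    confidence = [0.90, 0.40, 0.70, 0.95, 0.80, 0.30]
    return {i: (target[i], confidence[i]) for i, t in enumerate(tokens) if t == MASK}


def diffusion_style_decode(length=6, tokens_per_step=2):
    """Fill a fully masked sequence in any order, a few positions per step."""
    tokens = [MASK] * length
    steps = [list(tokens)]
    while MASK in tokens:
        proposals = toy_model(tokens)
        # Commit only the most confident proposals; the rest stay masked.
        most_confident = sorted(proposals, key=lambda i: -proposals[i][1])[:tokens_per_step]
        for i in most_confident:
            tokens[i] = proposals[i][0]
        steps.append(list(tokens))
    return steps


for step in diffusion_style_decode():
    print(" ".join(step))
```

Notice the positions fill in out of order, which is what separates this from left-to-right generation.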
You're in a Research Scientist interview at Google.
Interviewer: We have a base LLM that's terrible at maths. How would you turn it into a maths & reasoning powerhouse?
You: I'll get some problems labeled and fine-tune the model.
Interview over.
Here's what you missed:
When outputs are verifiable, labels become optional.
Maths, code, and logic can be automatically checked and validated.
Let's use this fact to build a reasoning model without manual labelling.
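A verifiable reward can be as small as this. It's a minimal sketch, assuming toy multiplication problems whose answer can be computed directly; `extract_final_number` and `correctness_reward` are hypothetical helper names, not part of Unsloth or TRL.

```python
import re


def extract_final_number(completion):
    """Return the last number in the model's text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(matches[-1]) if matches else None


def correctness_reward(completion, a, b):
    """1.0 if the completion's final number equals a * b, else 0.0.

    The ground truth is computed by the checker, so no human label is needed.
    """
    answer = extract_final_number(completion)
    return 1.0 if answer is not None and answer == float(a * b) else 0.0


print(correctness_reward("17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391", 17, 23))
print(correctness_reward("The answer is 381.", 17, 23))
print(correctness_reward("I'm not sure.", 17, 23))
```

The same pattern extends to code (run the tests) and logic (check the constraints).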
We'll use:
- @UnslothAI for parameter-efficient finetuning.
- @HuggingFace TRL to apply GRPO.
Let's go!
What is GRPO?
Group Relative Policy Optimization (GRPO) is a reinforcement learning method that fine-tunes LLMs on math and reasoning tasks using deterministic reward functions, so it doesn't need manually labeled data.
Here's a brief overview of GRPO before we jump into code:
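The "group relative" part can be sketched numerically: sample several completions for the same prompt, score each one with a reward function, then rate each completion against its own group's average. This is a minimal sketch, not the TRL implementation; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    Completions above the group average get a positive advantage and are
    reinforced; those below get a negative one. eps avoids division by zero
    when every completion in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

If every completion in a group gets the same reward, all advantages are zero and that prompt gives no learning signal.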