You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
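In a standard transformer, every token attends to every other token, so the attention score matrix grows quadratically with sequence length: 8x more tokens needs 64x more memory for it. Going from 2K to 128K is 64x more tokens, which means roughly 4096x more attention memory. Refer to the image below!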
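To make the quadratic growth concrete, here is a minimal sketch of the arithmetic. It assumes a naive implementation that stores the full score matrix in fp16 (2 bytes per score) for a single head in a single layer; the function name and the defaults are illustrative, not taken from any library.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed for one full seq_len x seq_len attention score matrix.

    Assumption: fp16 scores (2 bytes each), one head, one layer, no tiling.
    """
    return seq_len * seq_len * bytes_per_score


base = attention_matrix_bytes(2 * 1024)
for n in (2 * 1024, 16 * 1024, 128 * 1024):
    size = attention_matrix_bytes(n)
    # Report the size in MiB and the growth relative to the 2K baseline.
    print(f"{n:>7} tokens: {size / 2**20:>9.0f} MiB  ({size // base}x)")
```

Real models multiply this by the number of heads and layers, which is why the naive approach stops being practical long before 128K.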
So, how do we manage it?
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But there's a trade-off: the fewer tokens each position attends to, the cheaper the computation, but the more long-range context the model can miss.
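The local-attention variant can be sketched as a simple mask. This is a toy illustration in plain Python, assuming a symmetric window of `window` neighbours on each side; the function names are made up for the example, and real implementations build the mask on the GPU or skip the masked blocks entirely.

```python
def local_attention_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def count_allowed(mask: list[list[bool]]) -> int:
    """Number of attention scores that actually need to be computed."""
    return sum(sum(row) for row in mask)


mask = local_attention_mask(seq_len=8, window=1)
for row in mask:
    # "x" marks an allowed pair, "." a skipped one.
    print("".join("x" if allowed else "." for allowed in row))
print(count_allowed(mask), "of", 8 * 8, "scores computed")
```

With a fixed window, the number of computed scores grows roughly linearly with sequence length instead of quadratically. The cost is that tokens outside the window are invisible to each other within a single layer.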
dLLM is a Python library that unifies the training & evaluation of diffusion language models.
You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute.
100% open-source.
Here's why this matters:
Traditional autoregressive models generate text left-to-right, one token at a time. Diffusion language models work differently: they refine the entire sequence over several steps, updating many positions at once. That can give you more control over the quality/speed trade-off and makes it easier to edit text in place.
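To show the difference in decoding order, here is a toy sketch of iterative unmasking. It does not use dLLM or any real model: `toy_denoiser` is a hypothetical stand-in that already knows the target text and attaches a random confidence to each proposal, so the only thing the example demonstrates is that positions get filled in parallel and out of order.

```python
import random

MASK = "[MASK]"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a diffusion LM.

    For every masked position it proposes a token (here simply the known
    target token) together with a random confidence score.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def diffusion_decode(target, steps, seed=0):
    """Start fully masked, then commit the most confident proposals each step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = [list(tokens)]
    for _ in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Keep only the highest-confidence proposals for this step.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return history


for state in diffusion_decode("the cat sat on the mat".split(), steps=3):
    print(" ".join(state))
```

An autoregressive decoder would need six sequential steps for these six tokens; the toy loop above finishes in three because each step fills two positions.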
You're in a Research Scientist interview at Google.
Interviewer: We have a base LLM that's terrible at maths. How would you turn it into a maths & reasoning powerhouse?
You: I'll get some problems labeled and fine-tune the model.
Interview over.
Here's what you missed:
When outputs are verifiable, labels become optional.
Maths, code, and logic can be automatically checked and validated.
Let's use this fact to build a reasoning model without manual labelling.
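Here is a minimal sketch of what "verifiable" means in practice. It assumes the problems are generated programmatically, so the correct answer is known by construction, and it treats the last integer in the model's output as its final answer. Both `make_problem` and `correctness_reward` are hypothetical helpers written for this example, not part of TRL or Unsloth.

```python
import random
import re


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Generate an arithmetic question whose answer is known by construction."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"What is {a} * {b}?", a * b


def correctness_reward(completion: str, expected: int) -> float:
    """Return 1.0 if the last integer in the completion equals `expected`.

    Assumption: the final number in the text is the model's answer.
    """
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0


rng = random.Random(0)
question, answer = make_problem(rng)
print(question)
print(correctness_reward(f"Let me work it out. The answer is {answer}", answer))
print(correctness_reward("I am not sure.", answer))
```

No human wrote a solution here: the checker only needs the final answer, and the reasoning that leads to it is left for the model to discover.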
We'll use:
- @UnslothAI for parameter-efficient finetuning.
- @HuggingFace TRL to apply GRPO.
Let's go! 🚀
What is GRPO?
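Group Relative Policy Optimization is a reinforcement learning method for fine-tuning LLMs on maths and reasoning tasks. For each prompt, the model samples a group of completions, scores them with reward functions, and reinforces the completions that score above the group average. When the rewards are programmatic checks, you don't need human-labeled reasoning traces.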
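The "group relative" part comes down to normalizing each reward against the other completions sampled for the same prompt. A minimal sketch in plain Python follows; the function name, the use of the population standard deviation, and the `eps` value are choices made for this example, and library implementations may differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own group.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two passed the check, two failed.
rewards = [1.0, 0.0, 0.0, 1.0]
for r, adv in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={r:.1f}  advantage={adv:+.3f}")
```

Completions that beat the group average get a positive advantage and are made more likely; the rest get a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.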
Here's a brief overview of GRPO before we jump into code: