Sebastian Raschka
Aug 28 · 3 tweets · 1 min read
I’ve been working on something new:
📚 Build a Reasoning Model (From Scratch).

The first chapters just went live!

(The book will cover topics from inference-time scaling to reinforcement learning)
If you want to check it out, it's available for pre-order @ManningBooks here:

You will get immediate access to the first chapters, each new chapter as it's released, and the full book once complete: mng.bz/Dwra
And it's currently 50% off for the first 2 weeks!

More from @rasbt

Jul 1, 2024
What's noteworthy in the newly released Gemma 2 LLMs?
The main theme is that they explore techniques without necessarily increasing the dataset sizes, focusing instead on developing relatively small & efficient LLMs.
There are 3 main design choices behind the 2B & 9B models:
1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window that allows the current token to attend to only a fixed number of previous tokens instead of all previous tokens.
2) Grouped-query attention (like in Llama 2 and 3): This can be regarded as a more generalized form of multi-query attention. The motivation is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements (see the sketch below).
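To make these two mechanisms concrete, here is a minimal PyTorch sketch. It is not Gemma 2's actual code; the window size, head counts, and function names are purely illustrative.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attention allowed: each query position attends only to itself
    # and the previous (window - 1) positions instead of the full causal prefix.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, n_query_heads: int):
    # k, v: (batch, n_kv_heads, seq_len, head_dim). Each key/value head is
    # shared by a group of query heads, so we repeat it to match the query heads.
    group_size = n_query_heads // k.shape[1]
    return (k.repeat_interleave(group_size, dim=1),
            v.repeat_interleave(group_size, dim=1))

mask = sliding_window_causal_mask(seq_len=8, window=4)  # (8, 8) boolean mask
k = torch.randn(1, 2, 8, 16)  # 2 key/value heads (illustrative sizes)
v = torch.randn(1, 2, 8, 16)
k_exp, v_exp = expand_kv_for_gqa(k, v, n_query_heads=8)  # -> (1, 8, 8, 16)
```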
Jun 24, 2024
I read about a fascinating hack to generate a high-quality dataset for LLM instruction finetuning this weekend. It's a fully automated way that doesn't require any seed questions and even runs locally. How does it work?
Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you.
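Here is a minimal sketch of that two-step loop using Hugging Face transformers. The pre-query string below is Llama 3's user-turn header; treat the exact template, model ID, and sampling settings as assumptions rather than the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Step 1: prompt with only the "pre-query" part of the chat template;
# the model completes it with a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)

# Step 2: feed the generated instruction back as a regular user turn
# to obtain the corresponding response.
chat = [{"role": "user", "content": instruction}]
prompt_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True,
                                           return_tensors="pt")
out = model.generate(prompt_ids, max_new_tokens=256)
response = tokenizer.decode(out[0][prompt_ids.shape[1]:], skip_special_tokens=True)
```

Repeating this loop many times (plus filtering) yields an instruction-response dataset without any seed questions.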
What's fascinating is that with the resulting instruction dataset, you can finetune a Llama 3 8B base model with just instruction finetuning, no preference finetuning via RLHF and DPO, and it still beats the original Llama 3 8B Instruct model by Meta AI.
For more information, see the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper: arxiv.org/abs/2406.08464
May 25, 2024
It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning comes out. In "MoRA: High-Rank Updating for Parameter-Efficient Finetuning" (arxiv.org/abs/2405.12130), the authors take a related yet opposite approach to low-rank adaptation.

1/7
Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity.

2/7
So, in the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for both instruction finetuning AND absorbing new knowledge in continued pretraining.
3/7
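As a rough sketch of the contrast (illustrative shapes and numbers, not the paper's code): LoRA learns two skinny matrices whose product is low-rank, whereas MoRA spends the same parameter budget on one small square matrix of much higher rank, paired with non-trainable compress/decompress operators that map to and from the model's hidden dimension.

```python
import math
import torch

d = 4096  # hidden dimension of the layer being adapted (illustrative)
r = 8     # LoRA rank

# LoRA: delta_W = B @ A, with B (d x r) and A (r x d) trainable.
A = torch.zeros(r, d)
B = torch.zeros(d, r)
lora_params = A.numel() + B.numel()  # 2 * d * r = 65,536 trainable parameters

# MoRA: spend roughly the same budget on a single square matrix M (r_hat x r_hat);
# fixed, non-trainable compress/decompress operators map between d and r_hat.
r_hat = int(math.sqrt(lora_params))  # 256 here -- a much higher rank than 8
M = torch.zeros(r_hat, r_hat)
assert M.numel() <= lora_params

print(f"LoRA update rank: {r}, MoRA square-matrix rank: {r_hat}")
```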
Jul 9, 2023
In the last couple of days, we talked a lot about extending the context window of transformer LLMs. Here's one more: "Extending Context Window of Large Language Models via Positional Interpolation"

1/3
Rotary positional embeddings (aka RoPE) have been a recent cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.

2/3
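Conceptually, Position Interpolation just rescales the position indices before the RoPE angles are computed, so positions in the extended window map back into the range seen during pretraining. A minimal sketch (function names and values are illustrative, not the paper's code):

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return positions[:, None].float() * inv_freq[None, :]  # (seq_len, head_dim // 2)

trained_len, target_len = 2048, 32768
positions = torch.arange(target_len)

# Position Interpolation: scale positions down by trained_len / target_len so the
# extended context reuses the positional range the model was pretrained on.
scaled_positions = positions * (trained_len / target_len)
angles = rope_angles(scaled_positions, head_dim=128)
```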
This sounds a bit more humble than the recent 1M or 1B token context-length claims.

But it requires only minimal finetuning (1,000 steps), and it enables long-document summarization with LLaMA 7B and 65B models, for example.

Link to the paper here: arxiv.org/abs/2306.15595

3/3
Jun 5, 2023
I often like to share AI research articles I find interesting. Let me change it up a bit and share one of our own articles that just passed peer review & got accepted.

Our method allows you to use any model (LLM, CNN, ViT, ...) for ordinal regression on ordered labels.

1/6
In short, it works by using a new loss function during training, such that each node in the output layer represents a probability (whether the target label exceeds a given rank). This can then be converted into the predicted rank label.

2/6
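A minimal sketch of that conversion step (illustrative; the authors provide a reference implementation in the coral-pytorch library):

```python
import torch

def corn_logits_to_labels(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, num_classes - 1), one output node per rank threshold.
    # Each sigmoid is the conditional probability P(y > r_k | y > r_{k-1});
    # the cumulative product turns these into unconditional P(y > r_k).
    probs = torch.cumprod(torch.sigmoid(logits), dim=1)
    # Predicted rank = number of thresholds the example is predicted to exceed.
    return (probs > 0.5).sum(dim=1)

logits = torch.tensor([[2.0, 1.0, -0.5, -2.0]])  # 5 ordered classes -> 4 thresholds
print(corn_logits_to_labels(logits))              # tensor([2])
```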
This CORN (Conditional Ordinal Regression for Neural Networks) loss works with any model that is typically trained with a cross-entropy loss. But yeah, on ordered labels it can perform much better than standard cross entropy.

3/6
May 30, 2023
What happens if we train LLMs for multiple epochs?

The question I asked multiple times in the past finally got answered in this new preprint,
"To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis".

1/6
First, why would we want to consider training LLMs for multiple epochs given the enormous amounts of data? It turns out that high-quality text data on the internet is growing more slowly than required. Also, if copyrighted material is removed in the future, the available data might shrink even further.

2/6
So, why not train for multiple epochs on the existing data?

The result is that training for multiple epochs leads to overfitting: it gets more severe the larger the model and the smaller the dataset -- this is consistent with common deep learning experience.

3/6
