Sebastian Raschka
Aug 28 · 3 tweets · 1 min read
I’ve been working on something new:
📚 Build a Reasoning Model (From Scratch).

The first chapters just went live!

(The book will cover topics from inference-time scaling to reinforcement learning)
If you want to check it out, it's available for pre-order @ManningBooks here:

You will get immediate access to the first chapters, each new chapter as it's released, and the full book once complete: mng.bz/Dwra
And it's currently 50% off for the first 2 weeks!

More from @rasbt

Jul 1, 2024
What's noteworthy in the newly released Gemma 2 LLMs?
The main theme is that they explore techniques without necessarily increasing the dataset sizes, focusing instead on developing relatively small & efficient LLMs.
There are 3 main design choices behind the 2B & 9B models:
1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window that allows the current token to attend to only a fixed number of previous tokens instead of all previous tokens.
2) Grouped-query attention (like in Llama 2 and 3): This can be regarded as a more generalized form of multi-query attention. The motivation is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements (see the sketch below).
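To make these two mechanisms concrete, here is a minimal PyTorch sketch. It is not Gemma 2's actual code; the window size, head counts, and function names are purely illustrative.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attention allowed: each query position attends only to itself
    # and the previous (window - 1) positions instead of the full causal prefix.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, n_query_heads: int):
    # k, v: (batch, n_kv_heads, seq_len, head_dim). Each key/value head is
    # shared by a group of query heads, so we repeat it to match the query heads.
    group_size = n_query_heads // k.shape[1]
    return (k.repeat_interleave(group_size, dim=1),
            v.repeat_interleave(group_size, dim=1))

mask = sliding_window_causal_mask(seq_len=8, window=4)  # (8, 8) boolean mask
k = torch.randn(1, 2, 8, 16)  # 2 key/value heads (illustrative sizes)
v = torch.randn(1, 2, 8, 16)
k_exp, v_exp = expand_kv_for_gqa(k, v, n_query_heads=8)  # -> (1, 8, 8, 16)
```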
Jun 24, 2024
I read about a fascinating hack to generate a high-quality dataset for LLM instruction finetuning this weekend. It's a fully automated way that doesn't require any seed questions and even runs locally. How does it work?
Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you.
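Here is a minimal sketch of that two-step loop using Hugging Face transformers. The pre-query string below is Llama 3's user-turn header; treat the exact template, model ID, and sampling settings as assumptions rather than the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Step 1: prompt with only the "pre-query" part of the chat template;
# the model completes it with a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)

# Step 2: feed the generated instruction back as a regular user turn
# to obtain the corresponding response.
chat = [{"role": "user", "content": instruction}]
prompt_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True,
                                           return_tensors="pt")
out = model.generate(prompt_ids, max_new_tokens=256)
response = tokenizer.decode(out[0][prompt_ids.shape[1]:], skip_special_tokens=True)
```

Repeating this loop many times (plus filtering) yields an instruction-response dataset without any seed questions.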
What's fascinating is that with the resulting instruction dataset, you can finetune a Llama 3 8B base model with just instruction finetuning, no preference finetuning via RLHF and DPO, and it still beats the original Llama 3 8B Instruct model by Meta AI.
For more information, see the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper: arxiv.org/abs/2406.08464
May 25, 2024
It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning comes out. In "MoRA: High-Rank Updating for Parameter-Efficient Finetuning" (arxiv.org/abs/2405.12130), the authors take a related yet opposite approach to low-rank adaptation.

1/7
Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity.

2/7
So, in the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for both instruction finetuning AND absorbing new knowledge in continued pretraining.
3/7
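As a rough sketch of the contrast (illustrative shapes and numbers, not the paper's code): LoRA learns two skinny matrices whose product is low-rank, whereas MoRA spends the same parameter budget on one small square matrix of much higher rank, paired with non-trainable compress/decompress operators that map to and from the model's hidden dimension.

```python
import math
import torch

d = 4096  # hidden dimension of the layer being adapted (illustrative)
r = 8     # LoRA rank

# LoRA: delta_W = B @ A, with B (d x r) and A (r x d) trainable.
A = torch.zeros(r, d)
B = torch.zeros(d, r)
lora_params = A.numel() + B.numel()  # 2 * d * r = 65,536 trainable parameters

# MoRA: spend roughly the same budget on a single square matrix M (r_hat x r_hat);
# fixed, non-trainable compress/decompress operators map between d and r_hat.
r_hat = int(math.sqrt(lora_params))  # 256 here -- a much higher rank than 8
M = torch.zeros(r_hat, r_hat)
assert M.numel() <= lora_params

print(f"LoRA update rank: {r}, MoRA square-matrix rank: {r_hat}")
```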
Jul 9, 2023
In the last couple of days, we talked a lot about extending the context window of transformer LLMs. Here's one more: "Extending Context Window of Large Language Models via Positional Interpolation"

1/3
Rotary positional embeddings (aka RoPE) have been a recent cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.

2/3
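Conceptually, Position Interpolation just rescales the position indices before the RoPE angles are computed, so positions in the extended window map back into the range seen during pretraining. A minimal sketch (function names and values are illustrative, not the paper's code):

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return positions[:, None].float() * inv_freq[None, :]  # (seq_len, head_dim // 2)

trained_len, target_len = 2048, 32768
positions = torch.arange(target_len)

# Position Interpolation: scale positions down by trained_len / target_len so the
# extended context reuses the positional range the model was pretrained on.
scaled_positions = positions * (trained_len / target_len)
angles = rope_angles(scaled_positions, head_dim=128)
```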
This sounds a bit more humble than the recent 1M or 1B token context-length claims.

But it requires only minimal finetuning (1,000 steps), and it enables long-document summarization with LLaMA 7B and 65B models, for example.

Link to the paper here: arxiv.org/abs/2306.15595

3/3
Jun 5, 2023
I often like to share AI research articles I find interesting. Let me change it up a bit and share one of our own articles that just passed peer review & got accepted.

Our method allows you to use any model (LLM, CNN, ViT, ...) for ordinal regression on ordered labels.

1/6
In short, it works by using a new loss function during training, such that each node in the output layer represents a probability (whether the target label exceeds a given rank). This can then be converted into the predicted rank label.

2/6
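A minimal sketch of that conversion step (illustrative; the authors provide a reference implementation in the coral-pytorch library):

```python
import torch

def corn_logits_to_labels(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, num_classes - 1), one output node per rank threshold.
    # Each sigmoid is the conditional probability P(y > r_k | y > r_{k-1});
    # the cumulative product turns these into unconditional P(y > r_k).
    probs = torch.cumprod(torch.sigmoid(logits), dim=1)
    # Predicted rank = number of thresholds the example is predicted to exceed.
    return (probs > 0.5).sum(dim=1)

logits = torch.tensor([[2.0, 1.0, -0.5, -2.0]])  # 5 ordered classes -> 4 thresholds
print(corn_logits_to_labels(logits))              # tensor([2])
```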
This CORN (Conditional Ordinal Regression for Neural Networks) loss works with any model that is typically trained with a cross-entropy loss. But yeah, on ordered labels it can perform much better than standard cross entropy.

3/6
May 30, 2023
What happens if we train LLMs for multiple epochs?

The question I asked multiple times in the past finally got answered in this new preprint,
"To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis".

1/6
First, why would we want to consider training LLMs for multiple epochs given the enormous amounts of data? It turns out that high-quality text data on the internet is growing more slowly than required. Also, if copyrighted material is removed in the future, the available data might shrink even further.

2/6
So, why not train for multiple epochs on the existing data?

The result is that training for multiple epochs leads to overfitting: it gets more severe the larger the model and the smaller the dataset -- this is consistent with common deep learning experience.

3/6
