Sebastian Raschka
AI & ML researcher. Author of the "Build a Large Language Model From Scratch" book (https://t.co/I4JVlXTbTw). LLM research engineer @LightningAI.
Jul 1 8 tweets 3 min read
What's noteworthy in the newly released Gemma 2 LLMs?
The main theme is that the authors explore techniques for building relatively small & efficient LLMs without necessarily increasing the dataset sizes.
There are 3 main design choices to create the 2B & 9B models: Image 1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window that allows the current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.
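Not from the Gemma 2 report, but here's a minimal sketch of what a sliding-window causal attention mask looks like (the toy `seq_len` and `window_size` are just for illustration):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean mask: True where a query token may attend to a key token.

    Each token attends only to itself and the `window_size - 1` tokens
    before it (causal + local), instead of all previous tokens.
    """
    q_idx = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    k_idx = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    causal = k_idx <= q_idx                      # no attending to future tokens
    local = (q_idx - k_idx) < window_size        # no attending too far back
    return causal & local

mask = sliding_window_causal_mask(seq_len=6, window_size=3)
print(mask.int())  # each row has at most 3 ones: token i attends to i-2, i-1, i
```

In practice, this mask is applied to the attention scores (e.g., via `masked_fill`) before the softmax.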
Jun 24 4 tweets 2 min read
I read about a fascinating hack to generate a high-quality dataset for LLM instruction finetuning this weekend. It's a fully automated way that doesn't require any seed questions and even runs locally. How does it work? Image Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you. Image
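Roughly how that looks in code; this is a minimal sketch assuming a local Hugging Face `transformers` setup, and the pre-query template string is written from memory of the Llama 3 chat format (double-check it against the official tokenizer config; it's not an exact reproduction of the authors' pipeline):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# 1) Feed only the "pre-query" part of the chat template, i.e., everything up to
#    where a user message would normally start. The model completes it with a
#    plausible instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 2) Feed the generated instruction back as a regular user message to get the response.
chat = [{"role": "user", "content": instruction}]
prompt_ids = tok.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(prompt_ids, max_new_tokens=256, do_sample=True)
response = tok.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)

print(instruction, response, sep="\n---\n")
```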
May 25 8 tweets 3 min read
It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning comes out. In "MoRA: High-Rank Updating for Parameter-Efficient Finetuning" (arxiv.org/abs/2405.12130), the authors take a related yet opposite approach to low-rank adaptation.
1/7
Image Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity. 2/7 Image
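As a refresher, here's my own minimal LoRA sketch (not code from the MoRA paper): the weight update is constrained to a rank-r product, which is exactly where the limited capacity comes from.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a rank-r update: W x + (alpha/r) * B A x."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # delta_W = B @ A has rank at most r, so the update is low-rank
        return self.linear(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 4096
```

MoRA instead learns a square matrix (high rank) together with compression/decompression operators to keep the parameter count comparable.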
Jul 9, 2023 4 tweets 1 min read
In the last couple of days, we talked a lot about extending the context window of transformer LLMs. Here's one more: "Extending Context Window of Large Language Models via Positional Interpolation"

1/3 Rotary positional embeddings (aka RoPE) have been a recent cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.

2/3
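The core trick is simple: instead of extrapolating to unseen position indices, positions are linearly rescaled (interpolated) into the original training range before computing the rotary angles. A minimal sketch of that idea (not the authors' code; the toy context sizes are illustrative):

```python
import torch

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """RoPE rotation angles; `scale` < 1 implements position interpolation."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Position interpolation: squeeze positions back into the trained range,
    # e.g., scale = 2048 / 32768 when extending a 2k-context model to 32k.
    pos = positions.float() * scale
    return torch.outer(pos, inv_freq)   # (seq_len, dim/2) angles

train_ctx, target_ctx = 2048, 32768
angles = rope_angles(torch.arange(target_ctx), scale=train_ctx / target_ctx)
print(angles.shape, angles[-1, 0])  # the largest rescaled position maps to ~2047
```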
Jun 5, 2023 5 tweets 2 min read
I often like to share AI research articles I find interesting. Let me change it up a bit and share one of our own articles that just passed peer review & got accepted.

Our method allows you to use any model (LLM, CNN, ViT, ...) for ordinal regression on ordered labels.

1/6 Image In short, it works by using a new loss function during training, such that each node in the output layer represents the probability that the target label exceeds a given rank. This can then be converted into the predicted rank label.

2/6 Image
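A generic sketch of that extended-binary-classification idea (not the exact loss from our paper; the toy shapes are illustrative): the model outputs K-1 logits per example, each answering "does the label exceed rank k?", and the predicted rank is the number of thresholds exceeded.

```python
import torch
import torch.nn.functional as F

def labels_to_levels(y, num_classes):
    """Rank label -> binary vector: level k is 1 if y > k (k = 0..K-2)."""
    return (y.unsqueeze(1) > torch.arange(num_classes - 1)).float()

def ordinal_loss(logits, y, num_classes):
    """Binary cross-entropy over the K-1 'does the label exceed rank k?' tasks."""
    levels = labels_to_levels(y, num_classes)
    return F.binary_cross_entropy_with_logits(logits, levels)

def logits_to_rank(logits):
    """Count how many exceedance probabilities are > 0.5 to get the rank."""
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)

num_classes = 5                             # e.g., ordered ratings 0..4
logits = torch.randn(8, num_classes - 1)    # model output: K-1 nodes per example
y = torch.randint(0, num_classes, (8,))
print(ordinal_loss(logits, y, num_classes), logits_to_rank(logits))
```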
May 30, 2023 6 tweets 2 min read
What happens if we train LLMs for multiple epochs?

The question I asked multiple times in the past finally got answered in this new preprint,
"To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis".

1/6 Image First, why would we want to consider training LLMs for multiple epochs given the enormous amounts of data? Turns that high-quality text data on the internet is more slower than required. Also, if copyrighted material is removed in the future, this it might shrink further

2/6 Image
Apr 25, 2023 7 tweets 2 min read
Instruction finetuning is how we get from GPT-3-like pretrained base models to more capable LLMs like ChatGPT. This requires human-generated instruction data like databricks-dolly-15k.
So, how do we scale this? One way is bootstrapping an LLM off its own generations.

1/7 Image Self-Instruct is one (almost annotation-free) way to align pretrained LLMs with instructions as illustrated in the figure above.

2/7
Apr 11, 2023 6 tweets 2 min read
In recent days, I shared various resources on finetuning large language models. Multiple people reached out asking what I think about in-context learning.

1/6 Image In-context learning (= providing examples of the task are provided in the input) is actually super useful when labeled data is scarce or inaccessible. And it's handy when we don't have direct access to the LLM, i.e, when interacting with the LLM via UI or API.

2/6
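For concreteness, here's what a few-shot in-context learning prompt looks like; a toy sentiment example, with a hypothetical `generate()` placeholder standing in for whatever UI or API you're using:

```python
# Few-shot in-context learning: the "training data" lives in the prompt,
# and the model's weights are never updated.
examples = [
    ("The movie was a waste of time.", "negative"),
    ("Absolutely loved the soundtrack!", "positive"),
    ("The plot was predictable and dull.", "negative"),
]
query = "A surprisingly touching story with great acting."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

# `generate` is a placeholder for your LLM call (API endpoint, local model, ...).
# print(generate(prompt))
print(prompt)
```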
Apr 3, 2023 13 tweets 4 min read
Should we train our own large language model (LLM) on domain-specific data from scratch?
Researchers at Bloomberg did just that and shared a detailed technical report describing the dataset, model configuration, and training procedure.

1/9 In my experience, it makes total sense if we want to apply LLMs to novel data sources (e.g., protein amino acid sequences as ProtBERT demonstrated).
But how about adjacent data like finance articles? Let's take a look at "BloombergGPT" arxiv.org/abs/2303.17564

2/9
Apr 2, 2023 6 tweets 2 min read
Yesterday, I started discussing parameter-efficient finetuning methods for LLMs. Before I delve deeper into the topic and cover more methods in the next couple of days, I wanted to share a birds-eye view from the excellent "Scale Down to Scale Up" survey.

1/6 Prefix finetuning falls into the "soft prompt" category. In regular hard prompt tuning, we optimize the choice of input tokens to get the desired response.

2/6
Apr 1, 2023 4 tweets 1 min read
Finally, it’s done!! 🎉

Going the patent route instead of traditional peer review has been the best decision of my career so far.

Let me know if you have any questions about the process. Btw, I will release the code under the Apache 2.0 license. I am not planning to enforce the patent; I filed it to prevent big tech or patent trolls from enforcing it.
Apr 1, 2023 6 tweets 2 min read
Yesterday, I covered the 3 classic ways to finetune LLMs. Let's now delve into parameter-efficient finetuning techniques.

Parameter Efficient Finetuning Part I: let's start with Prefix Finetuning.

1/6 Image The intuition is that a proper context can steer the LLM towards performing a desired task without the need to update the LLM's parameters. We learn a set of tokens called a "prefix" that, when the model conditions on it, guides the model's output toward the desired behavior.

2/6
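A minimal sketch of that intuition (generic, not the exact method from the prefix-tuning paper, which injects prefixes into every layer's keys and values via a reparameterization MLP; this simpler variant is closer to soft-prompt tuning). The `base_model` placeholder is assumed to accept input embeddings directly:

```python
import torch
import torch.nn as nn

class PrefixedModel(nn.Module):
    """Frozen transformer + a small set of trainable prefix embeddings."""
    def __init__(self, base_model: nn.Module, embed_dim: int, prefix_len: int = 10):
        super().__init__()
        self.base = base_model
        for p in self.base.parameters():
            p.requires_grad = False                  # LLM stays frozen
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, input_embeds):                 # (batch, seq, embed_dim)
        batch = input_embeds.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # The model conditions on [prefix; input]; only the prefix is trained.
        return self.base(torch.cat([prefix, input_embeds], dim=1))
```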
Mar 31, 2023 5 tweets 2 min read
How can we adapt and finetune large language models from an efficiency standpoint?

Yesterday, I discussed the recent LLaMA-Adapters. Many of you were curious about how this approach compares to the alternatives (e.g., low-rank adaptation finetuning, prefix finetuning & others).

1/5 Image Before going into low-rank adaptation finetuning & prefix finetuning next, let's take a step back and briefly go over the three classic approaches (using a classification context as an example):

2/5 Image
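For reference, a rough PyTorch sketch of the three classic options (feature-based, finetuning only the output head, and full finetuning); the `backbone`/`head` names are just placeholders:

```python
import torch
import torch.nn as nn

# 1) Feature-based: run the frozen pretrained model once to extract embeddings,
#    then train any classifier (logistic regression, XGBoost, ...) on them.
@torch.no_grad()
def extract_features(backbone: nn.Module, dataloader):
    backbone.eval()
    return [backbone(x) for x, _ in dataloader]

# 2) Finetuning I: freeze the pretrained layers, train only the new output head.
def freeze_backbone(backbone: nn.Module, head: nn.Module) -> nn.Module:
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.Sequential(backbone, head)

# 3) Finetuning II: update all layers end-to-end (usually best, most expensive).
def full_finetune(backbone: nn.Module, head: nn.Module) -> nn.Module:
    model = nn.Sequential(backbone, head)
    for p in model.parameters():
        p.requires_grad = True
    return model
```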
Mar 30, 2023 5 tweets 3 min read
LLaMA-Adapter: finetuning large language models (LLMs) like LLaMA and matching Alpaca's modeling performance with greater finetuning efficiency

Let's have a look at this new paper (arxiv.org/abs/2303.16199) that proposes an adapter method for LLaMA instruction finetuning

1/5 In contrast to LLaMA-Alpaca, it doesn't finetune the whole model end-to-end. Instead, the adapter approach adds only 1.2M trainable parameters on top of a pretrained, frozen 7B LLaMA model (as shown in the figure above)

2/5
Mar 28, 2023 7 tweets 3 min read
Large language models (LLMs) are getting better and better, and it is difficult to say whether we are approaching the limit of what pure text LLMs are capable of.

Either way, thinking about the next step for LLMs is interesting. The next trend ...
1/ Image The next trend will likely be extending the capabilities with vision, other modalities, and multitask training.

Last week, I discussed PaLM, a decoder-style language model for generating text. It's a strong alternative to GPT. Let's now take a look at arxiv.org/abs/2303.03378
Mar 26, 2023 9 tweets 3 min read
How do we assess large language models (LLMs)?

Evaluating Large Language Models IV:
After discussing perplexity, BLEU, and ROUGE, we're moving on to slightly better approaches ... let's talk about BERTScore!

1/9 BERTScore can be used for translations and summaries, and it captures the semantic similarity better than traditional metrics like BLEU and ROUGE. In particular, it's more robust to paraphrasing.

2/9
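Conceptually, BERTScore greedily matches each token embedding in the candidate to its most similar token embedding in the reference (and vice versa) via cosine similarity. A stripped-down sketch of that matching step, ignoring IDF weighting and the actual BERT encoder (the random tensors just stand in for contextual embeddings):

```python
import torch
import torch.nn.functional as F

def bertscore_f1(cand_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """cand_emb: (n_cand, d), ref_emb: (n_ref, d) contextual token embeddings."""
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.T                        # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # each candidate token -> best reference match
    recall = sim.max(dim=0).values.mean()     # each reference token -> best candidate match
    return 2 * precision * recall / (precision + recall)

print(bertscore_f1(torch.randn(7, 768), torch.randn(9, 768)))
```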
Mar 25, 2023 9 tweets 2 min read
Understanding the shortcomings of large language models (LLMs) requires understanding the shortcomings of the underlying evaluation metrics.

Evaluating Large Language Models III:
After covering perplexity and BLEU, let's now discuss ROUGE.

Whereas BLEU is commonly used for translation tasks, ROUGE is a popular metric for scoring text summaries. Similar to BLEU, it's usually applied to n-grams, but for simplicity, we will focus on 1-grams (single words). There are quite a few similarities between BLEU and ROUGE.

2/8
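A toy ROUGE-1 computation on single words (no stemming, no ROUGE-L, just the recall/precision idea):

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())           # clipped 1-gram matches
    recall = overlap / max(sum(ref.values()), 1)   # ROUGE is recall-oriented
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```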
Mar 24, 2023 9 tweets 3 min read
Evaluating Large Language Models II: Today, we are covering BLEU.

It's used to evaluate almost all large language models capable of translation, including popular tools such as OpenAI's Whisper and GPT-3.

1/9 BLEU was originally developed to capture or automate the essence of human evaluation of translated text.
The original BLEU paper (cs.cmu.edu/~jeanoh/16-785…) found a high correlation with human evaluations, but this was later disproven.

2/9
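A toy version of the core BLEU ingredient, the modified (clipped) n-gram precision, here for 1-grams only and without the brevity penalty:

```python
from collections import Counter

def modified_precision(candidate: str, reference: str, n: int = 1) -> float:
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip each candidate n-gram count by its count in the reference, so
    # repeating a matching word cannot inflate the score.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

print(modified_precision("the the the the", "the cat is on the mat"))  # 0.5, not 1.0
```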
Mar 23, 2023 8 tweets 2 min read
Yes, yes, large language models are everywhere! But how do we evaluate the quality of their generated text?

There are intrinsic metrics, such as perplexity, and extrinsic ones such as BLEU & ROUGE.

Let's start with the overview & perplexity.

1/6
Perplexity is closely related to the cross entropy that is directly minimized during training (intrinsic). BLEU and ROUGE are more akin to classification accuracy (extrinsic), or rather precision & recall.

2/6
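Concretely, perplexity is just the exponentiated per-token cross-entropy; a quick sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

# logits: (batch, seq_len, vocab_size) from the language model,
# targets: (batch, seq_len) next-token IDs.
logits = torch.randn(2, 10, 50_000)
targets = torch.randint(0, 50_000, (2, 10))

ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # mean per-token loss
perplexity = torch.exp(ce)
print(ce.item(), perplexity.item())  # random logits -> perplexity on the order of the vocab size
```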
Mar 20, 2023 10 tweets 5 min read
PaLM is a really interesting decoder-style language model that I initially kind of ignored when it was published last year: arxiv.org/abs/2204.02311

Turns out PaLM has 7 interesting architecture improvements over GPT.

1/9
1) Multi-query attention: Different from multi-head attention, the key/value projections are shared across all heads. (Same training time, but faster autoregressive decoding during inference.)
(Ref: arxiv.org/abs/1911.02150)

2/9
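A minimal sketch of the difference, with toy dimensions and the scaled-dot-product details simplified:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads

# Multi-head attention: every head gets its own K and V projection.
q_proj = nn.Linear(d_model, n_heads * d_head)
k_proj_mha = nn.Linear(d_model, n_heads * d_head)
v_proj_mha = nn.Linear(d_model, n_heads * d_head)

# Multi-query attention: one shared K and V projection for all heads,
# which shrinks the KV cache and speeds up autoregressive decoding.
k_proj_mqa = nn.Linear(d_model, d_head)
v_proj_mqa = nn.Linear(d_model, d_head)

x = torch.randn(1, 16, d_model)                 # (batch, seq, d_model)
q = q_proj(x).view(1, 16, n_heads, d_head)      # per-head queries
k = k_proj_mqa(x).unsqueeze(2)                  # (1, 16, 1, d_head), broadcast over heads
scores = torch.einsum("bqhd,bkhd->bhqk", q, k.expand(-1, -1, n_heads, -1))
print(scores.shape)  # (1, 8, 16, 16): all 8 heads share the same keys
```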
Mar 19, 2023 6 tweets 2 min read
Such an eventful week, and I am just catching up with Alpaca, which deserves a big shoutout.

Alpaca is an instruction-finetuned 7B language transformer based on the 7B LLaMA model, Meta's GPT-3 alternative released a few weeks ago.
crfm.stanford.edu/2023/03/13/alp…

1/6 Instead of using human-generated instruction-output pairs, they obtain the data by querying the GPT-3-based text-davinci-003 model. So, Alpaca essentially uses a form of weakly supervised or knowledge-distillation-flavored finetuning.*

2/6