Sebastian Raschka
Feb 25, 2023 · 8 tweets · 3 min read
Takeaways from reading the "LLaMA: Open and Efficient Foundation Language Models" paper that made big waves yesterday.
It's a laudable open-source effort making large language models available* for research purposes, but there are also some missed research opportunities here.
1/8
Inspired by Chinchilla's scaling laws paper, the LLaMA paper proposes a set of "small" large language models, trained on only public data, that outperform GPT-3 (175B) with >10x fewer parameters (13B). And there's a larger 65B version that outperforms PaLM-540B.
2/8
The LLaMA models are a welcome alternative to previous open-source models like OPT and BLOOM, which are both said to underperform GPT-3.
What are some of the methods they used to achieve this performance?
3/8
They reference Pre-normalization, SwiGLU activations, and Rotary Embeddings as techniques to improve the performance of the LLaMA models. Since these are research models, I would have loved to see ablation studies -- I feel like this is a missed opportunity.
4/8
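For readers who haven't seen these components before, here is a minimal PyTorch sketch of what RMSNorm-style pre-normalization and a SwiGLU feed-forward layer typically look like in LLaMA-style blocks. The shapes and details are illustrative assumptions on my part, not code from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Pre-normalization: applied to the *input* of each sub-layer
    # (instead of to the output, as in the original post-norm Transformer).
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    # SwiGLU: a gated feed-forward layer, SiLU(x W1) * (x W3), projected back via W2.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```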
Moreover, the plots show a steep negative slope for the training loss versus the number of training tokens. I cannot help but wonder what would happen if they trained the model for more than 1-2 epochs.
5/8
Also, from reading the research paper, it's not clear what the architecture looks like exactly. The referenced "Attention Is All You Need" architecture is an encoder-decoder architecture, whereas GPT-3, which they compare themselves to, is a decoder-only architecture.
6/8
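To make the distinction concrete: a decoder-only model consists only of causally masked self-attention blocks, whereas an encoder-decoder model additionally has an unmasked encoder stack plus cross-attention layers. A tiny illustrative sketch of the causal mask (my own example, not from the paper):

```python
import torch

seq_len = 5
# Causal mask used in decoder-only (GPT-style) models:
# token i may only attend to tokens 0..i.
# An encoder-decoder model ("Attention Is All You Need") would additionally
# contain an unmasked encoder and cross-attention layers.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
```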
*Now let's get to the asterisk of the first tweet. The model repo is available under a GNU GPL v3.0 license on GitHub here: github.com/facebookresear….
It contains the code only. The weights are available upon filing a request form.
7/8
While I think it's fair (no pun intended), it should be mentioned that this comes with a pretty hefty restriction:
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose."
8/8

• • •

More from @rasbt

Jul 1
What's noteworthy in the newly released Gemma 2 LLMs?
The main theme is that they explore techniques w/o necessarily increasing dataset sizes, focusing instead on developing relatively small & efficient LLMs.
There are 3 main design choices behind the 2B & 9B models: Image
1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window that allows the current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.
2) Grouped-query attention (like in Llama 2 and 3): This can be regarded as a more generalized form of multi-query attention. The motivation is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements; see the sketch after this list.
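For illustration, here is a rough sketch of the key/value-head sharing behind grouped-query attention; the head counts are made up, and this is not Gemma 2's actual implementation:

```python
import torch

batch, seq, n_q_heads, n_kv_heads, head_dim = 1, 6, 8, 2, 16
group_size = n_q_heads // n_kv_heads  # 4 query heads share each K/V head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # fewer K/V heads -> fewer params
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand the K/V heads so each group of query heads reuses the same K/V tensors.
k = k.repeat_interleave(group_size, dim=1)  # (1, 8, 6, 16)
v = v.repeat_interleave(group_size, dim=1)

# Standard scaled dot-product attention (causal/sliding-window masking omitted).
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 6, 16])
```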
Jun 24
This weekend I read about a fascinating hack for generating a high-quality dataset for LLM instruction finetuning. It's a fully automated approach that doesn't require any seed questions and even runs locally. How does it work? Image
Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you. Image
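Roughly, that two-step trick looks like the sketch below, using Hugging Face transformers and the standard Llama 3 chat template; this is my own approximation, and the exact pre-query template and sampling settings in the paper may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Step 1: feed only the "pre-query" part of the chat template;
# the model then autocompletes it with a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: feed the generated instruction back as a regular user message
# to obtain the corresponding response.
chat = [{"role": "user", "content": instruction}]
ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=256)
response = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
print(instruction, response, sep="\n---\n")
```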
What's fascinating is that with the resulting instruction dataset, you can finetune a Llama 3 8B base model with just instruction finetuning, no preference finetuning via RLHF and DPO, and it still beats the original Llama 3 8B Instruct model by Meta AI.
For more information, see the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper: arxiv.org/abs/2406.08464
May 25
It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning comes out. In "MoRA: High-Rank Updating for Parameter-Efficient Finetuning" (arxiv.org/abs/2405.12130), the authors take a related yet opposite approach to low-rank adaptation.
1/7
Image
Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity. 2/7 Image
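As a quick refresher on what the "low-rank" in LoRA means, here is a generic sketch (illustrative rank and scaling, not the MoRA method and not any particular library's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight W plus a trainable low-rank update B @ A;
    # the update to W therefore has rank at most r (here r=8).
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad = False  # only A and B are trained
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

This rank-r bottleneck is exactly what MoRA pushes back on: the authors explore higher-rank updates with a comparable trainable-parameter budget.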
So, in the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for both instruction finetuning AND absorbing new knowledge in continued pretraining.
3/7
Jul 9, 2023
In the last couple of days, we talked a lot about extending the context window of transformer LLMs. Here's one more: "Extending Context Window of Large Language Models via Positional Interpolation"

1/3
Rotary positional embeddings (aka RoPE) have been a cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.

2/3
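The core trick: instead of extrapolating RoPE to positions it never saw during pretraining, the position indices are rescaled (interpolated) so that the longer sequence maps back into the original training range. A rough sketch with illustrative numbers (not the authors' code):

```python
import torch

train_ctx = 2048    # context length used during pretraining (illustrative)
target_ctx = 32768  # desired extended context length
scale = train_ctx / target_ctx

positions = torch.arange(target_ctx, dtype=torch.float32)
# Position Interpolation: squeeze the new, longer position range
# back into [0, train_ctx) before computing the rotary angles.
interpolated_positions = positions * scale

head_dim = 128  # per-head dimension (illustrative)
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(interpolated_positions, inv_freq)  # -> cos/sin tables for RoPE
```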
This sounds a bit more humble than the recent 1M or 1B context-length claims.

But it requires only minimal finetuning (1,000 steps). And it allows long-document summarization using LLaMA 7B and 65B models, for example.

Link to the paper here: arxiv.org/abs/2306.15595

3/3
Jun 5, 2023
I often like to share AI research articles I find interesting. Let me change it up a bit and share one of our own articles that just passed peer review & got accepted.

Our method allows you to use any model (LLM, CNN, ViT, ...) for ordinal regression on ordered labels.

1/6 Image
In short, it works by using a new loss function during training such that each node in the output layer represents a probability that the target label exceeds a given rank. These probabilities can then be converted into the predicted rank label.

2/6 Image
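To make the conversion step concrete, here is a small sketch of how per-node exceedance probabilities can be turned into rank labels; the numbers are made up, and this is not necessarily the exact implementation from the paper:

```python
import torch

# Logits for 3 samples and 4 rank thresholds (i.e., 5 ordered classes).
logits = torch.tensor([[ 2.1,  1.3, -0.4, -2.0],
                       [ 3.0,  2.2,  1.1,  0.3],
                       [-1.5, -2.0, -3.1, -4.0]])

# Each node models P(y > r_k | y > r_{k-1}); chaining them with a cumulative
# product gives P(y > r_k), and counting probabilities > 0.5 yields the rank.
cond_probas = torch.sigmoid(logits)
probas = torch.cumprod(cond_probas, dim=1)
predicted_rank = (probas > 0.5).sum(dim=1)
print(predicted_rank)  # tensor([2, 3, 0])
```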
This CORN (Conditional Ordinal Regression for Neural Networks) loss works with any model that is typically trained with a cross-entropy loss. But yeah, on ordered labels it can perform much better than standard cross entropy.

3/6 Image
May 30, 2023
What happens if we train LLMs for multiple epochs?

The question I asked multiple times in the past finally got answered in this new preprint,
"To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis".

1/6 Image
First, why would we want to consider training LLMs for multiple epochs given the enormous amounts of data? Turns out that high-quality text on the internet grows more slowly than required. Also, if copyrighted material has to be removed in the future, the available data might shrink further.

2/6 Image
So, why not train for multiple epochs on existing data?

The result is that training for multiple epochs leads to overfitting: it gets more severe the larger the model and the smaller the dataset -- this is consistent with common deep learning experiences.

3/6 Image
