Takeaways from reading the "LLaMA: Open and Efficient Foundation Language Models" paper that made big waves yesterday.
It's a laudable open-source effort making large language models available* for research purposes, but there are also some missed research opportunities here 1/8
Inspired by Chinchilla's scaling laws paper, the LLaMA paper proposes a set of "small" large language models, trained on only public data, that outperform GPT-3 (175B) with >10x fewer parameters (13B). And there's a larger 65B version that outperforms PaLM-540B. 2/8
The LLaMA models are a welcome alternative to previous open-source models like OPT and BLOOM, which are both said to underperform GPT-3.
What are some of the methods they used to achieve this performance?
3/8
They reference Pre-normalization, SwiGLU activations, and Rotary Embeddings as techniques to improve the performance of the LLaMA models. Since these are research models, I would have loved to see ablation studies -- I feel like this is a missed opportunity. 4/8
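To make two of these concrete, here's a minimal PyTorch sketch of a pre-normalized transformer block with a SwiGLU MLP. This is illustrative only, not the actual LLaMA code: rotary embeddings are omitted, and LayerNorm stands in for LLaMA's RMSNorm to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x W_gate) elementwise-multiplied with (x W_up)
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    def __init__(self, dim, hidden_dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # LLaMA uses RMSNorm; LayerNorm for brevity
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = SwiGLU(dim, hidden_dim)

    def forward(self, x):
        # Pre-normalization: normalize *before* each sub-layer, then add the residual
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

block = PreNormBlock(dim=512, hidden_dim=1376, num_heads=8)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```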
Moreover, the plots show a steep negative slope for the training loss versus the number of training tokens. I cannot help but wonder what would happen if they trained the model for more than 1-2 epochs. 5/8
Also, from reading the paper, it's not clear what the architecture looks like exactly. The referenced "Attention Is All You Need" architecture is an encoder-decoder architecture, while GPT-3, which they compare against, is a decoder-only architecture. 6/8
*Now let's get to the asterisk of the first tweet. The model repo is available under a GNU GPL v3.0 license on GitHub here: github.com/facebookresear….
It contains the code only. The weights are available upon filing a request form.
7/8
While I think it's fair (no pun intended), it should be mentioned that this comes with a pretty hefty restriction:
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose." 8/8
What's noteworthy in the newly released Gemma 2 LLMs?
The main theme is that they explore techniques without necessarily increasing the dataset sizes, focusing instead on developing relatively small & efficient LLMs.
There are 3 main design choices to create the 2B & 9B models:
1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention block that allows a current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.
2) Grouped-query attention (like in Llama 2 and 3): This can be regarded as a more generalized form of multi-query attention. The motivation behind it is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements. A minimal code sketch of both ideas follows below.
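Here's a toy PyTorch sketch of both ideas. The shapes, window size, and head counts are made up for illustration; this is not Gemma 2's actual implementation.

```python
import torch

batch, seq_len, num_q_heads, num_kv_heads, head_dim = 1, 8, 8, 2, 16
window = 4  # sliding-window size (illustrative)

# 1) Sliding window attention: a causal mask where each query token only
#    attends to itself and the previous (window - 1) tokens.
i = torch.arange(seq_len).unsqueeze(1)   # query positions
j = torch.arange(seq_len).unsqueeze(0)   # key positions
mask = (j <= i) & (i - j < window)       # True = attention allowed

# 2) Grouped-query attention: fewer K/V heads than Q heads; each K/V head is
#    shared by a group of query heads, shrinking the K/V projections and cache.
q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

group_size = num_q_heads // num_kv_heads      # 4 query heads per K/V head
k = k.repeat_interleave(group_size, dim=1)    # expand K/V to match the Q heads
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
scores = scores.masked_fill(~mask, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 8, 8, 16])
```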
I read about a fascinating hack to generate a high-quality dataset for LLM instruction finetuning this weekend. It's a fully automated way that doesn't require any seed questions and even runs locally. How does it work?
Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you.
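Here's a minimal sketch of that loop, assuming Hugging Face transformers and access to the gated meta-llama/Meta-Llama-3-8B-Instruct weights. The pre-query template below is my reconstruction of the Llama 3 chat format, not necessarily the exact string used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Step 1: feed only an empty user-turn header (the "pre-query template");
# the model then "completes" it with a plausible instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
gen = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tokenizer.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: feed the generated instruction back as a regular user message
# to get the paired response.
messages = [{"role": "user", "content": instruction}]
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
gen = model.generate(prompt_ids, max_new_tokens=256, do_sample=True)
response = tokenizer.decode(gen[0][prompt_ids.shape[1]:], skip_special_tokens=True)

print({"instruction": instruction, "output": response})
```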
What's fascinating is that with the resulting instruction dataset, you can finetune a Llama 3 8B base model with just instruction finetuning, no preference finetuning via RLHF and DPO, and it still beats the original Llama 3 8B Instruct model by Meta AI.
For more information, see the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper: arxiv.org/abs/2406.08464
It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning comes out. In "MoRA: High-Rank Updating for Parameter-Efficient Finetuning" (arxiv.org/abs/2405.12130), the authors take a related yet opposite approach to low-rank adaptation. 1/7
Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity. 2/7
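For contrast, here's a tiny sketch of the plain LoRA update (not MoRA itself); the point is that the update B @ A can never exceed rank r, no matter how long you train. Shapes and rank are made-up example values.

```python
import torch

d_out, d_in, r = 512, 512, 8

W = torch.randn(d_out, d_in)        # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01     # trainable low-rank factor
B = torch.zeros(d_out, r)           # trainable factor, initialized to zero

delta_W = B @ A                     # rank(delta_W) <= r = 8
x = torch.randn(1, d_in)
y = x @ (W + delta_W).T             # adapted forward pass
print(y.shape)                      # torch.Size([1, 512])
```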
So, in the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for both instruction finetuning AND absorbing new knowledge in continued pretraining.
3/7
In the last couple of days, we talked a lot about extending the context window of transformer LLMs. Here's one more: "Extending Context Window of Large Language Models via Positional Interpolation"
1/3
Rotary positional embeddings (aka RoPE) have become a cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.
2/3
This sounds a bit more humble than the recent 1M or 1B context-length claims.
But this requires only minimal (1000 steps) finetuning. And it allows long document summarization using LLaMA 7B and 65B models, for example.
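A minimal sketch of the core idea (variable names are mine, not the paper's): instead of extrapolating RoPE to position indices the model never saw, rescale the positions so the longer context maps back into the originally trained range.

```python
import torch

def rope_angles(positions, head_dim, base=10000.0):
    # Standard RoPE rotation angles: one frequency per pair of dimensions.
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(positions, inv_freq)          # (seq_len, head_dim // 2)

train_ctx, target_ctx = 2048, 8192
scale = train_ctx / target_ctx                       # 0.25

positions = torch.arange(target_ctx).float()
angles_extrapolated = rope_angles(positions, head_dim=128)          # positions 0..8191, unseen range
angles_interpolated = rope_angles(positions * scale, head_dim=128)  # squeezed back into [0, 2048)
```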
I often like to share AI research articles I find interesting. Let me change it up a bit and share one of our own articles that just passed peer review & got accepted.
Our method allows you to use any model (LLM, CNN, ViT, ...) for ordinal regression on ordered labels.
1/6
In short, it works by using a new loss function during training, such that each node in the output layer represents a probability (whether the target label exceeds a given rank). This can then be converted into the predicted rank label.
2/6
This CORN (Conditional Ordinal Regression for Neural Networks) loss works with any model that is typically trained with a cross-entropy loss. But yeah, on ordered labels it can perform much better than standard cross-entropy.
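A tiny illustrative sketch of the inference step, i.e., turning the per-node probabilities into a rank label. The example logits and the 0.5 threshold are my assumptions for illustration, not the exact reference implementation.

```python
import torch

# One example with 5 ordered labels -> 4 output nodes.
logits = torch.tensor([[3.2, 1.1, -0.4, -2.5]])
probas = torch.sigmoid(logits)             # conditional P(label > rank k | label > rank k-1)
probas = torch.cumprod(probas, dim=1)      # chain rule -> unconditional P(label > rank k)
predicted_rank = (probas > 0.5).sum(dim=1) # count how many rank thresholds are exceeded
print(predicted_rank)                      # tensor([2])
```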
What happens if we train LLMs for multiple epochs?
The question I asked multiple times in the past finally got answered in this new preprint,
"To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis".
1/6
First, why would we even consider training LLMs for multiple epochs, given the enormous amounts of data? It turns out that high-quality text data on the internet grows more slowly than required. Also, if copyrighted material is removed in the future, the available data might shrink even further.
2/6
So, why not train for multiple epochs on the existing data?
The result is that training for multiple epochs leads to overfitting: it gets more severe the larger the model and the smaller the dataset -- this is consistent with common deep learning experiences.