Releasing LLongMA-2 13b, a Llama-2 model, trained at 8k context length using linear positional interpolation scaling. The model was trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1.
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 13b model through fine-tuning. The model passes all our evaluations and maintains the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
The model has identical performance to Llama-2 under 4k context length, scales directly to 8k, and works out-of-the-box with the new version of transformers (4.31), or with `trust_remote_code` for versions <= 4.30.
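For reference, loading roughly looks like this (a minimal sketch; the exact repo id is my assumption, double-check the Hugging Face page):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id is an assumption based on this thread; check the Hugging Face link.
model_id = "conceptofmind/LLongMA-2-13b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# With transformers >= 4.31 the scaled rotary embeddings are handled natively.
# On transformers <= 4.30, pass trust_remote_code=True so the modeling code
# bundled with the checkpoint is used instead.
model = AutoModelForCausalLM.from_pretrained(model_id)
```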
Applying the method to the rotary position embedding requires only a slight change to the model's code: dividing the positional index, t, by a scaling factor.
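The change amounts to roughly this (a minimal sketch of linear positional interpolation, not the exact code from the repo; names and defaults are mine):

```python
import torch

def scaled_rope_angles(seq_len, dim, base=10000.0, scale=8192 / 4096):
    # Standard RoPE inverse frequencies, one per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # Linear positional interpolation: divide the position index t by the
    # scaling factor (here 2, for extending a 4k-pretrained model to 8k) so
    # the rotation angles stay within the range seen during pretraining.
    t = torch.arange(seq_len, dtype=torch.float32) / scale
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()
```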
The repository containing @theemozilla’s implementation of scaled rotary embeddings can be found here: github.com/jquesnelle/sca…
If you would like to learn more about scaling rotary embeddings, I would strongly recommend reading @kaiokendev1's blog posts on his findings: kaiokendev.github.io
A PR adding scaled rotary embeddings to @huggingface transformers was submitted by @joao_gante and has been merged: github.com/huggingface/tr…
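With that merged, you can apply linear scaling to a stock Llama-2 checkpoint via the config (a sketch using the `rope_scaling` format documented for transformers 4.31; note the scaling alone does not substitute for the fine-tuning described above):

```python
from transformers import AutoModelForCausalLM

# rope_scaling is a dict with the strategy and the factor (2.0 here, i.e. 4k -> 8k).
# The base model id is assumed; it is the gated Meta release on the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    rope_scaling={"type": "linear", "factor": 2.0},
)
```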
The model was further trained for ~1 billion tokens on @togethercompute's Red Pajama dataset. The context length of the examples varies: huggingface.co/datasets/toget…
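If you want to poke at the data yourself, something like this works (the dataset id is my assumption, since RedPajama is published in several variants; the sample split keeps the download small):

```python
from datasets import load_dataset

# Small sample of the RedPajama corpus used for the continued pretraining.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(len(ds[0]["text"]))  # document lengths vary, hence the mixed context lengths
```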
I would also recommend checking out the phenomenal research by @OfirPress on ALiBi which laid the foundation for many of these scaling techniques: arxiv.org/abs/2108.12409
It is also worth reviewing the paper A Length-Extrapolatable Transformer and its xPos technique, which also applies scaling to rotary embeddings: arxiv.org/pdf/2212.10554…
We previously trained the first publicly available model with rotary embedding scaling here:
The compute for this model release is all thanks to the generous sponsorship by @carperai, @EMostaque, and @StabilityAI. This is not an official @StabilityAI product.
A big thank you to @AiEleuther for facilitating the discussions about context-length extrapolation as well. Truly an awesome open-source team and community.
If you have any questions about the data or model, be sure to reach out and ask! I will try to respond promptly.
The previous suite of LLongMA model releases can be found here:
Releasing LLongMA-2, a suite of Llama-2 models, trained at 8k context length using linear positional interpolation scaling. The models were trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1. huggingface.co/conceptofmind/…
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 7b model through fine-tuning. The models pass all our evaluations and maintain the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
The models have similar performance to Llama-2 under 4k context length, scale directly to 8k, and work out-of-the-box with the new version of transformers (4.31), or with `trust_remote_code` for versions <= 4.30.
I worked with @ShayneRedford, the main author of the FLAN Collection, to recreate his great work and publicly release high-quality instruction-tuning data. We fixed encoding issues and also increased the sequence length to 4096.
Introducing three new open-source PaLM models trained at a context length of 8k on C4. Open-sourcing LLMs is a necessity for the fair and equitable democratization of AI. The models, with sizes of 150m, 410m, and 1b parameters, are available to download and use here: github.com/conceptofmind/…
The models are also compatible with many of lucidrains' popular repositories, such as toolformer-pytorch, PaLM-rlhf-pytorch, and PaLM-pytorch. Please be sure to sponsor and help support Phil's great work: github.com/lucidrains/PaL…
Our work on Toolformer, PaLM, and related projects is all thanks to the generous sponsorship by @carperai and @StabilityAI.