This release includes StripedHyena-Hessian-7B (SH 7B), a base model, and StripedHyena-Nous-7B (SH-N 7B), a chat model. Both use a hybrid architecture based on our latest research on scaling laws of efficient architectures.
StripedHyena is the first alternative model competitive with the best open-source Transformers in short- and long-context evaluations. It achieves comparable performance to Llama-2, Yi, and Mistral 7B on the OpenLLM leaderboard, and outperforms them on long-context summarization.
On short-context tasks, including OpenLLM leaderboard tasks, StripedHyena outperforms Llama-2 7B, Yi 7B, and the strongest Transformer alternatives such as RWKV-Raven 14B.
StripedHyena is faster and more memory-efficient for long-sequence training, fine-tuning, and generation. Using our latest research on fast kernels for gated convolutions (FlashFFTConv) and on efficient Hyena inference, StripedHyena is >30%, >50%, and >100% faster in end-to-end training on progressively longer sequences.
StripedHyena is designed using our latest research on scaling laws of efficient architectures. In particular, StripedHyena is a hybrid of attention and gated convolutions arranged in Hyena operators. Via a compute-optimal scaling protocol, we identify several ways to improve on the baseline architectures.
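To make the hybrid concrete, here is a minimal sketch (not the released code) of a "striped" stack that interleaves attention blocks with gated-convolution (Hyena-style) blocks. The ToyHyenaOperator, layer count, and stripe spacing below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyHyenaOperator(nn.Module):
    """Toy stand-in for a Hyena operator: a causal depthwise long
    convolution with multiplicative gating (illustrative only)."""
    def __init__(self, dim, kernel_size=128):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)   # value and gate branches
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(v * torch.sigmoid(g))

def striped_stack(dim=1024, n_layers=32, attn_every=4):
    """Mostly gated-convolution blocks, with attention "stripes"
    inserted every few layers (spacing here is an assumption)."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            layers.append(nn.MultiheadAttention(dim, num_heads=8, batch_first=True))
        else:
            layers.append(ToyHyenaOperator(dim))
    return nn.ModuleList(layers)
```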
StripedHyena is optimized using a set of new model grafting techniques, enabling us to change the model architecture during training. We grafted architectural components of Transformers and Hyena, and trained on a mix of the RedPajama dataset, augmented with longer-context data.
One additional advantage of StripedHyena is a >50% reduced memory footprint during autoregressive generation, compared to a Transformer (both with grouped-query attention).
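A rough way to see where the savings come from (a back-of-the-envelope sketch with assumed shapes, not measured numbers): each attention layer's KV cache grows linearly with sequence length, while a gated-convolution block can be rolled into a fixed-size recurrent state, so replacing most attention layers shrinks the generation-time state.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V per attention layer, each of shape (seq_len, n_kv_heads, head_dim)
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def conv_state_bytes(n_conv_layers, state_dim, bytes_per_elem=2):
    # fixed-size recurrent state per gated-convolution layer, independent of seq_len
    return n_conv_layers * state_dim * bytes_per_elem

# Hypothetical 32-layer models at 32K context, fp16 (numbers are illustrative):
full_attn = kv_cache_bytes(32, 8, 128, 32_768)
hybrid    = kv_cache_bytes(8, 8, 128, 32_768) + conv_state_bytes(24, 8 * 128)
print(full_attn, hybrid)   # the hybrid's generation state is a fraction of the full cache
```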
This work would not have been possible without our collaborators @HazyResearch, @NousResearch, and @Hessian_AI.
It builds on our past work with @Mila_Quebec, @huggingface. We are grateful to open source AI community leaders including @AIatMeta, @AiEleuther, @MistralAI & others.
Announcing DeepCoder-14B – an o1 & o3-mini level coding reasoning model fully open-sourced!
We’re releasing everything: dataset, code, and training recipe.🔥
Built in collaboration with the @Agentica_ team.
See how we created it. 🧵
Training Technique
To scale reasoning without sacrificing the model’s long-context capability, we combine:
→ Iterative context lengthening
→ Overlong filtering (from DAPO)
We train DeepCoder-14B-Preview from 16K → 32K, then evaluate at 64K.
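A minimal sketch of the two ideas, under assumed names (`Rollout` and `max_len_schedule` are illustrative, not the released training code):

```python
from dataclasses import dataclass

@dataclass
class Rollout:                 # hypothetical container for one sampled solution
    truncated: bool            # generation hit the context cap before finishing
    reward: float              # pass/fail signal from the test harness

# Iterative context lengthening: raise the generation cap in stages during RL.
max_len_schedule = [16_384, 32_768]          # train at 16K, then 32K; evaluate at 64K

def filter_overlong(batch: list[Rollout]) -> list[Rollout]:
    """DAPO-style overlong filtering: drop truncated rollouts from the loss
    instead of penalizing them, so long reasoning is never punished for
    merely running out of context."""
    return [r for r in batch if not r.truncated]

for max_len in max_len_schedule:
    # sample rollouts capped at `max_len`, keep only the finished ones,
    # and run the policy update on the filtered batch
    ...
```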
Results on LiveCodeBench:
• 16K: 54%
• 32K: 58%
• 64K: 60.6% (despite never training at 64K)
Our model generalizes to 64K context despite never being trained on it, whereas the baseline (R1-Distill-14B) plateaus beyond its training window.
Dataset Curation
Scaling reasoning with RL requires verifiable rewards. Unlike math datasets, coding datasets found online tend to be much noisier, resulting in faulty reward signals during training.
To address this, we’ve implemented a rigorous data pipeline:
• Official solutions must pass all tests
• ≥6 test cases per problem
• Deduplication across train/test splits
This pipeline gives us 24K high-quality verified coding problems for RL training.
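Roughly, the filtering looks like the sketch below (field names and the `run_tests` helper are assumptions, not the released pipeline):

```python
def is_verified(problem, run_tests) -> bool:
    """Keep a problem only if it has at least 6 test cases and its official
    solution passes every one of them (so the RL reward is trustworthy)."""
    if len(problem["tests"]) < 6:
        return False
    return all(run_tests(problem["solution"], t) for t in problem["tests"])

def dedup(problems, eval_prompts):
    """Drop problems whose prompt also appears in any evaluation split,
    and remove duplicates within the training set itself."""
    seen = set(p.strip() for p in eval_prompts)
    kept = []
    for p in problems:
        key = p["prompt"].strip()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```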
Mixture of Agents (MoA) is a framework that leverages the collective strengths of multiple LLMs. Each layer contains multiple agents that refine responses using the outputs from the preceding layer.
Together MoA achieves a score of 65.1% on AlpacaEval 2.0. together.ai/blog/together-…
Together MoA exhibits promising performance on AlpacaEval 2.0 and MT-Bench.
Together MoA uses six open-source models as proposers and Qwen1.5-110B-Chat as the final aggregator, with three layers.
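As a rough illustration of the layered setup (the `chat()` helper and prompt wording are hypothetical, not the Together API):

```python
def moa_answer(prompt, proposer_models, aggregator_model, chat, n_layers=3):
    """Layered Mixture of Agents: each layer's proposers see the previous
    layer's answers; a final aggregator synthesizes the last layer's outputs."""
    answers = []                                  # previous layer's outputs (empty at layer 1)
    for _ in range(n_layers):
        new_answers = []
        for model in proposer_models:
            refs = "\n\n".join(f"Reference {i + 1}: {a}" for i, a in enumerate(answers))
            msg = f"{refs}\n\nQuestion: {prompt}" if refs else prompt
            new_answers.append(chat(model, msg))  # hypothetical chat-completion helper
        answers = new_answers
    refs = "\n\n".join(f"Candidate {i + 1}: {a}" for i, a in enumerate(answers))
    return chat(aggregator_model,
                f"Synthesize the best possible answer from the candidates.\n\n"
                f"{refs}\n\nQuestion: {prompt}")
```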
We also evaluate on FLASK, which offers more fine-grained evaluation; Together MoA outperforms the original models on most dimensions.
The first RedPajama models are here! The 3B and 7B models are now available under Apache 2.0 license, including instruction-tuned and chat versions!
This project demonstrates the power of the open-source AI community with many contributors ... 🧵 together.xyz/blog/redpajama…
Training ran on 3,072 V100 GPUs provided as part of the INCITE 2023 project on Scalable Foundation Models for Transferrable Generalist AI, awarded to MILA, LAION, and EleutherAI in fall 2022, with support from the Oak Ridge Leadership Computing Facility (OLCF) and INCITE program.
Announcing RedPajama — a project to create leading, fully open-source large language models, beginning with the release of a 1.2 trillion token dataset that follows the LLaMA recipe, available today! together.xyz/blog/redpajama
More in 🧵 …
In the coming weeks we will release a full suite of large language models and instruction tuned versions based on this dataset.
Announcing OpenChatKit v0.16! You can now run OpenChatKit on consumer GPUs with a new 7B parameter model fine-tuned on user feedback for improved quality. And it's fast!
Details in 🧵 ... twitter.com/i/web/status/1…
Updates include: 1. A new 7B-parameter, 8-bit quantized model, available for use on consumer GPUs: huggingface.co/togethercomput…
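For reference, a minimal sketch of loading an 8-bit checkpoint with Hugging Face transformers + bitsandbytes; the model id and prompt format below are placeholders, so substitute the repo from the link above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<openchatkit-7b-checkpoint>"   # placeholder: use the repo from the link above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",      # spread layers over the available GPU(s)
    load_in_8bit=True,      # 8-bit weights so the 7B model fits on a consumer GPU
)

# The prompt format below is an assumption for illustration.
inputs = tokenizer("<human>: Hello, how are you?\n<bot>:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```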
Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions.
OpenChatKit includes 4 key components:
First, an instruction-tuned large language model, fine-tuned for chat from EleutherAI’s GPT-NeoX-20B with over 43 million instructions on 100% carbon-negative compute, available under the Apache-2.0 license on @huggingface.