🚨 Cerebras Inference is now 3x faster:
Llama3.1-70B just broke 2,100 tokens/s
- 16x faster than the fastest GPU solution
- 8x faster than GPUs running Llama *3B*
- It's like the perf of a new hardware generation in a single software release
Available now at inference.cerebras.ai
We broke all records when we launched Cerebras Inference in August. Today we are tripling our performance from 650 t/s to 2100 t/s.
Cerebras Inference speed is in a league of its own – 16x faster than the fastest GPU solution, 68x faster than hyperscale clouds, and 4-8x faster than other AI accelerators.
Time to first token is critical for real time applications. Cerebras is among the fastest in first token latency, showing the advantage of wafer scale integration vs. complex networked solutions.
Introducing Cerebras Inference
‣ Llama3.1-70B at 450 tokens/s – 20x faster than GPUs
‣ 60c per M tokens – a fifth the price of hyperscalers
‣ Full 16-bit precision for full model accuracy
‣ Generous rate limits for devs
Try now: inference.cerebras.ai
Cerebras Inference is the fastest Llama3.1 inference API by far: 1,800 tokens/s for 8B and 450tokens/s for 70B. We are ~20x faster than NVIDA GPUs and ~2x faster than Groq.
Going from 90 tokens/s to 1,800 tokens/s is like going from dialup to broadband. It makes AI instant:
Introducing BTLM-3B-8K: an open, state-of-the art 3B parameter model with 7B level performance. When quantized, it fits in as little as 3GB of memory 🤯. It runs on iPhone, Google Pixel, even Raspberry Pi. BTLM goes live on Bittensor later this week! 🧵👇
https://t.co/7aKLkeUeUIbuff.ly/3Q5dtY5
Today's popular models can run on a powerful PC but don't fit in popular mobile devices. In May @Opentensor challenged us to build a SoTA model that runs on any device and supports long context. Thus was born BTLM - a 3B model with 7B performance and 8K context length!
@opentensor BTLM sets a new standard in 3B performance. Thanks to its high quality training data (SlimPajama-627B), it outperforms 3B models trained on almost 2x the data.
It’s also the first model trained on the Condor Galaxy 1 AI supercomputer thanks to the support of G42 Cloud & IIAI!
📣 Today we are announcing Condor Galaxy-1: a 4 exaflop AI supercomputer built in partnership with @G42ai. Powered by 64 Cerebras CS-2 systems, 54M cores, and 82TB of memory – it's the largest AI supercomputer we've ever built. But that's not all: CG-1 is just the start..
G42 Cloud is the largest public cloud provider of the UAE. To expand its AI offering, we are planning not one but *nine* AI supercomputers. When complete in 2024, the full Condor Galaxy system will have 9 instances, 576 CS-2s, for a total of 36 exaFLOPs of AI compute. 🤯🤯
How does 36 exaFLOPs compare to other AI supercomputers? It's 4x the performance of Google's latest TPU Pod v4 and 9x the performance of Nvidia's yet complete Israel-1.
📣 New dataset drop!
Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵cerebras.net/blog/slimpajam…
RedPajama-1T is the largest open dataset today but contains a large percentage of duplicates, making a full training run costly and inefficient. Like the Falcon team, we found data quality is just as important as quantity – which led to SlimPajama. huggingface.co/datasets/cereb…
SlimPajama cleans and deduplicates RedPajama-1T, reducing the total token count and file size by 50%. It's half the size and trains twice as fast! It’s the highest quality dataset when training to 600B tokens and when upsampled performs equal or better than RedPajama.
🎉 Exciting news! Today we are releasing Cerebras-GPT, a family of 7 GPT models from 111M to 13B parameters trained using the Chinchilla formula. These are the highest accuracy models for a compute budget and are available today open-source! (1/5)
The AI industry is becoming increasingly closed. We believe in fostering open access to the most advanced models. Cerebras-GPT is being released under the Apache 2.0 license, allowing royalty-free use for research or commercial applications. (2/5)
One notable output of Cerebras-GPT is a new scaling law that predicts model performance for a given compute budget. This is the first scaling law derived using a public dataset. (3/5)