We're excited to partner with @Cognition_Labs @Mercor_AI @CoreWeave and @AnthropicAI to host an inference-time compute hackathon, featuring >$60K in cash prizes and >1 exaflop of free compute.
Each accepted team gets a free 8xH100 node, exclusive access to Cognition's API, and Anthropic credits for 24 hours.
The winner will be chosen for the best application of inference-time compute, as judged by a panel of researchers from major AI projects.
Oct 31, 2024 • 7 tweets • 3 min read
Introducing Oasis: the first playable AI-generated game.
We partnered with @DecartAI to build a real-time, interactive world model that runs >10x faster on Sohu. We're open-sourcing the model architecture, weights, and research.
Here's how it works (and a demo you can play!):
Oasis generates frames based on your keyboard inputs. You can move and jump around, break blocks, and build and explore a brand new map every game.
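The loop described above, where each keyboard input conditions the next generated frame, can be sketched as follows. This is a minimal toy illustration of an action-conditioned world-model loop, not Oasis's actual architecture or API; the `WorldModel` class and its `next_frame` method are hypothetical stand-ins.

```python
import numpy as np

class WorldModel:
    """Hypothetical stand-in for an action-conditioned frame model.

    A real world model would run a neural network here; this toy version
    just derives a deterministic pseudo-random frame from the previous
    frame and the player's action, to show the shape of the loop.
    """

    def next_frame(self, frame: np.ndarray, action: str) -> np.ndarray:
        seed = abs(hash((frame.tobytes(), action))) % 2**32
        rng = np.random.default_rng(seed)
        return rng.integers(0, 256, size=frame.shape, dtype=np.uint8)

model = WorldModel()
frame = np.zeros((360, 640, 3), dtype=np.uint8)  # initial blank frame

# Each input conditions the next frame, so every playthrough
# generates a brand-new sequence of frames (a brand-new map).
for action in ["forward", "jump", "break_block", "place_block"]:
    frame = model.next_frame(frame, action)

print(frame.shape)  # (360, 640, 3)
```

The key property the sketch captures is autoregression over (frame, action) pairs: there is no pre-built map, only frames generated on the fly in response to input.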
Jun 25, 2024 • 6 tweets • 3 min read
Meet Sohu, the fastest AI chip of all time.
With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s.
Sohu is the first specialized chip (ASIC) for transformer models. By specializing, we get far more performance at the cost of generality: Sohu can’t run CNNs, LSTMs, SSMs, or any other non-transformer model.
Today, every major AI product (ChatGPT, Claude, Gemini, Sora) is powered by transformers. Within a few years, every large AI model will run on custom chips.
Here’s why specialized chips are inevitable:
Sohu is >10x faster and cheaper than even NVIDIA’s next-generation Blackwell (B200) GPUs.
One Sohu server runs over 500,000 Llama 70B tokens per second, 20x more than an H100 server (23,000 tokens/sec), and 10x more than a B200 server (~45,000 tokens/sec).
Benchmarks were run in FP8 without sparsity, at 8x model parallelism, with 2048-token input / 128-token output lengths. The 8xH100 figures are from TensorRT-LLM 0.10.08 (latest version); the 8xB200 figures are estimated. This is the same benchmark NVIDIA and AMD use.
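The speedup multiples quoted above follow directly from the per-server throughput figures in the thread. A quick back-of-the-envelope check, using only numbers stated above:

```python
# Llama 70B tokens/sec per 8-chip server, as quoted in the thread.
sohu = 500_000
h100 = 23_000
b200 = 45_000  # estimated figure, per the thread

print(f"Sohu vs 8xH100: {sohu / h100:.1f}x")  # ~21.7x, quoted as "20x"
print(f"Sohu vs 8xB200: {sohu / b200:.1f}x")  # ~11.1x, quoted as "10x"

# "One 8xSohu server replaces 160 H100s": 21.7 server-equivalents
# times 8 H100s per server is ~174 chips, which the thread rounds
# down to 160.
print(f"H100 chips replaced: {sohu / h100 * 8:.0f}")
```

The ratios round to the 20x and 10x multiples claimed, and the per-chip count lands near the quoted 160 H100s.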