EXO Labs
Oct 15, 5 tweets
Clustering NVIDIA DGX Spark + M3 Ultra Mac Studio for 4x faster LLM inference.

DGX Spark: 128GB @ 273GB/s, 100 TFLOPS (fp16), $3,999
M3 Ultra: 256GB @ 819GB/s, 26 TFLOPS (fp16), $5,599

The DGX Spark has 1/3 the memory bandwidth of the M3 Ultra but roughly 4x the FLOPS.
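The two ratios can be checked directly from the spec lines above. This is only a small arithmetic sketch; the dictionary layout and the names `SPECS` and `ratio` are illustrative, not anything from EXO.

```python
# Spec figures as quoted in the thread (peak numbers, not measured throughput).
SPECS = {
    "DGX Spark": {"mem_gb": 128, "bw_gb_s": 273, "tflops_fp16": 100, "usd": 3999},
    "M3 Ultra": {"mem_gb": 256, "bw_gb_s": 819, "tflops_fp16": 26, "usd": 5599},
}


def ratio(a: str, b: str, key: str) -> float:
    """Return how many times larger device a's value is than device b's."""
    return SPECS[a][key] / SPECS[b][key]


# Memory bandwidth: M3 Ultra relative to DGX Spark.
bandwidth_ratio = ratio("M3 Ultra", "DGX Spark", "bw_gb_s")
# Compute: DGX Spark relative to M3 Ultra.
compute_ratio = ratio("DGX Spark", "M3 Ultra", "tflops_fp16")

print(f"M3 Ultra bandwidth advantage: {bandwidth_ratio:.2f}x")
print(f"DGX Spark compute advantage:  {compute_ratio:.2f}x")
```

The bandwidth ratio comes out at exactly 3x and the compute ratio at a little under 4x, which is where the rounded "3x" and "4x" figures come from.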

By running compute-bound prefill on the DGX Spark and memory-bound decode on the M3 Ultra, and streaming the KV cache between them over 10GbE, we get the best of both machines, with large speedups.

Short explanation in this thread & link to full blog post below.

LLM inference consists of two stages: prefill and decode.

Prefill processes the prompt, building a KV cache. It's compute-bound, so it gets faster with more FLOPS.

Decode reads the KV cache and generates tokens one by one. It's memory-bound, so it gets faster with more memory bandwidth.
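A back-of-envelope way to see why the two stages scale differently: prefill cost is dominated by arithmetic (about 2 FLOPs per parameter per prompt token), while each decode step has to read every weight from memory once. The sketch below uses those common rules of thumb with the peak figures from the thread. The model size, prompt length, and fp16 weights are assumptions chosen for illustration, and real throughput will be well below peak.

```python
def prefill_seconds(n_params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token.

    Ignores the attention term, which grows with prompt length.
    """
    return 2 * n_params * prompt_tokens / flops_per_s


def decode_seconds_per_token(n_params: float, bytes_per_param: int,
                             bytes_per_s: float) -> float:
    """Memory-bound estimate: every weight is read once per generated token.

    Ignores KV cache reads, which add to the traffic for long contexts.
    """
    return n_params * bytes_per_param / bytes_per_s


# Assumptions for illustration only (not from the thread):
N_PARAMS = 8e9        # hypothetical 8B-parameter model
PROMPT_TOKENS = 8192  # hypothetical prompt length
BYTES_PER_PARAM = 2   # fp16 weights

# Peak figures quoted in the thread.
spark_prefill = prefill_seconds(N_PARAMS, PROMPT_TOKENS, 100e12)
ultra_prefill = prefill_seconds(N_PARAMS, PROMPT_TOKENS, 26e12)
spark_decode = decode_seconds_per_token(N_PARAMS, BYTES_PER_PARAM, 273e9)
ultra_decode = decode_seconds_per_token(N_PARAMS, BYTES_PER_PARAM, 819e9)

print(f"prefill: Spark {spark_prefill:.2f}s vs M3 Ultra {ultra_prefill:.2f}s")
print(f"decode:  Spark {spark_decode * 1e3:.1f}ms/token "
      f"vs M3 Ultra {ultra_decode * 1e3:.1f}ms/token")
```

Under these assumptions the model size and prompt length cancel out of the ratios, so prefill favours the DGX Spark by the FLOPS ratio and decode favours the M3 Ultra by the bandwidth ratio.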
We can run these two stages on different devices:

Prefill: DGX Spark (high compute device, 4x compute)
Decode: M3 Ultra (high memory-bandwidth device, 3x memory-bandwidth)

However, now we need to transfer the KV cache over the network (10GbE). This introduces a delay.
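To get a feel for the size of that delay, here is a rough estimate of how big a KV cache is and how long it takes to cross a 10GbE link if sent in one piece. The model dimensions and prompt length are hypothetical (roughly the shape of an 8B model with grouped-query attention), and the link is assumed to run at line rate with no protocol overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# 10 gigabits per second expressed in bytes per second (idealised line rate).
LINK_BYTES_PER_S = 10e9 / 8

# Hypothetical model shape and prompt length, for illustration only.
total_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)
transfer_s = total_bytes / LINK_BYTES_PER_S

print(f"KV cache: {total_bytes / 2**30:.2f} GiB")
print(f"Transfer over 10GbE: {transfer_s:.2f} s")
```

With these numbers the cache is 1 GiB and takes a bit under a second to send, which would be added to the time to first token if the transfer only started after prefill finished.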
But the KV cache is created for each transformer layer. By sending each layer’s KV cache after it’s computed, we overlap communication with computation.

We stream the KV cache and hide the network delay.
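The effect of streaming can be illustrated with a small timing model. This is a sketch of the general overlap idea, not EXO's implementation: it assumes one compute stream, one network link that sends one layer at a time, and made-up per-layer timings.

```python
def sequential_time(compute: list[float], transfer: list[float]) -> float:
    """Finish all layers first, then send the whole KV cache."""
    return sum(compute) + sum(transfer)


def streamed_time(compute: list[float], transfer: list[float]) -> float:
    """Send each layer's KV cache as soon as it is computed and the link is free."""
    compute_done = 0.0  # time at which the current layer finishes computing
    link_free = 0.0     # time at which the link finishes its current send
    for c, t in zip(compute, transfer):
        compute_done += c
        # A layer's send starts once the layer is computed AND the link is idle.
        link_free = max(link_free, compute_done) + t
    return link_free


# Hypothetical per-layer timings in seconds, for illustration only.
N_LAYERS = 32
compute = [0.040] * N_LAYERS
transfer = [0.027] * N_LAYERS

seq = sequential_time(compute, transfer)
stream = streamed_time(compute, transfer)
exposed = stream - sum(compute)

print(f"sequential: {seq:.3f}s, streamed: {stream:.3f}s, exposed delay: {exposed:.3f}s")
```

In this model the network delay is hidden whenever a layer's transfer takes no longer than a layer's compute; only the final layer's transfer remains exposed. If transfers are slower than compute, the link becomes the bottleneck and the total time approaches the sum of the transfer times.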

We achieve a 4x speedup in prefill (vs. the M3 Ultra alone) & 3x in decode (vs. the DGX Spark alone), with the network delay hidden behind compute.
Full blog post and more details about EXO 1.0:

Thanks @NVIDIA for early access to two DGX Sparks. #SparkSomethingBig

blog.exolabs.net/nvidia-dgx-spa…
