Jordan Nanos
Nov 6 · 16 tweets · 7 min read
ClusterMAX 2.0 is here!

In this update we rate 84 neoclouds, up from 26
Our market view tracks 209 providers, up from 169 in March and 124 last October
We spoke to over 140 end users as part of the research
And we wrote over 43,000 words about our experience

This is more than Animal Farm, but less than The Great Gatsby

More thoughts in 🧵
In the first version of ClusterMAX, we described key criteria across 10 categories. This time, we itemize a list of things we look for.

You can read the list at clustermax.ai/criteria
We also released 5 descriptions of our “expectations” for Slurm, Kubernetes, standalone VMs, monitoring, and health checks

Itemized criteria can be less impactful than explaining exactly what makes a cluster usable

clustermax.ai/{slurm,k8s,standalone,monitoring,health-checks}
We also provide some thoughts on key trends:
Slurm-on-Kubernetes
VMs vs Bare Metal
Kubernetes for Training
Transition to Blackwell
GB200 NVL72 Reliability and SLAs
Crypto Miners Here To Stay
Custom Storage Solutions
InfiniBand Security
Container Escapes, Embargo Programs, Pentesting and Auditing

I’ll go through a few of them here
For Slurm-on-Kubernetes (SonK), there are basically three options:

1. CoreWeave leads the way with SUNK (closed source, but available to license)
2. Nebius follows close behind with Soperator (open source; at least two other clouds, Voltage Park and GCORE, run forks of it)
3. Slinky, from SchedMD, the creators of Slurm (which many clouds use or fork)
There are meaningful differences between the three approaches, but it’s clear that SonK is here to stay thanks to the infrastructure lifecycle benefits of k8s being married with the end user ease-of-use benefits of slurm.

(screenshots of the Slinky and Soperator reference architectures included)
VMs vs Bare Metal

There is still an ongoing debate amongst the top providers. CoreWeave and Oracle use bare metal. Nebius uses KubeVirt VMs (i.e. VMs on k8s; if they’re building a managed Soperator cluster, it’s Slurm-on-VMs-on-k8s). Crusoe uses cloud-hypervisor VMs (no k8s involved). Fluidstack takes what you give them.

It’s interesting that there isn’t a settled best practice here, just tradeoffs.
Kubernetes for Training

We are seeing more k8s for training, but nothing is simple yet. Everything still kind of sends you to YAML hell. Kueue, Volcano, PyTorchJob, MPIOperator, Kubeflow, JobSet, Trainy, SkyPilot… somebody fix this
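To give a feel for the boilerplate involved, here is a sketch of a minimal PyTorchJob manifest (the Kubeflow Training Operator CRD mentioned above) built as a Python dict. The job name, image, and replica counts are placeholders of mine, and the field layout follows the kubeflow.org/v1 schema as I understand it; check the Training Operator docs before relying on it.

```python
# Sketch of a minimal PyTorchJob manifest (Kubeflow Training Operator CRD).
# All concrete values here are illustrative placeholders, not from the thread.

def pytorch_job(name: str, image: str, workers: int, gpus_per_pod: int) -> dict:
    """Build a kubeflow.org/v1 PyTorchJob manifest as a plain dict."""
    pod = {
        "spec": {
            "containers": [{
                "name": "pytorch",
                "image": image,
                # GPU count is requested via the nvidia.com/gpu resource limit.
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
            }]
        }
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # One master plus N workers is the usual distributed-training shape.
            "pytorchReplicaSpecs": {
                "Master": {"replicas": 1, "template": pod},
                "Worker": {"replicas": workers, "template": pod},
            }
        },
    }

job = pytorch_job("demo", "pytorch/pytorch:latest", workers=3, gpus_per_pod=8)
```

Serialize that dict to YAML and you have one job, for one framework, on one scheduler; every tool in the list above has its own equivalent.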
GB200 NVL72 Reliability and SLAs

Probably the most active topic for providers. The reliability of NVL72 or NVL36x2 (shown below) is really hard to contend with. Also, NVIDIA really doesn’t want to talk about the 168 ACC cables that run between the racks, or firmware version 1.3, which just shipped a few weeks ago to fix issues that have been ongoing for 6-7 months.
We also compared the SLAs that providers are settling on at the node, rack and control plane level when selling these racks. It’s really crazy that customers basically have no choice but to accept “one 9” of uptime in their SLA.

To quote a friend who works in the industry: “anyone quoting 72 out of 72 GPUs available for 99% uptime is insane and definitely losing money”.
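To see why that quote rings true, here is a back-of-the-envelope sketch (my arithmetic, not from the report): if the rack only counts as “up” when all 72 GPUs are up, and GPU failures are independent, each GPU needs extremely high availability to hit even 99% at the rack level.

```python
# Back-of-the-envelope availability math for an NVL72-style rack.
# Assumptions (mine): independent GPU failures, and the rack is "up"
# only when all 72 GPUs are up. 730 hours ~= one month.

def per_gpu_availability_needed(rack_target: float, gpus: int = 72) -> float:
    """Per-GPU availability required for the whole rack to hit rack_target."""
    return rack_target ** (1 / gpus)

def monthly_downtime_hours(availability: float, hours_per_month: float = 730) -> float:
    """Downtime budget per month implied by an availability SLA."""
    return hours_per_month * (1 - availability)

print(f"99% rack uptime needs {per_gpu_availability_needed(0.99):.5%} per GPU")
print(f'"one 9" (90%) allows {monthly_downtime_hours(0.90):.0f} h/month of downtime')
print(f"99% allows {monthly_downtime_hours(0.99):.1f} h/month of downtime")
```

Roughly 99.986% per GPU for a 99% rack SLA, versus a downtime budget of about 7.3 hours a month at 99% and 73 hours at “one 9” — which is why providers are reluctant to quote anything stronger.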
InfiniBand Security

After a bunch of pressure, NVIDIA finally released a blog on this topic, explaining how to (for lack of a better description) make a VLAN on InfiniBand.

But it’s not that simple. To make a partition on IB, you need to set a bunch of keys, and in our tests providers just… don’t. I got a bunch of 2- and 4-node clusters and expected 16 or 32 endpoints in my partition, but I could run sudo ibhosts or grep the local ibdiagnet2.pkey file and see 500+ endpoints.

Please set a per-tenant P_Key, M_Key, VS_Key, C_Key, N2N_Key, SA_Key and AM_Key (if using SHARP)
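The check above is easy to script. Here is a minimal sketch of what we did, assuming (as is typical) that each non-empty line of `ibhosts` output describes one host channel adapter visible on the fabric; the expected endpoint count is a per-tenant assumption you fill in.

```python
# Quick fabric-visibility check: count how many host endpoints this node
# can see on the InfiniBand subnet, and compare against what a properly
# partitioned tenant should see. Line-counting `ibhosts` output and the
# expected endpoint count are my assumptions for illustration.
import subprocess

def count_endpoints(ibhosts_output: str) -> int:
    """Count non-empty lines, one per CA reported by `ibhosts`."""
    return sum(1 for line in ibhosts_output.splitlines() if line.strip())

def visible_ib_hosts() -> int:
    """Run `ibhosts` (requires InfiniBand diagnostics tools) and count CAs."""
    out = subprocess.run(["ibhosts"], capture_output=True, text=True, check=True)
    return count_endpoints(out.stdout)

def check_partition(expected_endpoints: int) -> None:
    seen = visible_ib_hosts()
    if seen > expected_endpoints:
        print(f"WARNING: expected <= {expected_endpoints} endpoints, saw {seen} - "
              "P_Key isolation is likely not configured")
    else:
        print(f"OK: {seen} endpoints visible")

# For a 4-node cluster with 8 HCAs per node you'd expect ~32 endpoints:
# check_partition(32)
```

On a correctly keyed partition this should only ever see your own nodes; seeing 500+ endpoints from a 4-node rental is exactly the failure mode described above.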
Container Escapes, Embargo Programs, Pentesting and Auditing

Security is often treated as an annoying checkbox, but when @wiz posts about CVE-2024-0132 in February and CVE-2025-23266 (9.0 CVSS) in July, we expect providers to be able to notify their users, schedule a maintenance window, and apply the patch.

But when we explicitly tell providers about these exploits, share that we’re going to check for them, and then still get a 9-month-old version of the NVIDIA Container Toolkit that is vulnerable, it’s annoying.
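The check itself is trivial, which is what makes the result annoying. A sketch of the kind of version gate we applied, assuming for illustration that 1.17.8 is the first patched container-toolkit release (verify the exact number against NVIDIA’s security bulletin):

```python
# Is the installed nvidia-container-toolkit new enough to include the
# CVE-2025-23266 fix? The patched version below is an assumption for
# illustration - confirm it against NVIDIA's advisory before relying on it.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '1.17.8' (optionally with an '-rc.1' suffix) into a comparable tuple."""
    return tuple(int(part) for part in v.split("-")[0].split("."))

def is_patched(installed: str, patched: str = "1.17.8") -> bool:
    """Tuple comparison handles multi-digit components correctly ('1.9' < '1.17')."""
    return parse_version(installed) >= parse_version(patched)

# A 9-month-old toolkit like the ones we found in the wild:
print(is_patched("1.16.2"))  # old, vulnerable release
print(is_patched("1.17.8"))  # at or past the (assumed) patched version
```

Tuple comparison rather than string comparison is the important detail: `"1.9" > "1.17"` lexically, which is a classic way to ship a broken version gate.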
We recommend for all providers to join NVIDIA’s embargo program so that they’re prepared when something like this inevitably happens again. We’re also glad that AMD has established a similar embargo program for their neocloud partners following our feedback.

wiz blog: wiz.io/blog/nvidia-ai…
and a description of the exploit:
If you found this thread interesting and want to contribute to our research, please apply for a job at SemiAnalysis, or just send me a DM

We’d love your feedback either way!

x.com/i/jobs/1967722…
