Jordan Nanos
Nov 6 · 16 tweets · 7 min read
ClusterMAX 2.0 is here!

In this update we rate 84 neoclouds, up from 26
Our market view tracks 209 providers, up from 169 in March and 124 last October
We spoke to over 140 end users as part of the research
And we wrote over 43,000 words about our experience

This is more than Animal Farm, but less than The Great Gatsby

More thoughts in 🧵
In the first version of ClusterMAX, we described key criteria across 10 categories. This time, we itemize a list of things we look for.

You can read the list at clustermax.ai/criteria
We also released 5 descriptions of our “expectations” for Slurm, Kubernetes, standalone VMs, monitoring, and health checks

Itemized criteria can be less impactful than explaining exactly what makes a cluster usable

clustermax.ai/{slurm,k8s,standalone,monitoring,health-checks}
We also provide some thoughts on key trends:
Slurm-on-Kubernetes
VMs vs Bare Metal
Kubernetes for Training
Transition to Blackwell
GB200 NVL72 Reliability and SLAs
Crypto Miners Here To Stay
Custom Storage Solutions
InfiniBand Security
Container Escapes, Embargo Programs, Pentesting and Auditing

I’ll go through a few of them here
For Slurm-on-Kubernetes (SonK), there are basically three options:

1. CoreWeave leads the way with SUNK (closed source, but available to license)
2. Nebius follows close behind with Soperator (open source; at least two other clouds, Voltage Park and GCORE, run forks of it)
3. Slinky, from SchedMD, the creators of Slurm (which many clouds use or fork)
There are meaningful differences between the three approaches, but it’s clear that SonK is here to stay thanks to the infrastructure lifecycle benefits of k8s being married with the end user ease-of-use benefits of slurm.

(screenshots of the Slinky and Soperator reference architectures included)
VMs vs Bare Metal

There is still an ongoing debate amongst the top providers. CoreWeave and Oracle use bare metal. Nebius uses KubeVirt VMs (i.e. VMs on k8s; if they’re building a managed Soperator cluster, it’s Slurm-on-VMs-on-k8s). Crusoe uses cloud-hypervisor VMs (no k8s involved). Fluidstack takes what you give them.

It’s interesting that there isn’t a settled best practice here, just tradeoffs.
Kubernetes for Training

We are seeing more k8s for training, but nothing is simple yet. Everything still kind of sends you to YAML hell. Kueue, Volcano, PyTorchJob, MPIOperator, Kubeflow, JobSet, Trainy, SkyPilot… somebody fix this
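To give a feel for the boilerplate involved, here is a sketch of a minimal PyTorchJob manifest (the Kubeflow Training Operator CRD mentioned above) built as a Python dict. The job name, image, and replica counts are placeholders of mine, and the field layout follows the kubeflow.org/v1 schema as I understand it; check the Training Operator docs before relying on it.

```python
# Sketch of a minimal PyTorchJob manifest (Kubeflow Training Operator CRD).
# All concrete values here are illustrative placeholders, not from the thread.

def pytorch_job(name: str, image: str, workers: int, gpus_per_pod: int) -> dict:
    """Build a kubeflow.org/v1 PyTorchJob manifest as a plain dict."""
    pod = {
        "spec": {
            "containers": [{
                "name": "pytorch",
                "image": image,
                # GPU count is requested via the nvidia.com/gpu resource limit.
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
            }]
        }
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # One master plus N workers is the usual distributed-training shape.
            "pytorchReplicaSpecs": {
                "Master": {"replicas": 1, "template": pod},
                "Worker": {"replicas": workers, "template": pod},
            }
        },
    }

job = pytorch_job("demo", "pytorch/pytorch:latest", workers=3, gpus_per_pod=8)
```

Serialize that dict to YAML and you have one job, for one framework, on one scheduler; every tool in the list above has its own equivalent.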
GB200 NVL72 Reliability and SLAs

Probably the most active topic for providers. The reliability of NVL72 or NVL36x2 (shown below) is really hard to contend with. Also, NVIDIA really doesn’t want to talk about the 168 ACC cables that run between the racks, or firmware version 1.3, which just shipped a few weeks ago to fix issues that have been ongoing for 6-7 months.
We also compared the SLAs that providers are settling on at the node, rack and control plane level when selling these racks. It’s really crazy that customers basically have no choice but to accept “one 9” of uptime in their SLA.

To quote a friend who works in the industry: “anyone quoting 72 out of 72 GPUs available for 99% uptime is insane and definitely losing money”.
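To see why that quote rings true, here is a back-of-the-envelope sketch (my arithmetic, not from the report): if the rack only counts as “up” when all 72 GPUs are up, and GPU failures are independent, each GPU needs extremely high availability to hit even 99% at the rack level.

```python
# Back-of-the-envelope availability math for an NVL72-style rack.
# Assumptions (mine): independent GPU failures, and the rack is "up"
# only when all 72 GPUs are up. 730 hours ~= one month.

def per_gpu_availability_needed(rack_target: float, gpus: int = 72) -> float:
    """Per-GPU availability required for the whole rack to hit rack_target."""
    return rack_target ** (1 / gpus)

def monthly_downtime_hours(availability: float, hours_per_month: float = 730) -> float:
    """Downtime budget per month implied by an availability SLA."""
    return hours_per_month * (1 - availability)

print(f"99% rack uptime needs {per_gpu_availability_needed(0.99):.5%} per GPU")
print(f'"one 9" (90%) allows {monthly_downtime_hours(0.90):.0f} h/month of downtime')
print(f"99% allows {monthly_downtime_hours(0.99):.1f} h/month of downtime")
```

Roughly 99.986% per GPU for a 99% rack SLA, versus a downtime budget of about 7.3 hours a month at 99% and 73 hours at “one 9” — which is why providers are reluctant to quote anything stronger.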
InfiniBand Security

After a bunch of pressure, NVIDIA finally released a blog on this topic, explaining how to (for lack of a better description) make a VLAN on InfiniBand.

But it’s not that simple. To make a partition on IB, you need to set a bunch of keys, and in our tests providers just… don’t. I got a bunch of 2- and 4-node clusters and expected 16 or 32 endpoints in my partition, but I could run sudo ibhosts or grep the local ibdiagnet2.pkey file and see 500+ endpoints.

Please set a per-tenant P_Key, M_Key, VS_Key, C_Key, N2N_Key, SA_Key and AM_Key (if using SHARP)
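The check above is easy to script. Here is a minimal sketch of what we did, assuming (as is typical) that each non-empty line of `ibhosts` output describes one host channel adapter visible on the fabric; the expected endpoint count is a per-tenant assumption you fill in.

```python
# Quick fabric-visibility check: count how many host endpoints this node
# can see on the InfiniBand subnet, and compare against what a properly
# partitioned tenant should see. Line-counting `ibhosts` output and the
# expected endpoint count are my assumptions for illustration.
import subprocess

def count_endpoints(ibhosts_output: str) -> int:
    """Count non-empty lines, one per CA reported by `ibhosts`."""
    return sum(1 for line in ibhosts_output.splitlines() if line.strip())

def visible_ib_hosts() -> int:
    """Run `ibhosts` (requires InfiniBand diagnostics tools) and count CAs."""
    out = subprocess.run(["ibhosts"], capture_output=True, text=True, check=True)
    return count_endpoints(out.stdout)

def check_partition(expected_endpoints: int) -> None:
    seen = visible_ib_hosts()
    if seen > expected_endpoints:
        print(f"WARNING: expected <= {expected_endpoints} endpoints, saw {seen} - "
              "P_Key isolation is likely not configured")
    else:
        print(f"OK: {seen} endpoints visible")

# For a 4-node cluster with 8 HCAs per node you'd expect ~32 endpoints:
# check_partition(32)
```

On a correctly keyed partition this should only ever see your own nodes; seeing 500+ endpoints from a 4-node rental is exactly the failure mode described above.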
Container Escapes, Embargo Programs, Pentesting and Auditing

Security is often treated as an annoying checkbox, but when @wiz posts about CVE-2024-0132 in February and CVE-2025-23266 (9.0 CVSS) in July, we expect providers to be able to notify their users, schedule a maintenance window, and apply the patch.

But when we explicitly tell providers about these exploits, share that we’re going to check for them, and then still get a 9-month-old version of the NVIDIA Container Toolkit that is vulnerable, it’s annoying.
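The check itself is trivial, which is what makes the result annoying. A sketch of the kind of version gate we applied, assuming for illustration that 1.17.8 is the first patched container-toolkit release (verify the exact number against NVIDIA’s security bulletin):

```python
# Is the installed nvidia-container-toolkit new enough to include the
# CVE-2025-23266 fix? The patched version below is an assumption for
# illustration - confirm it against NVIDIA's advisory before relying on it.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '1.17.8' (optionally with an '-rc.1' suffix) into a comparable tuple."""
    return tuple(int(part) for part in v.split("-")[0].split("."))

def is_patched(installed: str, patched: str = "1.17.8") -> bool:
    """Tuple comparison handles multi-digit components correctly ('1.9' < '1.17')."""
    return parse_version(installed) >= parse_version(patched)

# A 9-month-old toolkit like the ones we found in the wild:
print(is_patched("1.16.2"))  # old, vulnerable release
print(is_patched("1.17.8"))  # at or past the (assumed) patched version
```

Tuple comparison rather than string comparison is the important detail: `"1.9" > "1.17"` lexically, which is a classic way to ship a broken version gate.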
We recommend for all providers to join NVIDIA’s embargo program so that they’re prepared when something like this inevitably happens again. We’re also glad that AMD has established a similar embargo program for their neocloud partners following our feedback.

wiz blog: wiz.io/blog/nvidia-ai…
and a description of the exploit:
If you found this thread interesting and want to contribute to our research, please apply for a job at SemiAnalysis, or just send me a DM

We’d love your feedback either way!

x.com/i/jobs/1967722…
