In this update we rate 84 neoclouds, up from 26
Our market view tracks 209 providers, up from 169 in March and 124 last October
We spoke to over 140 end users as part of the research
And we wrote over 43,000 words about our experience
This is more than Animal Farm, but less than The Great Gatsby
We also provide some thoughts on key trends:
Slurm-on-Kubernetes
VMs vs Bare Metal
Kubernetes for Training
Transition to Blackwell
GB200 NVL72 Reliability and SLAs
Crypto Miners Here To Stay
Custom Storage Solutions
InfiniBand Security
Container Escapes, Embargo Programs, Pentesting and Auditing
I’ll go through a few of them here
For Slurm-on-Kubernetes (SonK) there are basically three options:
1. CoreWeave leads the way with SUNK (closed source, but it can be licensed)
2. Nebius follows close behind with Soperator (open source; at least two other clouds, Voltage Park and GCORE, run forks of it)
3. Slinky, from SchedMD, the creators of Slurm (which many clouds use or fork)
There are meaningful differences between the three approaches, but it’s clear that SonK is here to stay: the infrastructure-lifecycle benefits of k8s get married to the end-user ease-of-use benefits of Slurm.
(screenshots of the Slinky and Soperator reference architectures included)
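If you want to kick the tires on the open options, standing up Slinky is basically a Helm install. A minimal sketch, and note the chart location below is our assumption based on the SlinkyProject GitHub org, so check SchedMD's docs for the current path and values before running it:

# Install Slinky's slurm-operator into its own namespace (chart path assumed, verify first)
helm install slurm-operator \
  oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --namespace slinky --create-namespace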
VMs vs Bare Metal
There is still an ongoing debate amongst the top providers. CoreWeave and Oracle use bare metal. Nebius uses kubevirt VMs (i.e. VMs on k8s, or, if they’re standing up a managed Soperator cluster, Slurm-on-VMs-on-k8s). Crusoe uses cloud-hypervisor VMs (no k8s involved). Fluidstack takes what you give them.
It’s interesting that there isn’t a settled best practice here, just tradeoffs.
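To make the kubevirt flavor concrete, here’s a minimal sketch of a GPU VM on k8s. The GPU resource name, disk image, and sizes are illustrative assumptions, and the cluster’s KubeVirt config has to explicitly permit the host device for passthrough:

cat <<'EOF' | kubectl apply -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-vm
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 16
        memory:
          guest: 64Gi
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
          gpus:
          - name: gpu0
            deviceName: nvidia.com/GH100_H100_SXM5_80GB  # host-dependent resource name, assumption
      volumes:
      - name: rootdisk
        containerDisk:
          image: quay.io/containerdisks/ubuntu:22.04  # illustrative image
EOF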
Kubernetes for Training
We are seeing more k8s for training, but nothing is simple yet. Everything still kind of sends you to yaml hell: Kueue, Volcano, PyTorchJob, MPI Operator, Kubeflow, JobSet, Trainy, SkyPilot… somebody fix this
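To show the yaml in question, here’s a minimal multi-node PyTorchJob sketch for Kubeflow’s training operator (the image, entrypoint, and GPU counts are placeholders, not a recommendation):

cat <<'EOF' | kubectl apply -f -
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch  # the operator expects this container name
            image: nvcr.io/nvidia/pytorch:24.04-py3  # placeholder image
            command: ["torchrun", "train.py"]        # placeholder entrypoint
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.04-py3
            command: ["torchrun", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 8
EOF

And that’s before you layer Kueue or JobSet on top for queueing.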
GB200 NVL72 Reliability and SLAs
Probably the most active topic for providers. The reliability of NVL72 or NVL36x2 (shown below) is really hard to contend with. NVIDIA also really doesn’t want to talk about the 168 ACC cables that run between the racks, or the firmware (version 1.3) that shipped just a few weeks ago to fix issues that had been ongoing for 6-7 months.
We also compared the SLAs providers are settling on at the node, rack, and control-plane level when selling these racks. It’s really crazy that customers basically have no choice but to accept "one 9" of uptime in their SLA.
To quote a friend who works in the industry: "anyone quoting 72 out of 72 GPUs available for 99% uptime is insane and definitely losing money".
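Back-of-envelope: a 30-day month is 720 hours, so a 99% uptime SLA permits about 7.2 hours of downtime per month, and 90% (a literal single nine) permits 72 hours. That’s per rack, before asking how often all 72 GPUs are simultaneously healthy.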
InfiniBand Security
After a bunch of pressure, NVIDIA finally released a blog post on this topic, explaining how to (for lack of a better description) make a VLAN on InfiniBand.
But it’s not that simple. To make a partition on IB you need to set a bunch of keys, and in our tests providers just… don’t. I got a bunch of 2- and 4-node clusters and expected 16 or 32 endpoints in my partition, but I could run sudo ibhosts or grep the local ibdiagnet2.pkey file and see 500+ endpoints.
Please set a per-tenant P_Key, M_Key, VS_Key, C_Key, N2N_Key, SA_Key and AM_Key (if using SHARP)
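If you want to reproduce the check, a rough sketch of what we ran on a tenant node (this assumes the standard InfiniBand userspace tools and an HCA that exposes pkeys via sysfs):

# Count the endpoints visible from this node: on a 2- or 4-node cluster
# this should be on the order of 16-32 HCAs, not 500+.
sudo ibhosts | wc -l

# Dump the partition keys programmed into each local HCA port.
# 0xffff / 0x7fff are the default full/limited-membership P_Keys; an
# isolated tenant should see a provider-assigned key, not just the default.
for f in /sys/class/infiniband/*/ports/*/pkeys/*; do
  v=$(cat "$f"); [ "$v" != "0x0000" ] && echo "$f: $v"
done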
Container Escapes, Embargo Programs, Pentesting and Auditing
Security is treated as an annoying checkbox, but when @wiz posted about CVE-2024-0132 in February and CVE-2025-23266 (9.0 CVSS) in July, we expected providers to be able to notify their users, schedule a maintenance window, and apply the patch.
But when we explicitly tell providers about these exploits, share that we’re going to check for them, and still get handed a nine-month-old, still-vulnerable build of the NVIDIA Container Toolkit (nct), it’s annoying.
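The check itself takes seconds. A sketch (the fixed versions here are our reading of the advisories, roughly 1.16.2 for CVE-2024-0132 and 1.17.8 for CVE-2025-23266, so verify against NVIDIA’s security bulletins):

# Print the installed NVIDIA Container Toolkit version on a node, then
# compare it by hand against the fixed version in the advisory.
nvidia-ctk --version
nvidia-container-cli --version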
We recommend that all providers join NVIDIA’s embargo program so they’re prepared when something like this inevitably happens again. We’re also glad that AMD has established a similar embargo program for its neocloud partners following our feedback.