Brendan Bouffler☁️ 🏳️‍🌈 (@boofla)

22 Jul, 23 tweets, 6 min read

A lot of folks have asked about what we used to build out the infrastructure for the hackathon. 1/22

I'm going to lean on @ollyperksHPC and @CQnib and others who worked really closely to look after both the orchestration of the clusters themselves as well as the "insides" of the clusters (packages, compilers, libraries etc). 2/22

Firstly, we extensively used @awscloud #ParallelCluster, which takes spec files that more or less say - "gimme a cluster with, lemme see ... I think I want Slurm today, and up to 16 compute nodes made from Graviton2, I also want EFA for doing fast MPI stuff, oh, and a… 3/22

…500 GB Lustre filesystem, better get DCV for secure remote desktops, too, oh, and a side of 🍟"). It turns out a cluster in about 5-10 minutes built to those specs, which is magic. 4/22
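
As a rough illustration of what such a spec boils down to, here is a minimal Python sketch that captures the request from tweets 3-4 as plain data. The `ClusterSpec` class, its field names, and the `describe` helper are hypothetical and are not ParallelCluster's actual configuration schema; only the values (Slurm, up to 16 Graviton2 compute nodes, EFA, a 500 GB Lustre filesystem, DCV) come from the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical stand-in for a cluster spec; not the real ParallelCluster schema."""
    scheduler: str
    max_compute_nodes: int
    cpu_family: str
    efa_enabled: bool
    lustre_capacity_gb: int
    dcv_enabled: bool


def describe(spec: ClusterSpec) -> str:
    """Render the spec as a one-line, human-readable summary."""
    extras = []
    if spec.efa_enabled:
        extras.append("EFA")
    if spec.dcv_enabled:
        extras.append("DCV")
    extras_text = ", ".join(extras) if extras else "no extras"
    return (
        f"{spec.scheduler} cluster, up to {spec.max_compute_nodes} "
        f"{spec.cpu_family} nodes, {spec.lustre_capacity_gb} GB Lustre, {extras_text}"
    )


# Values taken from the thread; the field names are illustrative only.
hackathon_spec = ClusterSpec(
    scheduler="Slurm",
    max_compute_nodes=16,
    cpu_family="Graviton2",
    efa_enabled=True,
    lustre_capacity_gb=500,
    dcv_enabled=True,
)

print(describe(hackathon_spec))
```

The point of the sketch is only that the whole request fits in a handful of fields; the real tool reads its own spec file format.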

The options in ParallelCluster are pretty crazy and that was most of the goodness we needed for the event. 5/22

If you've ever been through an RFP process to design and build a cluster to those specs, you'll appreciate that doing it in 5-10 minutes is more than kinda cool (except for the 🍟 - they're usually pretty quick). 6/22

In all, we spun up 61 ParallelCluster clusters all over the world in several AZs in 4 different AWS regions, aiming to keep the clusters close to the people using them, so keyboard latency would be minimal. 7/22

We also wanted each team to have their own infrastructure, and we wanted them to have a cluster from each CPU flavor (one Arm-based Graviton and one x86). 38,000 cores in all. 8/22

We chose to do it this way because we wanted to give each team their own sandpit which elastically responded to their own needs. Most teams didn't take long to get used to clusters that grew or shrank when they needed: "weird at first, but ... seriously very, very cool". 9/22
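
To make "grew or shrank" concrete, here is a toy Python sketch of the idea: the compute fleet follows demand but stays inside fixed limits. This is not how ParallelCluster or Slurm actually decide when to scale; the function name and the minimum of 0 nodes are assumptions, and only the cap of 16 nodes comes from the spec described earlier in the thread.

```python
def target_node_count(nodes_wanted: int, min_nodes: int = 0, max_nodes: int = 16) -> int:
    """Return how many compute nodes to run for the current demand.

    Hypothetical helper: clamps the number of nodes the queued work is asking
    for to the range [min_nodes, max_nodes].
    """
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("need 0 <= min_nodes <= max_nodes")
    return max(min_nodes, min(nodes_wanted, max_nodes))


# Idle queue: shrink to the minimum. Busy queue: grow, but never past the cap.
for wanted in (0, 5, 40):
    print(wanted, "->", target_node_count(wanted))
```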

This is one of the driving reasons people come to the cloud to do #HPC in the first place: they don't want to get log-jammed behind other people. 10/22

We really believe that *the most expensive item* in any HPC data center is the human trying to use the damn thing - not the machinery. So clearing the bottlenecks for humans to work productively is task #1. 11/22

However, there are some nerdy things to admire: the c6gn instances all used Elastic Fabric Adapter - that's the 100 Gbit/s fast networking fabric built on SRD (more on that here: hpc.news/srdblog). The CPUs were our Graviton2's (our Arm-based processor). 12/22

They are an *absolute beast* for memory-bandwidth hungry applications. 13/22

And with 64 cores in a single socket, the simplicity of the data locality model makes for some very speedy results, like WRF: which is interesting because 14/23

... the code ran exactly the same, node for node, as the x86 version, but ... um ... way cheaper. 15/23

On the inside, every cluster came up with @arm's performance libraries installed, plus GCC, Arm's own Compiler and the NV HPC Compiler (thank you, @NVIDIAHPCDev !). 16/23

Lotsa math libraries too: ArmPL, OpenBLAS, BLIS, BYO!, and the icing on the cake was Arm Forge which is the top-shelf profiling suite that all the big guys use in their supercomputing centers. 17/23

The hackathon challenge was to update @spackpm recipes and create @reframehpc scripts so we could build CI/CD pipelines. 18/23

We directed the output from Slurm and ReFrame to Graylog (graylog.org) which made it super easy to track activity and progress, and to make pretty graphs of things. 19/23

The entire deployment was scripted and templated, and - in fact - in an unanticipated Sunday night 2am 'rehearsal', we re-created the whole lot in around 25 minutes (for reasons that had nothing to do with technology and everything to do with us being humans). 20/23

During the event our hack-ops team regularly deployed event-wide or cluster-wide updates, extra packages or just scaled up limits to compute fleets - usually within a few minutes of being asked in one of the many @slackHQ channels we used as virtual help desks. 21/23

All in all, it was close to a textbook HPC-in-the-cloud thing. We learned a bunch from it, tho, because every time you put a piece of software to a scale test like this, you get data. 22/23

We've come up with a few ideas we'll be discussing with our dev teams over the summer and helping to create some working backwards docs to turn those ideas into reality. 23/23
