A lot of folks have asked about what we used to build out the infrastructure for the hackathon. 1/22
I'm going to lean on @ollyperksHPC and @CQnib and others who worked really closely on this, looking after both the orchestration of the clusters themselves and the "insides" of the clusters (packages, compilers, libraries, etc.). 2/22
Firstly, we extensively used @awscloud #ParallelCluster, which takes spec files that more or less say: "gimme a cluster with, lemme see ... I think I want Slurm today, and up to 16 compute nodes made from Graviton2. I also want EFA for doing fast MPI stuff, oh, and a… 3/22
…500 GB Lustre filesystem. Better get DCV for secure remote desktops, too. Oh, and a side of 🍟." It turns out a cluster built to those specs in about 5-10 minutes, which is magic (there's a sketch of that kind of spec below). 4/22
The options in ParallelCluster are pretty extensive, and they covered most of the goodness we needed for the event. 5/22
If you've ever been through an RFP process to design and build a cluster to those specs, you'll appreciate that doing it in 5-10 minutes is more than kinda cool (except for the 🍟 - they're usually pretty quick). 6/22
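
For the curious, a ParallelCluster v3-style spec for a cluster like the one described above could look roughly like the sketch below, written here as a Python dict and dumped to the YAML the `pcluster` CLI expects. The subnet ID, key name, instance types and storage size are all placeholders, and the event may well have used a different ParallelCluster version with a different config format.

```python
# A minimal sketch of a ParallelCluster v3-style cluster spec. All IDs, names
# and sizes below are placeholders, not the hackathon's real values.
import yaml  # pip install pyyaml

cluster_spec = {
    "Region": "us-east-1",
    "Image": {"Os": "alinux2"},
    "HeadNode": {
        "InstanceType": "c6g.2xlarge",
        "Networking": {"SubnetId": "subnet-0123456789abcdef0"},
        "Ssh": {"KeyName": "hackathon-key"},
        "Dcv": {"Enabled": True},                  # secure remote desktops
    },
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [{
            "Name": "compute",
            "Networking": {"SubnetIds": ["subnet-0123456789abcdef0"]},
            "ComputeResources": [{
                "Name": "c6gn",
                "InstanceType": "c6gn.16xlarge",   # Graviton2 + 100 Gbit/s EFA
                "MinCount": 0,                     # scale to zero when idle
                "MaxCount": 16,                    # "up to 16 compute nodes"
                "Efa": {"Enabled": True},
            }],
        }],
    },
    "SharedStorage": [{
        "MountDir": "/lustre",
        "Name": "scratch",
        "StorageType": "FsxLustre",
        "FsxLustreSettings": {"StorageCapacity": 1200},  # GB; FSx Lustre comes in fixed size steps
    }],
}

with open("cluster.yaml", "w") as f:
    yaml.safe_dump(cluster_spec, f, sort_keys=False)
# then: pcluster create-cluster --cluster-name team01-graviton --cluster-configuration cluster.yaml
```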
In all, we spun up 61 ParallelCluster clusters all over the world, across several AZs in 4 different AWS regions, aiming to keep the clusters close to the people using them so keyboard latency would be minimal. 7/22
We also wanted each team to have their own infrastructure, and we wanted them to have a cluster from each CPU flavor (one Arm-based Graviton and one x86). 38,000 cores in all. 8/22
We chose to do it this way because we wanted to give each team their own sandpit that elastically responded to their needs. Most teams didn't take long to get used to clusters that grew or shrank as they needed: "weird at first, but ... seriously very, very cool". 9/22
This is one of the driving reasons people come to the cloud to do #HPC in the first place: they don't want to get log-jammed behind other people. 10/22
We really believe that *the most expensive item* in any HPC data center is the human trying to use the damn thing - not the machinery. So clearing the bottlenecks for humans to work productively is task #1. 11/22
However, there are some nerdy things to admire: the c6gn instances all used Elastic Fabric Adapter - that's the 100 Gbit/s fast networking fabric built on SRD (more on that here: hpc.news/srdblog). The CPUs were Graviton2s (our Arm-based processor). 12/22
They're *absolute beasts* for memory-bandwidth-hungry applications. 13/22
And with 64 cores in a single socket, the simplicity of the data locality model makes for some very speedy results, like WRF - which is interesting because ... 14/23
... the code ran exactly the same, node for node, as the x86 version, but ... um ... way cheaper. 15/23
On the inside, every cluster came up with @arm's performance libraries installed, plus GCC, Arm's own compiler and the NVIDIA HPC compiler (thank you, @NVIDIAHPCDev!). 16/23
Lotsa math libraries too: ArmPL, OpenBLAS, BLIS, or BYO! And the icing on the cake was Arm Forge, the top-shelf debugging and profiling suite that all the big guys use in their supercomputing centers. 17/23
The hackathon challenge was to update @spackpm recipes and create @reframehpc scripts so we could build CI/CD pipelines. 18/23
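
For anyone who hasn't seen one, a Spack recipe is just a Python class. The sketch below is a bare-bones, hypothetical example of the shape of recipe the teams were updating - the package name, URL, checksum and options are made up, not one of the real codes from the event.

```python
# Hypothetical Spack recipe sketch; name, URL, checksum and variant are placeholders.
from spack import *


class Minisolver(CMakePackage):
    """Imaginary mini-app, used here only to show what a recipe looks like."""

    homepage = "https://example.org/minisolver"
    url = "https://example.org/minisolver-1.0.tar.gz"

    version("1.0", sha256="0000000000000000000000000000000000000000000000000000000000000000")

    variant("openmp", default=True, description="Build with OpenMP support")

    depends_on("mpi")
    depends_on("blas")   # satisfied by ArmPL, OpenBLAS, BLIS, ...

    def cmake_args(self):
        return [self.define_from_variant("ENABLE_OPENMP", "openmp")]
```

A ReFrame check is Python too: you describe how to build and run something, what "correct" looks like, and which numbers to track over time. Again, this is an illustrative ReFrame 3.x-style sketch (a STREAM-flavoured check), not one of the event's actual test scripts.

```python
# Illustrative ReFrame 3.x-style check; systems, flags and patterns are examples only.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class StreamCheck(rfm.RegressionTest):
    def __init__(self):
        self.descr = 'STREAM triad bandwidth check'
        self.valid_systems = ['*']
        self.valid_prog_environs = ['*']
        self.build_system = 'SingleSource'
        self.sourcepath = 'stream.c'
        self.build_system.cflags = ['-O3', '-fopenmp']
        self.num_tasks = 1
        # the run passes only if STREAM reports a validated solution
        self.sanity_patterns = sn.assert_found(r'Solution Validates', self.stdout)
        # track triad bandwidth so regressions show up in the CI history
        self.perf_patterns = {
            'triad': sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)
        }
        self.reference = {'*': {'triad': (0, None, None, 'MB/s')}}
```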
We directed the output from Slurm and ReFrame to Graylog (graylog.org), which made it super easy to track activity and progress and make pretty graphs of things. 19/23
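
Graylog speaks GELF, which is basically JSON over (for example) UDP, so the wiring doesn't take much. Here's a stdlib-only sketch of the kind of event forwarding involved - the host, port and field names are placeholders, and the thread doesn't describe the actual plumbing.

```python
# Minimal sketch of shipping a job event to a Graylog GELF UDP input.
# Host, port and field names are placeholders, not the hackathon's setup.
import json
import socket
import time

GRAYLOG_HOST = "graylog.example.internal"   # placeholder
GRAYLOG_PORT = 12201                        # Graylog's default GELF UDP port


def send_gelf(short_message, **fields):
    """Send one GELF 1.1 message; custom fields get the required '_' prefix."""
    msg = {
        "version": "1.1",
        "host": socket.gethostname(),
        "short_message": short_message,
        "timestamp": time.time(),
    }
    msg.update({f"_{k}": v for k, v in fields.items()})
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(msg).encode("utf-8"), (GRAYLOG_HOST, GRAYLOG_PORT))


# e.g. called from a Slurm epilog script or a ReFrame run wrapper
send_gelf("job finished", cluster="team07-graviton", job_id=1234, state="COMPLETED")
```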
The entire deployment was scripted and templated, and - in fact - in an unanticipated Sunday night 2am 'rehearsal', we re-created the whole lot in around 25 minutes (for reasons that had nothing to do with technology and everything to do with us being humans). 20/23
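
The thread doesn't show the actual tooling, but the general shape of that kind of scripted, templated fan-out is easy to picture: render a config per team and CPU flavor, then let the ParallelCluster CLI do the heavy lifting. Everything below (team roster, regions, instance types, template file) is invented for illustration.

```python
# Sketch of a templated cluster fan-out; all names and values are placeholders.
import subprocess
from pathlib import Path

TEAMS = [f"team{n:02d}" for n in range(1, 31)]          # placeholder roster
ARCHS = {
    "graviton": {"region": "us-east-1", "compute": "c6gn.16xlarge"},
    "x86":      {"region": "us-east-2", "compute": "c5n.18xlarge"},
}


def render_config(team, arch, opts):
    """Fill a per-cluster ParallelCluster config from a template file (not shown)."""
    template = Path("cluster-template.yaml").read_text()
    return template.format(team=team, arch=arch, **opts)


for team in TEAMS:
    for arch, opts in ARCHS.items():
        name = f"{team}-{arch}"
        cfg = Path(f"configs/{name}.yaml")
        cfg.parent.mkdir(exist_ok=True)
        cfg.write_text(render_config(team, arch, opts))
        # ParallelCluster v3 CLI shown here; v2 used `pcluster create <name> -c <config>`
        subprocess.run(
            ["pcluster", "create-cluster",
             "--cluster-name", name,
             "--cluster-configuration", str(cfg),
             "--region", opts["region"]],
            check=True,
        )
```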
During the event our hack-ops team regularly deployed event-wide or cluster-wide updates, added extra packages, or just scaled up the limits on compute fleets - usually within a few minutes of being asked in one of the many @slackHQ channels we used as virtual help desks. 21/23
All in all, it was close to a textbook HPC-in-the-cloud thing. We learned a bunch from it, tho, because every time you put a piece of software through a scale test like this, you get data. 22/23
We've come up with a few ideas that we'll be discussing with our dev teams over the summer, and we'll be helping to create some working-backwards docs to turn those ideas into reality. 23/23


