When we (@awscloud) started building EFA, our fast network fabric for HPC, I was skeptical about whether we'd be able to run the really hard "latency-sensitive" codes like weather simulations or molecular dynamics. Boy was I wrong. Turns out: we rock at these codes. 1/16
We learned, though, that single-packet latency is a distraction when you're trying to predict code performance on an HPC cluster.

Don't misunderstand me, though: it's not irrelevant. 2/16
It's just that so many of us in HPC lost sight of the real goal - MPI ranks on different machines exchanging chunks of data quickly - and we treated single-packet transit time as a proxy for it. 3/16
But *real* applications send big chunks of data to each other, and what governs "quickly" is whether the hundreds or thousands of packets that make up that data exchange arrive _intact_. 4/16
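Here's a minimal sketch of the kind of exchange I mean (the 8 MB size, the two-rank setup and the file name are just illustrative choices, not anything from WRF or GROMACS): two MPI ranks swap a buffer that, at a few KB of payload per packet, turns into roughly a thousand packets on the wire in each direction.

```c
/* toy_exchange.c - minimal sketch: two MPI ranks swap an 8 MB buffer.
 * At a few KB of payload per packet, each direction is on the order of
 * a thousand packets on the wire.
 * Build: mpicc toy_exchange.c -o toy_exchange
 * Run:   mpirun -n 2 ./toy_exchange
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const size_t nbytes = 8u << 20;      /* 8 MB: one "big chunk" of data */
    char *sendbuf = malloc(nbytes);
    char *recvbuf = malloc(nbytes);
    memset(sendbuf, rank, nbytes);
    int peer = 1 - rank;

    double t0 = MPI_Wtime();
    MPI_Sendrecv(sendbuf, (int)nbytes, MPI_BYTE, peer, 0,
                 recvbuf, (int)nbytes, MPI_BYTE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    printf("rank %d: exchanged %zu bytes in %.3f ms\n",
           rank, nbytes, (t1 - t0) * 1e3);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```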
On any normal network fabric (ours included) packets get lost or blocked by transient hot spots or congestion. That leads to tx/rx pairs having to spend cycles (hundreds of microseconds) figuring out that a packet needs to be sent again. 5/16
Most fabrics (like InfiniBand) and protocols (like TCP) send the packets in order, like a conga line (cue the cha-cha music). That means a single packet getting lost (very common) screws things up for all the packets behind it. 6/16
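A toy model of that conga line (all the constants are made up; it's only meant to show the mechanism, not to measure any real fabric): one dropped packet forces every packet queued behind it to wait for the retransmit before it can be delivered in order.

```c
/* hol_toy.c - toy model of in-order ("conga line") delivery.
 * Packet `lost` is dropped and resent after an RTO; every packet behind it
 * arrives on time but can't be handed up until the hole is filled.
 * The constants are illustrative only.
 */
#include <stdio.h>

int main(void)
{
    const int    npkts = 1000;   /* packets in one message                   */
    const double ser   = 1.0;    /* us to serialize one packet onto the wire */
    const double wire  = 5.0;    /* us one-way transit                       */
    const double rto   = 200.0;  /* us to notice the loss and resend         */
    const int    lost  = 100;    /* index of the single dropped packet       */

    double prev_deliver = 0.0, worst_stall = 0.0;
    int    stalled = 0;

    for (int i = 0; i < npkts; i++) {
        double arrive = i * ser + wire;
        if (i == lost)
            arrive += rto;                      /* the retransmitted packet */

        /* in-order rule: packet i can't be delivered before packet i-1 */
        double deliver = arrive > prev_deliver ? arrive : prev_deliver;
        if (deliver > arrive) {
            stalled++;
            if (deliver - arrive > worst_stall)
                worst_stall = deliver - arrive;
        }
        prev_deliver = deliver;
    }

    printf("one lost packet stalled %d later packets (worst extra wait %.0f us)\n",
           stalled, worst_stall);
    return 0;
}
```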
You can see why single-packet latency matters for these fabrics - it's *literally* going to make all the difference to how fast they can recover from a lost packet. 7/16
But the metric we *should have been looking at* was _p99 tail latency_, because it's the net result of all those lost packets and retransmits, and it's the one that hurts MPI codes the most: an MPI code is only as fast as its slowest rank. 8/16
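A sketch of how you might measure that (the 1 MB Allreduce and the iteration count are arbitrary stand-ins for whatever exchange your app actually does): time each iteration, take the max across ranks - the iteration can't finish before the slowest rank does - and report the p99 of those maxima.

```c
/* p99_probe.c - sketch of measuring the number that matters: the tail of
 * per-iteration completion time across ranks.  Each iteration costs the
 * MAX over ranks (nobody finishes before the slowest rank), and we report
 * the 99th percentile of that.
 * Build: mpicc p99_probe.c -o p99_probe ; Run: mpirun -n <N> ./p99_probe
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    iters  = 1000;
    const size_t nbytes = 1u << 20;                  /* 1 MB per rank      */
    char   *buf   = calloc(nbytes, 1);
    char   *out   = calloc(nbytes, 1);
    double *worst = malloc(iters * sizeof *worst);   /* slowest-rank times */

    for (int it = 0; it < iters; it++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        /* stand-in for the app's real halo exchange or collective */
        MPI_Allreduce(buf, out, (int)nbytes, MPI_BYTE, MPI_BOR, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;
        /* the iteration is only as fast as the slowest rank */
        MPI_Reduce(&dt, &worst[it], 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        qsort(worst, iters, sizeof *worst, cmp_double);
        printf("median %.1f us   p99 %.1f us\n",
               worst[iters / 2] * 1e6, worst[(iters * 99) / 100] * 1e6);
    }

    free(buf); free(out); free(worst);
    MPI_Finalize();
    return 0;
}
```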
We built our own reliable datagram protocol, SRD, and relaxed things like in-order delivery, in the belief that where ordering really is necessary we can re-assert it in the higher layers of the stack (and we were right). p99 tail latency _plummeted_ (by ~10x). Why? 9/16
Without the conga line model, SRD can push all the packets *at once* over all the possible pathways (in practice, ~64 at a time from the thousands available). That's a radical change in performance. 10/16
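To make the "relax ordering, re-assert it higher up" idea concrete, here's a toy receiver - emphatically not SRD or the EFA provider, just the concept: if every packet carries its chunk index, the receiver can drop payloads straight into the message buffer in whatever order the paths deliver them, and all that matters is when the last piece lands.

```c
/* reassembly_toy.c - toy illustration of re-asserting order above the wire:
 * packets arrive in any order, each carries its chunk index, and the message
 * is complete when all chunks have landed - nothing waits in a "conga line".
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   1024            /* payload bytes carried per packet        */
#define NCHUNKS 64              /* packets making up one message (<= 64)   */

struct packet {
    int  seq;                   /* which chunk of the message this carries */
    char payload[CHUNK];
};

struct message {
    char               buf[NCHUNKS * CHUNK];  /* reassembly buffer         */
    unsigned long long have;                  /* bitmap of chunks received */
    int                nrecv;                 /* distinct chunks received  */
};

/* handle one packet, in any order; returns 1 when the message is complete */
static int on_packet(struct message *m, const struct packet *p)
{
    unsigned long long bit = 1ULL << p->seq;
    if (!(m->have & bit)) {                   /* ignore duplicate retransmits */
        memcpy(m->buf + (size_t)p->seq * CHUNK, p->payload, CHUNK);
        m->have |= bit;
        m->nrecv++;
    }
    return m->nrecv == NCHUNKS;
}

int main(void)
{
    static struct message msg;                /* zero-initialized          */
    struct packet pkts[NCHUNKS];

    for (int i = 0; i < NCHUNKS; i++) {
        pkts[i].seq = i;
        memset(pkts[i].payload, 'a' + (i % 26), CHUNK);
    }

    /* shuffle to mimic out-of-order arrival over many network paths */
    srand(42);
    for (int i = NCHUNKS - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct packet tmp = pkts[i]; pkts[i] = pkts[j]; pkts[j] = tmp;
    }

    for (int i = 0; i < NCHUNKS; i++)
        if (on_packet(&msg, &pkts[i]))
            printf("message complete after %d packets; arrival order didn't matter\n",
                   i + 1);

    return 0;
}
```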
As a customer's app scales to even more nodes, SRD has more paths to pick from (we have a really large and pretty complicated network, so the number of paths is mind-blowing at times). 11/16
Customers get all this for free (in terms of code complexity) because we pack it all into our libfabric provider, and higher layers like Open MPI, Intel MPI and MVAPICH "just work". And HPC codes like WRF and GROMACS that live on top of MPI "just work", too. 12/16
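For the curious, here's roughly what asking libfabric for the EFA provider looks like - a sketch only, since the MPI libraries do this inside their OFI components and application codes never touch it, and the API version number I pass is just a plausible choice.

```c
/* efa_query.c - sketch of asking libfabric for the EFA provider.  This is
 * the layer underneath the MPI libraries' OFI components; applications
 * like WRF or GROMACS never call this themselves.
 * Build (with libfabric installed): gcc efa_query.c -lfabric
 */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL, *cur;

    hints->fabric_attr->prov_name = strdup("efa");  /* ask for EFA explicitly  */
    hints->ep_attr->type = FI_EP_RDM;               /* reliable datagram (SRD) */

    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider %s, domain %s\n",
               cur->fabric_attr->prov_name, cur->domain_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```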
So it's really pleasing to see performance disclosures like the WRF one last week (aws.amazon.com/blogs/hpc/nume…) from Karthik and Matt that show this happening for real with a really hard problem. 13/16
Ditto the results for GROMACS that Austin ran. (aws.amazon.com/blogs/hpc/grom…). 14/16
We're not fans of starting with a technology and trying to squeeze it into a solution. And that's why we came up with SRD and EFA. 15/16
We worked backwards from an essential problem (MPI ranks need to exchange lots of data quickly) and that meant a different solution for our circumstances. In the quest for performance, there's more than one way to skin a cat. /ENDS 16/16
