When we (@awscloud) started building EFA, our fast network fabric for HPC, I was skeptical about whether we'd be able to run the really hard "latency-sensitive" codes like weather simulations or molecular dynamics. Boy was I wrong. Turns out: we rock at these codes. 1/16
We learned, though, that single-packet latency is a distraction when you're trying to predict code performance on an HPC cluster.

Don't misunderstand me, though: it's not irrelevant. 2/16
It's just that so many of us in HPC lost sight of the real goal - MPI ranks on different machines exchanging chunks of data quickly - and we treated single-packet transit time as a proxy for it. 3/16
But *real* applications send big chunks of data to each other, and what governs "quickly" is whether the hundreds or thousands of packets that make up that data exchange arrive _intact_. 4/16
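Here's a minimal sketch of the kind of exchange I mean (the 8 MB size, the two-rank setup and the file name are just illustrative choices, not anything from WRF or GROMACS): two MPI ranks swap a buffer that, at a few KB of payload per packet, turns into roughly a thousand packets on the wire in each direction.

```c
/* toy_exchange.c - minimal sketch: two MPI ranks swap an 8 MB buffer.
 * At a few KB of payload per packet, each direction is on the order of
 * a thousand packets on the wire.
 * Build: mpicc toy_exchange.c -o toy_exchange
 * Run:   mpirun -n 2 ./toy_exchange
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const size_t nbytes = 8u << 20;      /* 8 MB: one "big chunk" of data */
    char *sendbuf = malloc(nbytes);
    char *recvbuf = malloc(nbytes);
    memset(sendbuf, rank, nbytes);
    int peer = 1 - rank;

    double t0 = MPI_Wtime();
    MPI_Sendrecv(sendbuf, (int)nbytes, MPI_BYTE, peer, 0,
                 recvbuf, (int)nbytes, MPI_BYTE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    printf("rank %d: exchanged %zu bytes in %.3f ms\n",
           rank, nbytes, (t1 - t0) * 1e3);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```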
On any normal network fabric (ours included) packets get lost or blocked by transient hot spots or congestion. That leads to tx/rx pairs having to spend cycles (hundreds of microseconds) figuring out that a packet needs to be sent again. 5/16
Most fabrics (like InfiniBand) and protocols (like TCP) send the packets in order, like a conga line (cue the cha-cha music). That means a single packet getting lost (very common) screws things up for all the packets behind it. 6/16
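A toy model of that conga line (all the constants are made up; it's only meant to show the mechanism, not to measure any real fabric): one dropped packet forces every packet queued behind it to wait for the retransmit before it can be delivered in order.

```c
/* hol_toy.c - toy model of in-order ("conga line") delivery.
 * Packet `lost` is dropped and resent after an RTO; every packet behind it
 * arrives on time but can't be handed up until the hole is filled.
 * The constants are illustrative only.
 */
#include <stdio.h>

int main(void)
{
    const int    npkts = 1000;   /* packets in one message                   */
    const double ser   = 1.0;    /* us to serialize one packet onto the wire */
    const double wire  = 5.0;    /* us one-way transit                       */
    const double rto   = 200.0;  /* us to notice the loss and resend         */
    const int    lost  = 100;    /* index of the single dropped packet       */

    double prev_deliver = 0.0, worst_stall = 0.0;
    int    stalled = 0;

    for (int i = 0; i < npkts; i++) {
        double arrive = i * ser + wire;
        if (i == lost)
            arrive += rto;                      /* the retransmitted packet */

        /* in-order rule: packet i can't be delivered before packet i-1 */
        double deliver = arrive > prev_deliver ? arrive : prev_deliver;
        if (deliver > arrive) {
            stalled++;
            if (deliver - arrive > worst_stall)
                worst_stall = deliver - arrive;
        }
        prev_deliver = deliver;
    }

    printf("one lost packet stalled %d later packets (worst extra wait %.0f us)\n",
           stalled, worst_stall);
    return 0;
}
```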
You can see why single-packet latency matters for these fabrics - it's *literally* going to make all the difference to how fast they can recover from a lost packet. 7/16
But the metric we *should have been looking at* was _p99 tail latency_, because it's the net result of all those lost packets and retransmits, and it's the one that hurts MPI codes the most: an MPI code is only as fast as its slowest rank. 8/16
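A sketch of how you might measure that (the 1 MB Allreduce and the iteration count are arbitrary stand-ins for whatever exchange your app actually does): time each iteration, take the max across ranks - the iteration can't finish before the slowest rank does - and report the p99 of those maxima.

```c
/* p99_probe.c - sketch of measuring the number that matters: the tail of
 * per-iteration completion time across ranks.  Each iteration costs the
 * MAX over ranks (nobody finishes before the slowest rank), and we report
 * the 99th percentile of that.
 * Build: mpicc p99_probe.c -o p99_probe ; Run: mpirun -n <N> ./p99_probe
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    iters  = 1000;
    const size_t nbytes = 1u << 20;                  /* 1 MB per rank      */
    char   *buf   = calloc(nbytes, 1);
    char   *out   = calloc(nbytes, 1);
    double *worst = malloc(iters * sizeof *worst);   /* slowest-rank times */

    for (int it = 0; it < iters; it++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        /* stand-in for the app's real halo exchange or collective */
        MPI_Allreduce(buf, out, (int)nbytes, MPI_BYTE, MPI_BOR, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;
        /* the iteration is only as fast as the slowest rank */
        MPI_Reduce(&dt, &worst[it], 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        qsort(worst, iters, sizeof *worst, cmp_double);
        printf("median %.1f us   p99 %.1f us\n",
               worst[iters / 2] * 1e6, worst[(iters * 99) / 100] * 1e6);
    }

    free(buf); free(out); free(worst);
    MPI_Finalize();
    return 0;
}
```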
We built our own reliable datagram protocol, SRD, and relaxed things like in-order delivery, in the belief that where ordering really is necessary we can re-assert it in the higher layers of the stack (and we were right). p99 tail latency _plummeted_ (by ~10x). Why? 9/16
Without the conga line model, SRD can push all the packets *at once* over all the possible pathways (in practice, ~64 at a time from the thousands available). That's a radical change in performance. 10/16
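To make the "relax ordering, re-assert it higher up" idea concrete, here's a toy receiver - emphatically not SRD or the EFA provider, just the concept: if every packet carries its chunk index, the receiver can drop payloads straight into the message buffer in whatever order the paths deliver them, and all that matters is when the last piece lands.

```c
/* reassembly_toy.c - toy illustration of re-asserting order above the wire:
 * packets arrive in any order, each carries its chunk index, and the message
 * is complete when all chunks have landed - nothing waits in a "conga line".
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   1024            /* payload bytes carried per packet        */
#define NCHUNKS 64              /* packets making up one message (<= 64)   */

struct packet {
    int  seq;                   /* which chunk of the message this carries */
    char payload[CHUNK];
};

struct message {
    char               buf[NCHUNKS * CHUNK];  /* reassembly buffer         */
    unsigned long long have;                  /* bitmap of chunks received */
    int                nrecv;                 /* distinct chunks received  */
};

/* handle one packet, in any order; returns 1 when the message is complete */
static int on_packet(struct message *m, const struct packet *p)
{
    unsigned long long bit = 1ULL << p->seq;
    if (!(m->have & bit)) {                   /* ignore duplicate retransmits */
        memcpy(m->buf + (size_t)p->seq * CHUNK, p->payload, CHUNK);
        m->have |= bit;
        m->nrecv++;
    }
    return m->nrecv == NCHUNKS;
}

int main(void)
{
    static struct message msg;                /* zero-initialized          */
    struct packet pkts[NCHUNKS];

    for (int i = 0; i < NCHUNKS; i++) {
        pkts[i].seq = i;
        memset(pkts[i].payload, 'a' + (i % 26), CHUNK);
    }

    /* shuffle to mimic out-of-order arrival over many network paths */
    srand(42);
    for (int i = NCHUNKS - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct packet tmp = pkts[i]; pkts[i] = pkts[j]; pkts[j] = tmp;
    }

    for (int i = 0; i < NCHUNKS; i++)
        if (on_packet(&msg, &pkts[i]))
            printf("message complete after %d packets; arrival order didn't matter\n",
                   i + 1);

    return 0;
}
```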
As a customer's app scales to even more nodes, SRD has more paths to pick from (we have a really large and pretty complicated network, so the number of paths is mind-blowing at times). 11/16
Customers get all this for free (in terms of code complexity) because we pack it all into our libfabric provider, and higher layers like Open MPI, Intel MPI and MVAPICH "just work". And HPC codes like WRF and GROMACS that live on top of MPI "just work", too. 12/16
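For the curious, here's roughly what asking libfabric for the EFA provider looks like - a sketch only, since the MPI libraries do this inside their OFI components and application codes never touch it, and the API version number I pass is just a plausible choice.

```c
/* efa_query.c - sketch of asking libfabric for the EFA provider.  This is
 * the layer underneath the MPI libraries' OFI components; applications
 * like WRF or GROMACS never call this themselves.
 * Build (with libfabric installed): gcc efa_query.c -lfabric
 */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL, *cur;

    hints->fabric_attr->prov_name = strdup("efa");  /* ask for EFA explicitly  */
    hints->ep_attr->type = FI_EP_RDM;               /* reliable datagram (SRD) */

    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider %s, domain %s\n",
               cur->fabric_attr->prov_name, cur->domain_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```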
So it's really pleasing to see performance disclosures like the WRF one last week (aws.amazon.com/blogs/hpc/nume…) from Karthik and Matt that show this happening for real with a really hard problem. 13/16
Ditto the results for GROMACS that Austin ran. (aws.amazon.com/blogs/hpc/grom…). 14/16
We're not fans of starting with a technology and trying to squeeze it into a solution. And that's why we came up with SRD and EFA. 15/16
We worked backwards from an essential problem (MPI ranks need to exchange lots of data quickly) and that meant a different solution for our circumstances. In the quest for performance, there's more than one way to skin a cat. /ENDS 16/16
