Dan Ernst
May 31 · 25 tweets · 5 min read
As promised: Let's clarify what GPU system "Peak" performance means, a🧵

Let's start with three definitions:
1) Maximum Architectural Peak Rate is the number of FP64 Function Units (FUs) that can run in parallel multiplied by the maximum clock rate the chip is capable of. Note that this number need not be achievable under typical/any circumstances. (see: )
2) Power-Constrained Peak Rate (PCPeak) is the number of FP64 FUs that can run in parallel multiplied by the maximum clock rate achievable for that operation type under a given socket power limit (TDP).
For example, clock rates may be reduced vs architectural max when all FP64 units are operating in order to stay within the chip power envelope. This number may differ by which operation you're measuring.
3) Sustained Achieved Rate is the actual number of FP64 operations that can be achieved per second by software (e.g. HPL Rmax for Top500 purposes) running on the chip under the TDP.
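The relationship between the three definitions can be sketched in a few lines of Python. This is a minimal sketch, not a vendor formula. The helper name `peak_rate`, the factor of 2 FLOPs per unit per cycle (one fused multiply-add), and every number below are hypothetical and chosen only to show how the three rates relate.

```python
def peak_rate(num_fus: int, clock_hz: float, flops_per_fu_per_cycle: int = 2) -> float:
    """FLOP/s delivered by num_fus function units running at clock_hz.

    flops_per_fu_per_cycle=2 assumes each unit retires one fused
    multiply-add per cycle (an assumption, not from the thread).
    """
    return num_fus * flops_per_fu_per_cycle * clock_hz


# Hypothetical chip: all values are illustrative, not taken from any real part.
NUM_FP64_FUS = 10_000
MAX_CLOCK_HZ = 2.0e9      # highest clock the chip is capable of
TDP_CLOCK_HZ = 1.2e9      # clock sustainable with every FP64 unit busy under TDP
MEASURED_FLOPS = 1.9e13   # what a benchmark run actually sustained

arch_peak = peak_rate(NUM_FP64_FUS, MAX_CLOCK_HZ)   # 1) architectural peak
pc_peak = peak_rate(NUM_FP64_FUS, TDP_CLOCK_HZ)     # 2) power-constrained peak
sustained = MEASURED_FLOPS                          # 3) sustained achieved rate

for name, rate in [("arch peak", arch_peak), ("PCPeak", pc_peak), ("sustained", sustained)]:
    print(f"{name:>10}: {rate / 1e12:.1f} TF")
```

The only difference between (1) and (2) is which clock goes into the multiplication. (3) is measured rather than computed.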
Ok, with the tutorial out of the way, here's a slide from Lisa Su's MI200-series reveal a few months back. Which one of these three is being represented here?
(thanks for preserving this @Patrick1Kennedy)
If you answered "Architectural Peak Rate", you're correct. The nice multiples of 2 are a giveaway.

What would that mean for Frontier, assuming you could use the FP64 Matrix ops for HPL (you can)? 36,992 GPUs * 95.7 TF -> 3.54 EF. So why is Frontier only listed as about 1.69 EF Rpeak?
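The system-level arithmetic above, as a quick check in Python. Both inputs come from the thread; the variable names are just illustrative.

```python
GPUS = 36_992                  # GPU count used in the thread
ARCH_PEAK_TF_PER_GPU = 95.7    # FP64 matrix architectural peak per GPU, from the slide

# 1 EF = 1e6 TF
system_arch_peak_ef = GPUS * ARCH_PEAK_TF_PER_GPU / 1e6
print(f"{system_arch_peak_ef:.2f} EF")  # prints "3.54 EF"
```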
First, let's talk about how GPUs operate. Contrary to popular belief, there isn't a default/standard/typical clock speed in power-constrained systems (aka all systems these days).
GPUs use sophisticated control systems to detect what kind of operations are currently executing (and the resulting power draw) and then adjust the clocks to get the highest performance possible without "going over" and exceeding the sustainable power draw of the part.
This is effectively a generalization of what Nvidia has traditionally called "Boost Clock": nvidia.com/en-us/geforce/…
Another way to think of this is that each of those rows in Lisa's slide *runs at a different clock rate*. The most power-hungry operations could achieve a lower percentage of those max architectural numbers.
An obvious example is that the clock speed for FP64 will likely be much lower than the clock for a similar number of FP32 units. The clocks for matrix units will be significantly lower than for vector/scalar units, but the matrix units more than make up for it in FPU count and will thus be more efficient.
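A toy model makes the "different clock per row" idea concrete. This is a sketch under loud assumptions, not how any real GPU's power management works: dynamic power is modeled as linear in clock (real parts also scale voltage), and the power cap, static power, watts-per-GHz figures, and relative FLOPs per clock are all made-up numbers.

```python
def sustained_clock_ghz(power_cap_w: float, static_w: float,
                        watts_per_ghz: float, max_clock_ghz: float) -> float:
    """Highest clock that keeps static + dynamic power under the cap.

    Toy model: dynamic power = watts_per_ghz * clock. Hypothetical, for illustration.
    """
    headroom_w = power_cap_w - static_w
    return min(max_clock_ghz, headroom_w / watts_per_ghz)


# All values below are hypothetical.
POWER_CAP_W = 500.0
STATIC_W = 100.0
MAX_CLOCK_GHZ = 1.7
WATTS_PER_GHZ = {"fp32_vector": 200.0, "fp64_vector": 300.0, "fp64_matrix": 500.0}
# Relative FLOPs per clock: the matrix path has twice the FP64 units of the vector path.
FLOPS_PER_CLOCK = {"fp64_vector": 1.0, "fp64_matrix": 2.0}

clocks = {op: sustained_clock_ghz(POWER_CAP_W, STATIC_W, w, MAX_CLOCK_GHZ)
          for op, w in WATTS_PER_GHZ.items()}
throughput = {op: FLOPS_PER_CLOCK[op] * clocks[op] for op in FLOPS_PER_CLOCK}
vector_vs_matrix = throughput["fp64_vector"] / throughput["fp64_matrix"]

for op, ghz in clocks.items():
    print(f"{op:>12}: {ghz:.2f} GHz")
print(f"vector/matrix FP64 throughput ratio: {vector_vs_matrix:.2f}")
```

In this toy setup the lightest operation is limited by the maximum clock, while the heavier ones are limited by power. The matrix path runs at the lowest clock but still delivers the most FP64 throughput.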
For Frontier, we have architectural numbers (the slide) and we have the achieved rate (Rmax), so it's logical that the Top500 Rpeak is the Power-constrained peak. If you back-calculate it out per GPU, it comes to about 45.6 TF. This is likely for Matrix FP64 (most efficient).
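The back-calculation can be reproduced as follows. The Rpeak value used here (1,685.65 PFlop/s) is the figure from the June 2022 Top500 listing and is not stated in the thread, so treat it as an assumption. The GPU count and the 95.7 TF architectural number are from the thread.

```python
RPEAK_PF = 1_685.65            # assumption: Frontier Rpeak, June 2022 Top500 list
GPUS = 36_992                  # from the thread
ARCH_PEAK_TF_PER_GPU = 95.7    # FP64 matrix architectural peak, from the slide

# 1 PF = 1e3 TF
pcpeak_tf_per_gpu = RPEAK_PF * 1e3 / GPUS
fraction_of_arch = pcpeak_tf_per_gpu / ARCH_PEAK_TF_PER_GPU

print(f"PCPeak per GPU: {pcpeak_tf_per_gpu:.1f} TF")        # about 45.6 TF
print(f"Fraction of architectural peak: {fraction_of_arch:.0%}")
```

Under these inputs the listed Rpeak works out to a little under half of the architectural matrix number per GPU.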
It likely follows that the Vector FP64 PCPeak is somewhat lower, but - to be clear - it is likely well more than 1/2 of the Matrix FP64 as it will run at a higher clock in sustained operation.
Listing the PCPeak as Rpeak is absolutely the right answer in my opinion - and a natural result of the disconnect I called out back when Skylake and AVX-512 clocks were giving us head-scratching efficiencies.
The architectural peak numbers have no meaning in sustained useful operation (outside perhaps some very fleeting burst cases) as any sustained use at that rate would out-draw the power train it's attached to.
Side note: To put in perspective how much they don't matter, I "owned" the numbers of record for this part a couple of years ago at Cray, and the Lisa Su presentation was the first time I ever saw the architectural 95 TF number. It doesn't impact applications in the least.
Also, for the record, Frontier has always been listed as "> 1.5 EF" in all official material, which it is comfortably over, even using PCPeak. Unless something has changed, you should assume the same with other official numbers that are out with respect to Cray-primed systems.
At the chip level, this could get fuzzier if a different system could be built for a given chip to deliver/cool more power, capturing a higher percentage of the arch peaks. However, once a system is in place (e.g. Frontier), the operating parameters are pretty well locked.
There *is* a somewhat reasonable technical excuse for using arch rates for pre-announcements - it turns out estimating PCPeak *before you have final silicon* is very challenging. That still doesn't really make it useful to anyone estimating performance of software.
So this has largely been an AMD discussion thus far. Do the other GPU companies (Nvidia, Intel) do this? The answer is absolutely yes - and the fact they do so likely influenced AMD's decision to do the same. (See my earlier complaint...)
For example, you can guess at the math for PVC/Aurora at this point, if you'd like. At HotChips, Intel stated the architectural DP peak is 45 TF. Assuming they are targeting 2 EF FP64 PCPeak with ~10k nodes of 6 GPUs each (~60k GPUs), they would need to hit about 33 TF per GPU.
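The same guess, written out. The 2 EF target, the ~10k node count, and the 45 TF architectural figure are from the thread. The 6 GPUs per node is an assumption based on the publicly described Aurora node design and is what makes the 33 TF figure come out.

```python
TARGET_PCPEAK_EF = 2.0     # assumed target, from the thread
NODES = 10_000             # "~10k nodes", from the thread
GPUS_PER_NODE = 6          # assumption: public Aurora node design, not stated in the thread
ARCH_PEAK_TF = 45.0        # architectural DP peak stated at HotChips

# 1 EF = 1e6 TF
required_tf_per_gpu = TARGET_PCPEAK_EF * 1e6 / (NODES * GPUS_PER_NODE)
fraction_of_arch = required_tf_per_gpu / ARCH_PEAK_TF

print(f"Required PCPeak per GPU: {required_tf_per_gpu:.1f} TF")
print(f"Fraction of architectural peak: {fraction_of_arch:.0%}")
```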
At this point, that's about as far as one can extrapolate until there's public/non-NDA PVC hardware for someone to test. 🦄

(Disclaimer: I had very little interaction on the Aurora program so this is all sourced from public information).
So, that's the story on how GPU performance is marketed/reported. I hope you all learned something. @nicholasmalaya should definitely check my math. 😛

So what's the REAL peak performance of Frontier?

It's 121 PB/s 🥰
Addendum: @TDaytonPM or other journos feel free to use/ask w/attrib if this helps your articles. As a bonus, any questions you send me tomorrow morning might get really interesting answers since I'll be drugged up after getting my wisdom teeth pulled. 🤪
