1/ How should we talk about DP FLOPS for A100? A thread. #HPC #AI #Nvidia
2/ Nvidia announced the A100 this morning, and you can read about it from any number of outlets, e.g. nextplatform.com/2020/05/14/nvi…
3/ Nvidia, as usual, have themselves published a great set of detailed blogs about the architecture and specs, really diving into the approaches they've taken and how they might impact workloads.
devblogs.nvidia.com/nvidia-ampere-…
4/ There are a couple of specific lines in the overview that I think are worth talking about as an #HPC community. Here they are:
5/ A100's peak DP performance of 9.7 TF is a 30% uplift over V100's 7.5 TF; however, that comes with a socket-level power uplift of 33%. This is a 7nm chip versus V100's 12nm. So where did all the new capability go?

Specialization.
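A quick back-of-the-envelope on those two uplift figures (assuming the published SXM TDPs of 400 W for A100 and 300 W for V100; the wattages are my addition, not from the thread):

\[
\frac{9.7\ \mathrm{TF}}{7.5\ \mathrm{TF}} \approx 1.29 \quad (+30\%\ \mathrm{FLOPS})
\qquad
\frac{400\ \mathrm{W}}{300\ \mathrm{W}} \approx 1.33 \quad (+33\%\ \mathrm{power})
\]

So non-tensor DP FLOPS per watt is roughly flat (9.7/400 ≈ 0.024 TF/W versus 7.5/300 ≈ 0.025 TF/W) despite the move from 12nm to 7nm. That flat ratio is the gap the specialized units discussed next are filling.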
6/ A100 obviously has a ton of interesting new features (e.g. acceleration for structured sparsity, TF32) that are focused on delivering boosts to common data and compute patterns. The one I want to call out here is the expansion of the "Tensor Core" all the way to FP64.
7/ The new DP Tensor Core uses a "DMMA" operation to multiply two 2x4 FP64 matrix panels in a single instruction. With this enhanced throughput, Nvidia quotes a second performance number: 19.5 TF.
Nvidia's math libraries will make use of this heavily. blogs.nvidia.com/blog/2020/05/1…
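For a sense of what "make use of this heavily" looks like in practice, here is a minimal sketch (mine, not from the thread) of the usual path: an ordinary cublasDgemm call, which cuBLAS on A100 can route to the FP64 Tensor Cores for suitable problem shapes, with no DMMA-specific API needed.

```cpp
// Minimal sketch (my example, not Nvidia's): on A100, a plain
// double-precision GEMM through cuBLAS can be serviced by the FP64
// Tensor Cores (DMMA) transparently -- the calling code is unchanged.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void dgemm_on_a100(int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    double *A, *B, *C;
    cudaMalloc(&A, sizeof(double) * n * n);
    cudaMalloc(&B, sizeof(double) * n * n);
    cudaMalloc(&C, sizeof(double) * n * n);

    const double alpha = 1.0, beta = 0.0;
    // C = alpha * A * B + beta * C, column-major, n x n operands.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cudaFree(A); cudaFree(B); cudaFree(C);
    cublasDestroy(handle);
}
```

This is the point about the software lifting: any code already calling DGEMM picks up the new units for free.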
8/ Those familiar with microarchitecture will understand that this is a far more efficient way to do a matrix calculation than individual FMAs - you skip all the control flow of instruction issue while also avoiding the need for register writes/reads of all the intermediate sums.
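To make that concrete, here is one plausible reading of the panel math above (the shapes and this scalar helper are my illustration, not Nvidia's spec): C (2x2) += A (2x4) * B (4x2). Done with plain FMAs, that is 16 separate instruction issues, each reading and writing the register file for its running sum:

```cpp
// Hypothetical scalar equivalent of one DMMA panel update (my
// illustration, not Nvidia's definition): C(2x2) += A(2x4) * B(4x2).
void panel_fma(const double A[2][4], const double B[4][2], double C[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 4; ++k)
                // One FMA per iteration: 16 issues, 16 round trips of
                // the partial sum through the register file.
                C[i][j] += A[i][k] * B[k][j];
}
```

DMMA retires the whole panel as a single instruction issue and keeps the partial sums inside the Tensor Core datapath, which is where the efficiency comes from.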
9/ What matters more to things like #Top500 is that this DMMA acceleration works for HPL, but it narrows (further) the application space that can actually achieve that level of FLOPS. That distinction isn't something the current #Top500 metrics can capture.
10/ This is the nature of "late-Moore" specialization: resources are applied to subsets of workloads (in this case, dense matrix math) to accelerate them, while having limited value to others.
11/ It is important for #HPC - as a community - to be more detail-oriented and to keep track of what is being accelerated and what is being left behind as we move into more of this specialization. We really need a broader set of metrics and benchmarks to do that.
12/ P.S. a lot of people have leapt to thinking about "whole-chip" specialization in this era. A100 (and Intel x86 extensions, and SVE...) shows there's still a ton of room for applying specialization within existing cores and architectures in a way that makes adoption easy.
13/ P.P.S. congrats to Nvidia on launching another fascinating chip - and more importantly committing to doing the software lifting to make it useful for real #HPC and #AI workloads.