I finished it (!), 9 hours before AMD officially presents CDNA2/MI250X 🥳
It's basically the second rambling/analysis part for Aldebaran, going over some changes based on driver and compiler patches from AMD.
It's a technical mini spoiler, perhaps?
1/x
Disclaimer: I put this together in a short amount of time, so there might be quite a few issues.
________
Because AMD's driver mentions 110 CUs, it appears obvious to me that Aldebaran is not using 16 CUs per SE, but likely only 14 --> smaller chiplet size.
2/x
According to the patches, on Aldebaran 2 CUs always share 1x 32KiB I$ and 1x 16KiB K$.
1st image shows how the shader array is built on Vega10, Arcturus (CDNA1) and Aldebaran (CDNA2)
____
Aldebaran has a prefetch depth of 16 cache lines.
It's only 3 on RDNA/GFX10.
3/x
It's a bit funny that only Aldebaran and GFX10 are mentioned, since instruction prefetch was already a bullet point for GCN4/Polaris.
Besides I$/K$, the L1D$ is still just a tiny 16KB, the L2$ is 8MB, and the LDS should also still be just 64KB.
That's not so great if it really ends up this way.
4/x
What is great is the redesigned register file.
Instead of 64KB VGPRs + 64KB AGPRs (not usable by the vector SIMD), CDNA2 will have a unified 128KB reg file.
The split is programmable and the vector SIMD should be able to access it fully if no matrix ops are used.
5/x
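To make the reg-file change concrete, here's a toy occupancy sketch. Wave64 and 4-byte registers are standard GCN/CDNA; the exact split mechanics and the allocation model are my assumptions, not from the patches.

```python
# Sketch: wavefront occupancy per SIMD under a unified 128KB register file.
# Assumptions (mine, not AMD's spec): wave64, 32-bit registers, and an
# AGPR carve-out expressed as registers per thread.
KIB = 1024
REG_FILE = 128 * KIB          # unified VGPR+AGPR file per SIMD on CDNA2
WAVE_SIZE = 64                # threads per wavefront
REG_BYTES = 4                 # one 32-bit register per thread

def waves_per_simd(vgprs_per_thread: int, agpr_split: int = 0) -> int:
    """How many wavefronts fit, given VGPR demand and an AGPR carve-out."""
    usable = REG_FILE - agpr_split * WAVE_SIZE * REG_BYTES
    bytes_per_wave = vgprs_per_thread * WAVE_SIZE * REG_BYTES
    return usable // bytes_per_wave

# No matrix ops -> the whole file is available for VGPRs:
print(waves_per_simd(64))                  # 8 waves
# CDNA1-style fixed split (64KB VGPR + 64KB AGPR):
print(waves_per_simd(64, agpr_split=256))  # 4 waves
```

So a vector-only kernel with heavy register demand could, under these assumptions, keep twice as many waves in flight as on the fixed CDNA1 split.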
AMD changed the VGPR allocation granularity from 4 to 8, just as they did from RDNA1 to RDNA2.
I guess that's a better trade-off, and over-allocation does not happen often.
6/x
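The granularity change just means a kernel's VGPR request is rounded up in steps of 8 instead of 4. A hypothetical sketch of the over-allocation cost:

```python
import math

def alloc_vgprs(requested: int, granularity: int = 8) -> int:
    """Round a shader's VGPR request up to the allocation granularity."""
    return math.ceil(requested / granularity) * granularity

# Coarser granularity only costs something when the request
# isn't already a multiple of 8:
for n in (33, 40, 41):
    print(n, alloc_vgprs(n, 4), alloc_vgprs(n, 8))
# 33 -> 36 vs. 40 (worst case: 7 wasted regs)
# 40 -> 40 vs. 40 (no waste)
# 41 -> 44 vs. 48
```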
The vector SIMDs got a huge upgrade.
SP/DP implementation is a fascinating topic with, unfortunately, not very accessible information.
My interpretation: AMD is using mixed-precision ALUs for FP32 and FP64 ops.
Nvidia uses separate FP64 units, trading area for better efficiency.
7/x
IIRC VLIW5 was AMD's first hw with FP64 support; the FP64:FP32 ratio was 1:5 (not on all GPUs).
On GCN it's configurable from 1:2 (half-rate) to 1:16.
That's been the limit since 2014 with Hawaii (FirePro).
Why no 1:1/Full Rate option?
I remember...
8/x
...some old forum discussions on @3DCenter_org arguing that full-rate is not such a great idea for any market that also needs FP32: with a bit of extra logic, FP64 hw can process FP32 faster, so FP64 full-rate would rather be FP32 slow-rate.
9/x
It's a subject beyond my layman's knowledge.
I would love more information in that regard and on how it really looks.
Anyway, CDNA2 comes with full-rate FP64 support. ;)
Also 64 Data Parallel Primitives for cross lane operations.
10/x
My simplified view on this is that the internal bit-width of the ADD/MUL logic was increased.
I simply describe it as 64-bit lanes vs. the previous 32-bit lanes.
It may not be entirely accurate, but I think it checks out with how it effectively works.
11/x
AMD is keeping the 4-cycle issue and not building a different execution model.
So I think the SIMD units still have 16 lanes, and each lane can process 1x FP64, 1x FP32 or 1x FP16 op.
Same throughput for all, unless you use packed-math instructions...
12/x
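My mental model of that issue cadence, as a sketch (SIMD16 + wave64 is my reading of the patches, not confirmed by AMD):

```python
# Sketch of the 4-cycle issue model (my simplification): a wave64
# instruction is pushed through a 16-lane SIMD over 4 cycles,
# regardless of precision.
WAVE64 = 64
SIMD_WIDTH = 16

def issue_cycles(wave_size: int = WAVE64, simd_width: int = SIMD_WIDTH) -> int:
    """Cycles a SIMD is occupied per instruction: same for FP64/FP32/FP16."""
    return wave_size // simd_width

def ops_per_simd_per_cycle(packed: bool = False) -> int:
    """Packed math doubles per-lane work; the issue cadence stays the same."""
    return SIMD_WIDTH * (2 if packed else 1)

print(issue_cycles())                # 4 cycles
print(ops_per_simd_per_cycle())      # 16 ops/cycle, scalar
print(ops_per_simd_per_cycle(True))  # 32 ops/cycle with packed FP32/FP16
```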
GCN3/4 only had same-rate FP16 ops; GCN5 brought packed math for FP16.
2x FP16 inputs are loaded and the same operation is applied to both, doubling the throughput while still being executed by only one SIMD lane.
CDNA2 uses the same trick for FP32; packed FP32 instructions are new.
13/x
Driver patches don't mention anything like quad-packed FP16, so packed FP32 and packed FP16 ops now have the same theoretical throughput.
Compared to CDNA1 you get twice the FP64 and twice the FP32 throughput.
The unified reg file comes in handy to feed this.
14/x
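Back-of-the-envelope check of what that doubling buys. The 220 active CUs and ~1.7 GHz clock for MI250X (both dies) are my assumptions; the rest follows from full-rate FP64, packed FP32, and 64 lanes per CU:

```python
# Peak vector throughput sketch (FMA counts as 2 ops).
# Assumed MI250X config: 220 CUs across both dies, 64 lanes/CU, ~1.7 GHz.
def tflops(cus: int, lanes: int = 64, clock_ghz: float = 1.7,
           fma: int = 2, packed: int = 1) -> float:
    return cus * lanes * fma * packed * clock_ghz / 1000

print(round(tflops(220), 1))            # FP64 full-rate: ~47.9 TFLOPS
print(round(tflops(220, packed=2), 1))  # packed FP32/FP16: ~95.7 TFLOPS
```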
Now, the Matrix v2 units did not get a huge upgrade.
It's my understanding that the BF16 rate was doubled, making it equal to FP16 throughput, which was not the case on CDNA1.
A new capability is matrix FP64 support, at half rate vs. FP32.
Otherwise apparently no upgrades.
15/x
No TF32 support, nor higher BF16, FP16 or INT8 throughput.
In fact you should now get the same FP32 and FP64 throughput on the vector and matrix units.
It's clear that AMD's focus was on the vector side, with rather little matrix innovations.
16/x
Now the money slides...
Nvidia A100 vs. AMD MI100 & MI250X vs. Intel Ponte Vecchio.
Let's start with the good things on paper.
FP64 throughput is very high, competitive with PVC unless Intel significantly increases clocks.
A0 PVC silicon runs at ~1.4 GHz.
With FP32 the MI250X is the king of the hill.
Intel has full-rate FP64 but no packed math for FP32, so AMD wins basically by a factor of 2.
AMD is also the only one supporting matrix FP32 ops.
_
Now the bad things: AMD is clearly not focused on low-precision throughput.
18/x
Intel wins that paper battle by a factor of 2 (FP/BF16).
The MI250X beats the A100, but it uses more silicon and has a higher power rating, not to mention that the A100 has been on the market for a long time.
For matrix INT8 applications the MI250X gets beaten badly.
19/x
What looks generally terrible on Aldebaran is the memory subsystem.
Per CU, AMD is still using a tiny 16KB L1D$ and 64KB of scratchpad memory, 80KB of storage combined.
The A100 uses 192KB per SM; it's configurable and can support larger L1$ or scratchpad sizes.
20/x
Intel goes bonkers with 512KB per Xe Core.
In total you have ~17.19MB on MI250X, 20.25MB on A100 and 64MB on PVC.
And then there is the L2$.
The A100 has 40MB of L2$ (48 physically); Aldebaran, according to driver patches, has just 8MB per chiplet, 16MB in total.
Intel -> 2x 144MB L2$.
21/x
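The per-CU/SM totals above add up like this; the 108 SMs (A100) and 128 Xe Cores (PVC, both tiles) are my assumed unit counts:

```python
# L1$ + scratchpad capacity totals from the thread (my arithmetic).
KIB = 1024

def total_mib(units: int, kib_per_unit: int) -> float:
    return units * kib_per_unit / KIB

print(total_mib(220, 80))   # MI250X: 220 CUs * (16KB L1D$ + 64KB LDS) ~ 17.19 MiB
print(total_mib(108, 192))  # A100: 108 SMs * 192KB = 20.25 MiB
print(total_mib(128, 512))  # PVC: 128 Xe Cores * 512KB = 64 MiB
```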
I think AMD will heavily rearchitect the cache hierarchy on RDNA3 and CDNA3.
The L1$ & LDS have to be unified or enlarged; the LLC is great on RDNA2 and will get better on RDNA3, but CDNA3 has to get it too.
I'm not sure if CDNA2 still has surprises (cache-wise) in store for today.
22/x
To some extent this paperwork doesn't really matter. Real product performance and business deals do.
AMD got "at least" the first western exascale deal with Frontier, a great achievement.
One big reason was the coherent link between CDNA2 and Trento (custom Milan).
23/x
Device memory is cacheable by Trento.
However, I wonder if it's already using the 3rd Gen Infinity Architecture or some pre-form of it?
There are many other questions, like who gets access to Trento?
Who has to wait for Genoa for a coherent link between CDNA2+ and CPUs?
24/x
In 7 hours we will hopefully get some answers.
Before I forget to mention it: CDNA2 supports a threadgroup split mode, bypassing L1$/LDS memory to schedule waves from a workgroup on any compute unit.
I don't know what the specific application field for that would be.
25/25
The N31 reveal got a couple of big surprises, in both good and bad ways.
A good surprise was AMD sharing die shots of Navi31, the GCD and MCD dies!
I took a first look at Navi31; due to the usual pixel mess it's a simple one and may include misinterpretations.
1/x
Awkward pause, but a few points got me thinking, and I checked a few things.
In addition, I have little time, so I have to fire off those semi-random thoughts quickly.
____
Because RDNA3 has no legacy pipeline anymore, you would expect less geometry processing hardware.
The...
2/x
command frontend in the past likely had the central geometry processor.
The whole section took quite a bit more area on N22 than the 1.5 MB L2$.
On Navi31 that section looks a lot smaller in relative terms.
I would love to know whether everything is now compute-emulated.
The short Zen 4 die shot analysis is now freely available on YouTube!
Key slides and points will also be included in this Twitter thread, while the text version on Patreon and Substack will stay for paid subscribers only.
Well, actually I'm going to bed soon and may finish the Twitter thread later.
However, let's start with something.
1. Die sizes based on a package photo, a rendering of it and AMD's official product page listings.
Somebody with access should measure it directly with a caliper.
2. The Zen 4 die shot of the compute die which AMD included in their livestream presentation, stretched to the correct proportions and sharpened.
Not perfectly done, but enough for the first overview.
A few highlights and extras will be mentioned in this Twitter thread.
1/x
Alder Lake-S is the first consumer chip to bring PCIe 5 support, and it's always interesting to see how a new standard looks on a die shot.
With twice the transfer speed, are the PHYs larger?
AMD has PCIe3&4 blocks on basically the same node, which share the same size.
2/x
There are multiple PHY blocks supporting the same PCIe version with fairly large size differences.
Since the intrinsic scaling for analog devices with a new process node is small to non-existent, the size should be influenced by the block and packaging design (I/O density).
This video includes what caught my eye after skimming through the open source driver patches for RDNA3.
It goes over IP versions, some feature definitions, FSR code lines & more.
This table compares the version numbers of the IP blocks used in AMD RDNA1, 2 and 3 discrete GPUs.
APUs and custom consoles are not taken into account.
Some IP blocks there have a different major.minor.revision number.
For example, Rembrandt uses SMU 13.0.1 and VCN 3.1.1.
2/x
Of particular interest is the Micro Engine Scheduler (MES) block, which is only now receiving sw support.
According to AMD's bridgman it is intended to replace the Compute Hardware Scheduler (HWS) and to provide hardware scheduling for the graphics queues for the first time.
I proudly present... another audio mess..., I mean the second part of the DG2 Alchemist analysis and discussion.
As usual, the main points will also be covered in this twitter thread.
1/x 🧵
Die sizes of N22, GA104 and DG2-512.
Actually, the GA104 is likely closer to 400mm² including the scribe lines.
It's hard to make a fair comparison.
Different process nodes with other design trade-offs, differences in spending like for display, matrix units, ray tracing, etc.
2/x
A couple of cool things on DG2.
DisplayPort 2.0, with at least 4x pipes on DG2-512.
AV1 encoding support; AMD & Nvidia will follow a couple of months later with RDNA3 & Ampere-Next.
FF transcoding speed and quality is already strong/best on Xe LP vs. Turing/Ampere NVENC Gen7.
I tried really hard not to make a multipart video series again, but it ended up being ~1 hour long...
I had to cut it; the first part is now online, stuff I've worked on since August.
Well here it is, Intel's DG2 Alchemist vs. AMD N22 and NV GA104.
1/x 🧵
The first video part is only showing theoretical throughput comparisons and how Intel, AMD and Nvidia scale their GPU configurations.
But first a bit of history, in 1998 Intel released their first dGPU, the i740, and it would be the last one till DG1 in 2021...
2/x
Over the years, one vendor after another left the discrete graphics market, till only ATI/AMD and Nvidia graphics cards remained for nearly 20 years!
However, DG1 was not a liberating blow that got customers excited.
It's only available in pre-built systems, with GT 1030 perf.