@_rogame found the configuration of Navi21 from driver files, confirming that 40WGPs/80CUs/5120 "cores" will be used.
Bringing both close together in terms of FP32 throughput.
2/x
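A quick sketch of how the Navi21 unit counts fit together. The 2 CUs per WGP and 64 lanes per CU are the ratios implied by the 40/80/5120 figures; the 2.0 GHz clock is purely a placeholder I picked to show the formula, not a leaked number.

```python
# Unit counts as reported for Navi21 in the thread.
WGPS = 40
CUS_PER_WGP = 2     # implied by 40 WGPs -> 80 CUs
LANES_PER_CU = 64   # implied by 80 CUs -> 5120 "cores"


def fp32_tflops(lanes: int, clock_ghz: float) -> float:
    """Peak FP32 rate, counting one FMA as two operations."""
    return lanes * 2 * clock_ghz / 1000.0


cus = WGPS * CUS_PER_WGP
lanes = cus * LANES_PER_CU
print(cus, lanes)  # 80 5120

# Hypothetical clock, chosen only to illustrate the formula.
ASSUMED_CLOCK_GHZ = 2.0
print(f"{fp32_tflops(lanes, ASSUMED_CLOCK_GHZ):.2f} TFLOPS at an assumed {ASSUMED_CLOCK_GHZ} GHz")
```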
With 6 GPCs we have 6 Rasterizers/Scan Converters = 96 Pixels per clock.
ROPs are tied to the Memory Controllers; with a 384-Bit interface we have 96 ROPs = 96 Pixels/clock.
3/x
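The GA102 balance above as a small calculation. The 16 pixels per rasterizer is the optimal case mentioned in the thread; the split into 32-bit memory controllers with 8 ROPs each is my assumption, chosen because it gives the 96 ROPs quoted for a 384-bit interface.

```python
PIXELS_PER_RASTERIZER = 16  # optimal case per rasterizer


def raster_pixels_per_clock(rasterizers: int) -> int:
    """Peak frontend pixel rate, one rasterizer per GPC."""
    return rasterizers * PIXELS_PER_RASTERIZER


def rop_count(bus_width_bits: int, rops_per_controller: int = 8) -> int:
    """ROPs scale with the number of memory controllers.

    Assumption: 32-bit controllers, 8 ROPs per controller.
    """
    controllers = bus_width_bits // 32
    return controllers * rops_per_controller


print(raster_pixels_per_clock(6))  # 96
print(rop_count(384))              # 96
```

Frontend and backend come out at the same 96 pixels/clock, which is the balanced case.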
Big Navi is different.
With 8 Rasterizers it can deliver 128 Pixels/clock, but the Pixel Backend can only output 64 Pixels/clock.
That configuration is also not too uncommon for AMD.
4/x
In practice many triangles will not cover 16 Pixels (which is the optimal case for one Rasterizer), so the Frontend is unlikely to ever deliver 128 pixels/clock.
I guess in practice that's not a terrible imbalance.
5/x
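To make the imbalance argument concrete, here is a rough model of the Big Navi frontend versus backend. It assumes each rasterizer processes one triangle per clock and emits at most 16 pixels from it; that is a simplification of mine, not a confirmed hardware detail.

```python
RASTERIZERS = 8
MAX_PIXELS_PER_RASTERIZER = 16
BACKEND_PIXELS_PER_CLOCK = 64


def frontend_rate(avg_pixels_per_triangle: float) -> float:
    """Pixels/clock the rasterizers can emit.

    Simplified model (assumption): one triangle per rasterizer per clock.
    """
    per_rasterizer = min(avg_pixels_per_triangle, MAX_PIXELS_PER_RASTERIZER)
    return RASTERIZERS * per_rasterizer


def delivered_rate(avg_pixels_per_triangle: float) -> float:
    """Pixels/clock after the pixel backend limit is applied."""
    return min(frontend_rate(avg_pixels_per_triangle), BACKEND_PIXELS_PER_CLOCK)


for coverage in (2, 4, 8, 16):
    print(coverage, frontend_rate(coverage), delivered_rate(coverage))
```

Under this model the backend only becomes the limiter once triangles cover more than 8 pixels on average; below that the frontend is the bottleneck anyway.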
Close to what the memory system should be able to deliver.
For RGBA8 you would be ROP limited.
I'm really not sure if backpressure will be a thing with Navi21.
6/x
Will it be 6 GPCs?
And what will one SM look like?
Volta has 128KB L1$/LDS per 64 ALUs; Turing reduced that to 96KB L1$/LDS and halved the cache bandwidth from 128 bytes to 64, as well as the number of Load/Store units.
7/x
I believe Nvidia will increase the size to at least 128KB in comparison to Turing, which is what I drew in the diagram.
But will it keep the lower bandwidth?
8/x
I guess that at least 128KB L1$/LDS is also good for CUDA compatibility, since Turing couldn't support all kernels written for Volta.
9/x
Nvidia went bonkers with GA100, having 2x24MB global L2$.
Navi10 uses 4MB under 7nm with a 256-Bit Interface.
Navi21 could just scale it to 6MB.
Nvidia could keep it for GA102 at 6MB like TU102.
10/x
I suspect the DRAM system for Navi21 will be 384-Bit with 16Gbps G6 chips, potentially up to 18Gbps; Samsung at least announced such speeds for G6.
11/x
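For reference, what those speculated memory configurations would mean in raw bandwidth. This is just bus width times per-pin data rate; the function name is mine.

```python
def dram_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak DRAM bandwidth in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


for speed in (16, 18):
    print(f"384-bit @ {speed} Gbps: {dram_bandwidth_gbs(384, speed):.0f} GB/s")
```

That works out to 768 GB/s at 16Gbps and 864 GB/s at 18Gbps.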
Such "fine" details are important, of course, but by and large, with regard to FP32 and bandwidth, I see them being rather close together.
12/x
GA102 could do 537.4 TOPS INT8 via the Tensor Cores @ 1.6 GHz -> 8.2x faster, and if Nvidia can sustain Tensor Cores and Shader Cores in parallel, it looks really bad.
14/x
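One way to arrive at the 537.4 TOPS and 8.2x figures. The inputs below are assumptions that reproduce the quoted numbers, not values stated in this part of the thread: 82 SMs with 4096 dense INT8 ops per clock per SM for GA102, and for Navi21 the 5120 lanes each doing a 4-wide INT8 dot product (counted as 8 ops) at the same 1.6 GHz.

```python
CLOCK_GHZ = 1.6

# Assumed GA102 configuration (not confirmed in the thread).
GA102_SMS = 82
INT8_OPS_PER_SM_PER_CLOCK = 4096  # dense Tensor Core rate, assumption

# Assumed Navi21 INT8 path: 4-wide dot product per lane, multiply + add.
NAVI21_LANES = 5120
INT8_OPS_PER_LANE_PER_CLOCK = 4 * 2


def tops(ops_per_clock: int, clock_ghz: float) -> float:
    """Tera-operations per second from ops/clock and clock in GHz."""
    return ops_per_clock * clock_ghz / 1000.0


ga102 = tops(GA102_SMS * INT8_OPS_PER_SM_PER_CLOCK, CLOCK_GHZ)
navi21 = tops(NAVI21_LANES * INT8_OPS_PER_LANE_PER_CLOCK, CLOCK_GHZ)
ratio = ga102 / navi21

print(f"GA102:  {ga102:.1f} TOPS")
print(f"Navi21: {navi21:.1f} TOPS")
print(f"ratio:  {ratio:.1f}x")
```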
It makes a very good impression in Death Stranding and while not perfect the results are incredible.
AMD currently has nothing comparable.
If AMD offers a software solution over DirectML, you can expect the scope to be smaller.
15/x
FP16 is super fast, TF32/BF16 could be supported.
With the sparsity feature it gets even crazier.
16/x
Fewer units and fewer features, because that hardware is of course not coming for free.
But even if Nvidia halved it, the throughput and capabilities would still vastly overshadow Navi21's abilities.
17/x
There is simply a better feature set behind it, and Nvidia's software ecosystem is a strong backbone.
AMD still has to catch up a lot.
20/20