Follow @Locuza_

12,399 views

Locuza

Follow @Locuza_

, 20 tweets, 5 min read

My Authors

I think even for a speed rambling video the time is too short with Ampere's presentation coming in less than three hours which is why a picture thread with my thoughs will follow.
Ampere vs. Big Navi.🔥
1/x

@_rogame

@_rogame

The specs with 5248 "CUDA cores" are already out there for the GTX3090.
@_rogame found the configuration of Navi21 from driver files, confirming that 40WGPs/80CUs/5120 "cores" will be used.
Bringing both close together in terms of FP32 throughput
2/x

If 84 SMs are the maximum configuration of the GA102 chip then 6 GPCs are fitting, with 14SM each.
With 6 GPCs we have 6 Rasterizer/Scan Converter = 96 Pixels per clock.
ROPs are tied to Memory Controllers, with 384-Bit we have 96 ROPs = 96 Pixels/clock
3/x

Nvidia has a nice 1:1 ratio between Raster Frontend and Raster Backend.
Big Navi is different.
With 8 Rasterizers it can deliver 128 Pixels/clock but the Pixel Backend can just output 64 Pixels/clock.
That configuration is also not too uncommon for AMD.
4/x

Multiple AMD chips had a 2:1 ratio, like Tonga or Polaris10.
In practise many triangles will not cover 16 Pixels (which is the optimal case for one Rasterizer), so the Frontend is unlikely to ever deliver 128 pixels/clock.
I guess in practise that's not a terrible imbalance
5/x

For RGBA16 (16-Bit per Channel/64-Bit total) the Backend could output 819.2 GB/s raw.
Close to what the memory system should be able to deliver.
For RGBA8 you would be ROP limited.
I'm really not sure if backpressure will be a think with Navi21.
6/x

In regards to Ampere I'm curious about the architecture.
Will it be 6 GPCs?
And how will one SM look like?
Volta has 128KB L1$/LDS per 64 ALUs, Turing reduced that to 96KB L1$/LDS and halfed the cache bandwitdh from 128 bytes to 64 + the amount of Load/Store units.
7/x

A100 for the Data Center has 192KB L1$/LDS, will Gaming Ampere keep that or reduce it to 128KB or even less?
I believe Nvidia will increase the size to at least 128KB in comparison to Turing, which is what I drew in the diagram.
But will it keep the lower bandwidth?
8/x

With the 7nm budget a larger cache subsystem should be affordable and keeping more data local is always nice, also for Raytracing.
I guess that at least 128KB L1$/LDS are also good for CUDA compability, since Turing couldn't support all kernels for Volta.
9/x

The L2$ question is also open.
Nvidia went ponkers with GA100, having 2x24MB global L2$.
Navi10 uses 4MB under 7nm with a 256-Bit Interface.
Navi21 could just scale it to 6MB.
Nvidia could keep it for GA102 at 6MB like TU102.
10/x

Someone may make the jump to 9 or 12MB L2$ but for gaming variants I kept it conservative with 6MB.
I suspect the DRAM system for Navi21 will be 384-Bit with 16Gbps G6 chips, potentially up to 18Gbps, Samsung at least announced such speeds for G6.
11/x

With GDDR6X @ 19.5Gbps Nvidia will have the bandwidth advantage by at least 8%, in the worst case for Navi21 with 16Gbps by 22%.
Such "fine" details are important of course but by and large in regards to FP32 and Bandwidth I see them being rather close together.
12/x

Which brings me to the largest differences between them, where I see Nvidia shining and AMD being the secondary option.
It's Ray Tracing and Deep Learning power.
If the Tensor Cores are the same as in GA100, the throughput will be immense.
13/x

Big Navi has dot-instructions for 4xINT8 or 8xINT4 throughput, meaning ~65.5 TeraOPs total INT8 @ 1.6 GHz.
GA102 could do 537.4 TOPs INT8 over the Tensor Core @ 1.6 GHz -> 8.2x faster and if Nvidia can sustain Tensor Cores and Shader Cores in parallel, it looks really bad.
14/x

DLSS 2.0 grows to a real killer feature.
It makes a very good impression in Death Stranding and while not perfect the results are incredible.
AMD currently has nothing comparable.
If AMD will offer a software solution over DirectML you can expect the scope to be smaller.
15/x

Max throughput is a lot lower than on Nvidia and INT8/INT4 computations have to run on the shader core, Nvidia can do it in parallel.
FP16 is super fast, TF32/BF16 could be supported.
With the sparsity feature it get's even crazier.
16/x

It could be an area where Nvidia makes some cuts on the gaming variant.
Less units and less features, because that hardware is of course not coming for free.
But even if Nvidia would half it, the throughput and capabilities would still vastly overshadow Navi21 abilities.
17/x.

On Hot Chips Microsoft shared some RDNA2 details.
Basically confirming a ray tracing patent from AMD, which extended the texture mapping units with an intersection engine for ray tracing computations.
But it doesn't work in parallel.
You do Texture ops or Ray Ops.
18/x

I don't think there are implementation details out there for Nvidia's Ray Tracing solution but looking at the rendering graphs from Nvidia I believe the ray tracing hardware can work in parallel.
Developers should be able to confirm/deny it, it should be easy to test.
19/x

Even if absolute performance and perf/watt is the same between Ampere and RDNA2, I think buyers will get with better products from team Green.
There is simply a better featureset behind and Nvidia's software ecosystem is a strong backbone.
AMD still has to catch up a lot.
20/20

Try unrolling a thread yourself!

More from @Locuza_ see all

Embed code for your website

Did Thread Reader help you today?