Locuza
Oct 24, 2021 · 17 tweets · 10 min read
I tried really hard not to make a multipart video series again, but it ended up being ~1 hour long...
I had to cut it; the first part is now online, covering what I've worked on since August.

Well here it is, Intel's DG2 Alchemist vs. AMD N22 and NV GA104.

1/x 🧵
The first video part only shows theoretical throughput comparisons and how Intel, AMD and Nvidia scale their GPU configurations.
But first a bit of history: in 1998 Intel released its first dGPU, the i740, and it would be the last one until DG1 in 2021...

2/x
Over the years, one vendor after another left the discrete graphics market, until for nearly 20 years only ATI/AMD and Nvidia graphics cards remained!
However, DG1 was not a liberating blow that got customers excited.
It's only available in pre-built systems, with GT 1030-level performance.

3/x
But DG2 gets us excited!
This is now a serious product being deployed for a wide range of mobile and desktop products.
We can look forward to two different DG2 Alchemist chips (not powered by the Fullmetal Alchemist though 😜)

4/x
Based on Intel's architecture slide, the high-level blocks for DG2 appear to be very similar to Xe LP.
256-bit vector engines = SIMD8 units.
Two vector engines are drawn as a pair, most likely sharing the thread control unit again.
New is a 1024-bit Matrix Engine.

5/x
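The SIMD-width arithmetic behind these labels is straightforward; a quick sketch (the engine widths come from Intel's slide, the element sizes are the standard ones):

```python
# SIMD lanes = engine width in bits / element width in bits
def simd_lanes(engine_bits: int, element_bits: int) -> int:
    return engine_bits // element_bits

# 256-bit vector engine on 32-bit floats -> SIMD8
print(simd_lanes(256, 32))   # 8 FP32 lanes
# The same engine on 16-bit data would be SIMD16
print(simd_lanes(256, 16))   # 16 FP16 lanes
# 1024-bit matrix engine on 8-bit integers
print(simd_lanes(1024, 8))   # 128 INT8 elements side by side
```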
There will be a table with many assumptions on my side, which may turn out to be wrong.
For example, I assume that Xe HPG supports the same data inputs per matrix engine as Xe HPC:
INT8, FP16, BF16 and TF32.
I also assume that Intel has fast FP32 accumulation at the end.

6/x
Something which Nvidia doesn't have for Turing and client Ampere.
So I end up with the following throughput numbers per Render Slice, Shader Engine and Graphics Processing Cluster (per clock).
I find it cool to see how the companies roughly scale their GPUs.

7/x
Everything in one table.
Per high-level module, Intel could have twice the FP16/BF16/TF32 throughput if they execute FP32 accumulation at full rate; if not, it's the same as on Nvidia.
On paper that would still be great, of course.
(Do I hear sad red noises?)

8/x
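The doubling-per-halved-precision logic the table relies on can be sketched like this. This is a toy model under the thread's assumptions (one MAC per lane per clock, full-rate FP32 accumulation), not Intel's documented rates; TF32 is omitted because its 19-bit format doesn't fit a simple bits-per-lane model:

```python
# Toy model: a W-bit matrix engine fits W/E lanes of E-bit inputs;
# each lane does one multiply-accumulate (2 ops) per clock.
# Assumes full-rate FP32 accumulation; if accumulation runs at
# half rate (as on Turing / client Ampere), halve the FP numbers.
ENGINE_BITS = 1024

def macs_per_clock(input_bits: int) -> int:
    return ENGINE_BITS // input_bits

for dtype, bits in [("INT8", 8), ("FP16", 16), ("BF16", 16)]:
    print(f"{dtype}: {macs_per_clock(bits)} MACs/clock = "
          f"{2 * macs_per_clock(bits)} ops/clock per matrix engine")
```

Halving the input width doubles the rate, which is why INT8 sits at 2x the FP16/BF16 numbers in the table.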
Like Nvidia's RT Cores, Intel's Ray Tracing Units support BVH traversal in hardware.
Imagination would classify this as a Level 3 ray tracing solution.
AMD's Ray Accelerators do not handle BVH traversal; only ray/box and ray/triangle intersections are accelerated = Level 2.

9/x
It's currently not known how many ray intersections Intel's RT Units can compute per clock; however, with hardware BVH traversal, Intel's RT performance should be quite a bit above AMD's, even though the unit count is lower.
Nvidia already shows strong performance with Turing vs. AMD's RDNA2.

10/x
Now we can look at how many high-level modules are used per chip.
DG2-512 goes ham with 8 Render Slices.
N22 uses 2 Shader Engines (but also with the highest throughput per module).
GA104 uses 6 Graphics Processing Clusters.

11/x
A look at the comparison table (per-clock throughput only) suggests that Navi22 does not belong there; it is in a completely different class.
Its 3D hardware is by far the weakest, as is its FP32 throughput.

DG2-512, by contrast, presents itself as a rasterization monster.

12/x
Obviously we have to consider the clock, which is a point where Navi22 excels.
For DG2 I took 1.8 GHz (mobile) and >2.2 GHz (desktop).
For AMD's N22: 2.531 GHz (avg. clock across 16 games at 1440p).
For GA104: 1.878 & 1.920 GHz (avg. clock across 17 games at 1440p).

13/x
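Folding those clocks into paper FP32 TFLOPS makes the picture concrete. The formula is 2 ops (FMA) x FP32 lanes x clock; the lane counts are my own commonly cited product configs, not from the thread (DG2-512 = 512 vector engines x 8 lanes, 6700 XT/N22 = 40 CUs x 64, RTX 3070/GA104 = 46 SMs x 128):

```python
# Paper FP32 throughput: 2 ops (FMA) x FP32 lanes x clock in GHz.
def fp32_tflops(lanes: int, clock_ghz: float) -> float:
    return 2 * lanes * clock_ghz / 1000

for name, lanes, clk in [("DG2-512 @ 2.2 GHz", 4096, 2.2),
                         ("N22     @ 2.531 GHz", 2560, 2.531),
                         ("GA104   @ 1.878 GHz", 5888, 1.878)]:
    print(f"{name}: {fp32_tflops(lanes, clk):.1f} TFLOPS")
```

N22's high clock narrows the gap on paper, but GA104 still leads clearly in raw FP32.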
Now we have the final paper comparison.
Thanks to 2.5 GHz on N22, it comes much closer to the GA104, which runs at only 1.9 GHz.
Still, many would probably say that the GA104 should lead by a strong margin.
Its rasterization and FP32 throughput are much higher.

14/x
But as many know, paper specs can be far away from real performance.
The 3070 is just 10% faster than AMD's 6700 XT in a benchmark run by ComputerBase.
The 3070 Ti wins by 16%, but it goes berserk on power consumption, memory (GDDR6X) and price (on paper :P).
We will likely..

15/x
..have a similar situation with DG2-512, which Intel itself previously positioned at 6700 XT/3070 level.
On paper it's a real monster:
8 Geo+Raster units with a pixel fillrate of 128 per clock.
Not even AMD's and Nvidia's high-end chips are that wide:
4 Geo + 128 px/clk on N21, 7 Geo + 112 px/clk on GA102.

16/x
On paper, DG2-512 beats GA104 in nearly every metric by over 50%:
be it triangle, pixel, texel or matrix INT8 throughput (the last one used by XeSS/DLSS).
___
So yeah, I'm really looking forward to low-level benchmarks and to seeing how DG2 behaves. 😀

17/17
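The ">50%" claim is easy to sanity-check for the pixel fillrate from the per-clock widths and clocks above. DG2-512's 128 px/clk is from the thread; GA104's 96 px/clk is my own commonly cited figure, so treat it as an assumption:

```python
# Pixel fillrate in Gpix/s = pixels per clock x clock in GHz.
def fillrate_gpix(px_per_clk: int, clock_ghz: float) -> float:
    return px_per_clk * clock_ghz

dg2 = fillrate_gpix(128, 2.2)      # 281.6 Gpix/s
ga104 = fillrate_gpix(96, 1.878)   # ~180.3 Gpix/s
print(f"DG2-512 lead over GA104: {dg2 / ga104 - 1:.0%}")
```

With these inputs the lead comes out around 56%, consistent with the "over 50%" figure.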

