Locuza
28 Aug, 28 tweets, 10 min read
And @FritzchensFritz said, "Let there be light!" and there was light:
flickr.com/photos/1305612…

Now that the world has an incredibly high-quality PS5 die shot, I am revisiting my previous annotations, and some crucial aspects are different than I thought.


1/x 🧵
It was premature of me to claim that Sony likely cut the FP pipes from 256b to 128b based on totally dark rectangles.
I should have worded it with much more uncertainty, because some people, and some reports, treat it as fact.
The custom FPU on the PS5...

2/x
supports the same instructions as a vanilla Zen2 core (The 4700S is using the PS5 SoC):
bodnara.co.kr/bbs/article.ht…

Some parts of the execution logic and the FP-Scheduler appear to be the same.
And even the FP register file might be...


3/x
mostly (or even fully?) intact, if AMD was able to compress the design on a smaller area as @lamchester suggested:


Now to the question that console warriors in particular are burning to have answered: does this affect game performance?

4/x
Microbenchmarks of the 4700S SKU might shed some light on that. However, based on the final frametimes, which depend on thousands of different factors anyway, I think it's fair to say that this doesn't matter.
There is no significant (dis/)advantage for either console.

5/x
But after all, why did Sony request a customized FPU design and invest the time and effort?
I wondered about that in the beginning and I still wonder about it now.
With significant cuts on the FPU side, two arguments come to mind: area and thermal density.
Maybe

6/x
it was done to fit into a specific floor plan design?
I think that is unlikely. There is even empty space on the left and right side of the Zen2 CCXes, so the FPUs could have been designed even a bit larger.
Moreover, the GPU hw instances are more flexible with their dimensions.

7/x
Various GPU instances from AMD exist, with differences in length and height, not something you usually see on the CPU side.
Comparing the PS5 to Navi10 and the XSX, it has the tallest WGP design with the shortest length, so it appears as if the whole design was optimized to be narrower.

8/x
However, the floor plan could have been designed differently from the beginning.
Fitting vanilla Zen2 cores wouldn't have been an issue.
__
Next idea, thermal density.
Microsoft showcased that the CPU FPUs are the worst offenders:
anandtech.com/show/16489/xbo…

9/x
With significant cuts on the FPU side, be it execution logic or registers, thermals would go down.
However, the custom FPU on the PS5 appears to have most of the logic still included, just crammed into far less space, so if anything thermal density should be even worse.

10/x
As such, I have no idea what the motivation behind the custom FPU design was. All thoughts are welcome in that regard. 😄
___
Anyway, a couple of comments on the CPU and GPU design (differences).

Both the Xbox Series X/S and the PS5 use two Zen2 CCX clusters with..

11/x
4MiB of L3$ per CCX, like AMD's Renoir design.
Outside of the FPU design, all of them look (nearly*) identical.


The motivation is clearly to save area: 16MiB of L3$ costs 16.80mm² vs. 5.64mm² for 4MiB.
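To put a number on the area argument, here is a minimal sketch that only reuses the two L3$ area figures quoted above. The variable names are made up for illustration, and the "two CCXes" multiplier simply assumes both L3$ blocks would otherwise have been the 16MiB variant.

```python
# L3$ area figures quoted in the thread (mm^2, per CCX).
L3_16MIB_MM2 = 16.80
L3_4MIB_MM2 = 5.64
CCX_COUNT = 2  # both consoles use two Zen2 CCXes

# Area saved by shipping the 4MiB block instead of the 16MiB block.
saved_per_ccx = L3_16MIB_MM2 - L3_4MIB_MM2
saved_total = saved_per_ccx * CCX_COUNT

# How much larger the 16MiB block is relative to the 4MiB block.
area_ratio = L3_16MIB_MM2 / L3_4MIB_MM2

print(f"saved per CCX: {saved_per_ccx:.2f} mm^2")
print(f"saved for {CCX_COUNT} CCXes: {saved_total:.2f} mm^2")
print(f"16MiB block is {area_ratio:.2f}x the area of the 4MiB block")
```

So a quarter of the capacity costs roughly a third of the area, and across both CCXes the smaller L3$ frees up a bit over 22mm².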


12/x
All of them use the same underlying interconnect infrastructure: the two CCXes are connected to the Scalable Data Fabric (part of the Infinity Fabric marketing), Cache Coherent Masters have the memory map, and Coherent Slaves are responsible for cache coherency.

13/x
Xbox Series X Hot Chips presentation:
tomshardware.com/news/microsoft…

Zen2 optimization guide:
gpuopen.com/wp-content/upl…

There is no unified cache, as some rumors claimed beforehand (Zen2 cores with a Zen3-like L3 cache design), and cross-CCX latency is huge:
anandtech.com/show/15708/amd…

14/x
GPU:
XSX has physically 28 WGPs (3584 "Shader Cores") but only 26 are active (3328 cores).

PS5 has physically 20 WGPs (2560), 18 are active on final products (2304).

Render frontend appears nearly the same.
2x Primitive Units, 2x Rasterizer per Shader Engine.
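
The "Shader Core" numbers in this tweet follow directly from the WGP counts. A minimal sketch, assuming the usual RDNA layout of 2 CUs per WGP and 64 FP32 lanes per CU (the function and dictionary names are just for illustration):

```python
LANES_PER_CU = 64  # FP32 lanes ("shader cores") per RDNA compute unit
CUS_PER_WGP = 2    # one WGP (work group processor) pairs two CUs


def shader_cores(wgps: int) -> int:
    """Number of FP32 lanes for a given WGP count."""
    return wgps * CUS_PER_WGP * LANES_PER_CU


# (physical WGPs, active WGPs) as stated in the thread.
gpus = {"XSX": (28, 26), "PS5": (20, 18)}

for name, (physical, active) in gpus.items():
    print(f"{name}: {shader_cores(physical)} physical, "
          f"{shader_cores(active)} active")
```

Both chips keep two WGPs in reserve for yield, which is 256 lanes each.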

15/x
PC RDNA2 should look different.
They only use one primitive unit and one rasterizer (rasterizer with twice the throughput).
So in that regard both the PS5 and XSX/XSS are structured like RDNA1 designs from AMD.
(Yes, RDNA"1.5" confirmed for both 🤡, sarcasm is here)

16/x
Another interesting aspect in that regard is the Render Backend.
PS5 is going ham on this, area wise.
They most likely use the older RB design with 4 Color ROPs + 16 Depth ROPs.
There are 72 cROPs in total, 64 are active (16 RBs out of 18).
Huge area footprint.

17/x
The old render backend heavily hinted towards the lack of hardware Variable Rate Shading, which is confirmed by now.
The Xbox Series X/S use the newer RB+ design, which AMD introduced for their PC RDNA2 lineup.
8 Color ROPs + 16 Depth ROPs per RB+ with hw VRS support.

18/x
The XSX uses no extra ROPs for yield: 8 RB+ instances are physically present and all of them have to work for final products.
The area footprint for ROPs appears to be nearly half in comparison to the PS5.
(Not sure if the latter can benefit from more Depth/Stencil ROPs in practice)
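
The ROP totals for both consoles can be reproduced from the per-RB figures above. This is only a sketch of that arithmetic: the class and field names are hypothetical, and the PS5 numbers rest on the "most likely" older RB design assumed in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderBackendConfig:
    physical_rbs: int       # RB instances present on the die
    active_rbs: int         # RB instances enabled on final products
    color_rops_per_rb: int
    depth_rops_per_rb: int

    def color_rops(self, active_only: bool = True) -> int:
        rbs = self.active_rbs if active_only else self.physical_rbs
        return rbs * self.color_rops_per_rb

    def depth_rops(self, active_only: bool = True) -> int:
        rbs = self.active_rbs if active_only else self.physical_rbs
        return rbs * self.depth_rops_per_rb


# Figures as stated in the thread: older RB on PS5, RB+ on XSX.
PS5 = RenderBackendConfig(physical_rbs=18, active_rbs=16,
                          color_rops_per_rb=4, depth_rops_per_rb=16)
XSX = RenderBackendConfig(physical_rbs=8, active_rbs=8,
                          color_rops_per_rb=8, depth_rops_per_rb=16)

for name, cfg in (("PS5", PS5), ("XSX", XSX)):
    print(f"{name}: {cfg.color_rops()} active color ROPs "
          f"({cfg.color_rops(active_only=False)} physical), "
          f"{cfg.depth_rops()} active depth ROPs")
```

With these inputs both consoles end up with 64 active color ROPs, while the PS5 has twice as many active depth ROPs (256 vs. 128), which is the difference the remark about Depth/Stencil ROPs refers to.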

19/x
PC RDNA2 GPUs have a new "Infinity Cache", which is just an extra cache level, in this case a larger Last Level Cache, adding 16-128MiB.
PS5 and XSX don't have another cache level, after 4MiB (PS5) or 5MiB (XSX) of L2$ you have to go to memory.

20/x
Recently the claim was floating around that the PS5 is using Infinity Fabric to access the L3$ from the CPU.
There are multiple coherency modes and the GPU can snoop CPU caches, but this applies to both consoles and AMD's hw (same infrastructure as mentioned).

21/x
However, the CPU caches are not part of the cache hierarchy of the GPU and the L3$ per CCX is only directly accessible by those cores and not other clients.
Going over the Infinity Fabric incurs extra cost and is used selectively.

22/x
Currently there are no PC RDNA2 die shots available, so we can only compare PS5/XSX vs. RDNA1 GPUs.
The WGP design looks different across all three, in relation to the SRAM placement inside various subunits.
However, XSX & PS5 share many more similarities.

23/x
But I can also mention that some subunits inside the Xbox Series S WGPs also look quite different vs. XSX and PS5 (TMUs & L0$).
What else...
The Tempest Engine (Audio accel.) is based on an AMD GPU Core, however, it removed all unnecessary hw and has no caches.

24/x
I did not see a hardware instance that jumps out with similarities on the digital logic or the SRAM macro side.
The central block with the command & geometry engine is larger than on Navi10/14.
It has multiple duplicated structures (the XSX also, but it's blurry).

25/x
Outside of a closer look at the FPU design, there is not really anything new that I noticed.
The state of knowledge is basically the same as it was when @GPUsAreMagic posted their die shot analysis of the PS5.
It displays more details than I do:


26/x
I'm done for the time being with console analysis, so I'm not going to dig into finer details, but @GPUsAreMagic is usually very good at comparing structures and making analysis images.
So keep an eye on them, if you are curious about die shot analysis.

27/x
In closing, sorry for spamming your timeline, but I hope some information was interesting.

PS: If you read something about a secret NDA block on Xbox Series, dedicated Matrix Cores, stacked chips or Ray Traversal acceleration, leave that zone immediately.
Same on PS side.

28/28

