(1/53)Hi everybody, for my 2nd major post I will be covering the complete memory hierarchy, how games utilize RAM, and how it all applies to consoles, specifically Xbox.
(2/53)Like last time, I don't own the specific hardware and am merely making an analysis. At this point, please leave toxic fanboying behind. Everybody needs a break from it. Take time to sit back and appreciate the hardware design process.
(3/53)So the first thing we will do is to define the full memory hierarchy. In a processor, register files are the smallest level. They are designed to take a clock cycle before they can be used.
(4/53)This close level of integration and low latency limits the number of register entries available. Over time programs needed more space, leading to the use of punch cards, which later evolved into the Blu-ray drives of today.
(5/53)These removable storage devices can be quite compact with large amounts of storage, but they are physically connected externally to the processor. This makes their transfers slower and much more delayed.
(6/53)These two opposite poles generated a middle ground: Main Memory, more commonly known as RAM. RAM, or Random Access Memory, lets a programmer access data far more quickly and with lower latency.
(7/53)This is because RAM is directly connected via the motherboard to the processor. The delay is from the distance the electrical current has to flow from the RAM chip to the Processor chip.
(8/53)Now the speed of electricity is fast, but compared to how fast processors are, a processor can still waste a lot of time waiting for data. This led to the introduction of various cache levels. Cache is localized RAM on the processor.
(9/53)Cache still has some latency to travel to the Execution Units of the processor, especially compared to register files, but it can be substantially larger in size. I will leave breaking down the full cache hierarchy for another day.
(10/53)Now the slow transfer of external storage was becoming a bottleneck. This led to 2 new media: Hard Disk Drives and Solid State Drives. HDDs physically spin while SSDs use flash storage. I might do a further writeup on just SSD and RAM.
(11/53)Now that we established all the levels let us discuss the two important metrics: bandwidth and latency. Bandwidth is the raw data transfer in bits or bytes per second.
(12/53)Latency is the delay between when a request to access memory occurs and when the data is received by the Execution Unit. The closer to the processor the higher the bandwidth and the lower the latency. Let us shift gears for a minute.
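To make the two metrics concrete, here is a minimal Python sketch of a simple model where the time to fetch data is the access latency plus the size divided by bandwidth. The tier names and every number in `TIERS` are hypothetical orders of magnitude chosen for illustration, not measured console figures.

```python
# Hypothetical tiers: (latency in seconds, bandwidth in bytes per second).
# Illustrative orders of magnitude only, not measured console figures.
TIERS = {
    "cache": (1e-9, 1000e9),
    "ram": (100e-9, 200e9),
    "ssd": (100e-6, 2e9),
    "hdd": (10e-3, 0.1e9),
}


def transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple model: fixed access latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def fetch_times(size_bytes):
    """Time to fetch size_bytes from each hypothetical tier."""
    return {
        name: transfer_time_s(size_bytes, latency, bandwidth)
        for name, (latency, bandwidth) in TIERS.items()
    }


if __name__ == "__main__":
    # A small 64-byte read is dominated by latency on every tier.
    for name, seconds in fetch_times(64).items():
        print(f"{name}: {seconds:.3e} s")
```

For small reads the latency term dominates, while for large reads the bandwidth term does, which is why both metrics matter.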
(13/53)We now are going to focus on how memory is allocated in Killzone Shadow Fall because Guerrilla Games broke down how they use their memory. We will not cover everything, but all the big stuff will be covered.
(14/53)Let us focus on System Memory first. Notice that Guerrilla listed elements by the amount of RAM they used. You will notice that Sound eats the most system memory. This will likely continue for next gen.
(15/53)Havok Scratch is for game physics. Game Heap is the traditional game logic. And then there are the excess logic entities. Notice how little System RAM takes up. Video RAM takes up a lot more.
(16/53)Now what about VRAM? Notice how non-streaming textures are the biggest category. We will come back to textures later. Notice the 2nd largest is Render Targets.
(17/53)Render Targets are a collection of elements of the scene being rendered before it is put in the framebuffer. Their size is directly tied to resolution. Elements can be built with various resolutions. They do not need to match the output res.
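The link between render targets and resolution can be sketched with a bit of arithmetic. This Python example assumes a single uncompressed target at 4 bytes per pixel (an assumption for illustration; real engines use many targets in various formats).

```python
def render_target_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed render target in bytes."""
    return width * height * bytes_per_pixel


def to_mib(size_bytes):
    """Convert bytes to mebibytes."""
    return size_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Assumed format: 4 bytes per pixel (for example 8 bits per RGBA channel).
    for label, (w, h) in {
        "half-res 1080p": (960, 540),
        "1080p": (1920, 1080),
        "4K": (3840, 2160),
    }.items():
        print(f"{label}: {to_mib(render_target_bytes(w, h)):.2f} MiB")
```

Because the size scales with width times height, a 4K target costs four times what the same target costs at 1080p.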
(18/53)The 3rd Largest is the streaming pool. A streaming pool is the destination of asset streaming from your HDD/SSD. I know defining everything was pretty dry. These definitions are important to understanding the Series S.
(19/53)I rewrote it several times and decided that everybody would prefer simple definitions. If anybody wants a more detailed explanation I answer questions after I post. Now let us get to the core of this post: Xbox Series S.
(20/53)I have been seeing a lot of misinformation about the XSS. Let us just dispel the first big claim I have seen. No the XSS is not a reskinned X1X. They have completely different architectural structures.
(21/53)Next rumor I have seen is about the XSS being weaker than the X1X or that the XSS will hold back next gen games. Let us break down why both of these arguments don't have evidence. We will focus on CPU first.
(22/53)Here are the layouts of AMD Jaguar in the X1X and Zen 2 used in the XSS. Notice how Zen 2 has doubled the ALUs over Jaguar. Zen 2 also has twice as many registers in its register file to keep up with the doubling in performance.
(23/53)The L1 and L2 cache are identical in size, but have increased bandwidth. Zen 2 also has an L3 cache on top of that. Remember what I said about memory hierarchy?
(24/53)The increase in bandwidth and the lowering of latency have massively increased memory efficiency for the CPU over the X1X. This does not even include the higher CPU clock speed.
(25/53)Now a similar thing happens for the GPU. Some people see the X1X has 6TFLOPs vs XSS 4TFLOPs. These are the peak shader throughput figures of these consoles. There are bottlenecks that will prevent this peak throughput.
(26/53)The new GPU architecture used in these next gen consoles has decreased the latency per operation. RDNA also has twice as many registers per CU over the X1X. Then there is a completely different cache hierarchy.
(27/53)In GCN there is an L1 cache shared between 4CUs. RDNA has renamed it L0 and there is now one for every CU. They replaced the L1 cache with a new one applying to every 10CUs. L2 cache will likely be the same.
(28/53)I am assuming that the XSS is the RDNA2 version of Navi 14 pictured. That means the L2 cache has far higher bandwidth than the X1X even if they have the same size. This means the GPU architecture will perform way better.
(29/53)This is before we consider VRS, Ray Tracing, DirectML, Mesh Shading, and Sampler Feedback Streaming. Now let us move onto RAM hierarchy. The X1X has 12GB GDDR5 while the XSS has 10GB GDDR6.
(30/53)The memory configuration is important. The X1X has 12 1GB Modules. The XSS uses a clamshell design splitting RAM into 4 levels. 2 levels have 4GB in them. The remaining 2GB has its own levels and is dedicated to the OS.
(31/53)Now some people are suggesting that the XSS has too low bandwidth. This is actually not true at all. The GPU effective bandwidth of the XSS is higher than the X1X. Although X1X has 12 1GB modules, the GPU does not access all of them.
(32/53)Believe it or not memory is an architectural design branch of Computer Engineering. It has its own metrics in hit/miss rate, instruction latency, and probability distribution of memory accesses. As you can guess GDDR6 is a more efficient memory architecture than GDDR5.
(33/53)Another key factor of memory bandwidth is how much GPU throughput it has to feed. The XSS may have lower memory bandwidth compared to PS5/XSX, but its GPU compute is lower by an even larger margin, so it has more bandwidth per unit of compute.
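The bandwidth-to-compute ratio can be checked with a quick Python sketch. The figures in `CONSOLES` are the publicly quoted peak numbers (fast memory pool bandwidth and peak FP32 TFLOPs); apart from the 4 TFLOPs for the XSS they are not stated in this thread, so treat them as assumptions.

```python
# Assumed peak specs: (memory bandwidth in GB/s, GPU compute in TFLOPs).
# Publicly quoted figures, not taken from this thread; the Xbox values
# use the faster memory pool only.
CONSOLES = {
    "XSS": (224.0, 4.0),
    "XSX": (560.0, 12.15),
    "PS5": (448.0, 10.28),
}


def gb_per_s_per_tflop(bandwidth_gb_s, tflops):
    """Memory bandwidth available per TFLOP of peak GPU compute."""
    return bandwidth_gb_s / tflops


def bandwidth_ratios():
    """Bandwidth per TFLOP for every console in CONSOLES."""
    return {
        name: gb_per_s_per_tflop(bandwidth, tflops)
        for name, (bandwidth, tflops) in CONSOLES.items()
    }


if __name__ == "__main__":
    for name, ratio in bandwidth_ratios().items():
        print(f"{name}: {ratio:.1f} GB/s per TFLOP")
```

Under these assumed peak figures the XSS ends up with the most bandwidth per TFLOP of the three. Peak numbers ignore real-world bottlenecks, so this is only a rough comparison.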
(34/53)If the XSS has memory bandwidth issues, then the XSX and PS5 have far worse issues with it. There is an issue with the XSS though. RAM size is pretty low. So how is Xbox planning to overcome this issue?
(35/53)The SSD is the key to this. This is where the Killzone Shadow Fall memory breakdown is important. Notice the streaming pool size. This was on an HDD. The XSS SSD is substantially faster. This allows the streaming pool to be larger.
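One way to see why drive speed matters for the streaming pool is to work out how much data can arrive per frame and how long a full pool refill takes. In this Python sketch the HDD throughput (100 MB/s) and pool size (512 MB) are hypothetical, and the SSD throughput (2400 MB/s) is the publicly quoted raw figure for the Series consoles, not a number from this thread.

```python
def streamed_mb_per_frame(throughput_mb_per_s, fps):
    """Data that can be streamed in during a single frame."""
    return throughput_mb_per_s / fps


def refill_time_s(pool_mb, throughput_mb_per_s):
    """Time to replace the entire streaming pool at a given throughput."""
    return pool_mb / throughput_mb_per_s


if __name__ == "__main__":
    HDD_MB_S = 100.0   # hypothetical HDD throughput
    SSD_MB_S = 2400.0  # assumed raw SSD throughput (publicly quoted figure)
    POOL_MB = 512.0    # hypothetical streaming pool size

    for name, speed in (("HDD", HDD_MB_S), ("SSD", SSD_MB_S)):
        per_frame = streamed_mb_per_frame(speed, fps=30)
        refill = refill_time_s(POOL_MB, speed)
        print(f"{name}: {per_frame:.2f} MB per frame, pool refill in {refill:.2f} s")
```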
(36/53)There is more though. Sampler Feedback Streaming makes it easier to have in RAM only the texture parts that are actually needed. The next section will be breaking down textures. We need to define texture resolution.
(37/53)A big problem in the industry is how we define "4K textures". This term is fully subjective and it changes from dev to dev. This is a big issue as it makes it hard to follow. So I will define my own meaning for texture resolution.
(38/53)When I say 2K textures I mean 2048x2048, 4K textures are 4096x4096, 8K textures are 8192x8192. This is just a simplification, so now let me introduce an example to add clarity. Here is a 2K texture:
(39/53)Textures are built to apply to a portion or a complete model. I have seen some discussion of whether you can see 4K textures on a 1080p screen. You can because the textures wrap around objects and you can get closer to them.
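To put the 2K/4K/8K definitions in memory terms, here is a small Python sketch. It assumes uncompressed textures at 4 bytes per texel, which is an assumption for illustration; shipped games normally use block compression, so real sizes are smaller.

```python
def texture_bytes(side, bytes_per_texel=4, with_mips=False):
    """Size of a square texture, optionally including its full mip chain."""
    total = 0
    while True:
        total += side * side * bytes_per_texel
        if not with_mips or side == 1:
            break
        side //= 2  # each mip level halves the side length
    return total


if __name__ == "__main__":
    for name, side in (("2K", 2048), ("4K", 4096), ("8K", 8192)):
        base = texture_bytes(side) / (1024 * 1024)
        mips = texture_bytes(side, with_mips=True) / (1024 * 1024)
        print(f"{name}: {base:.0f} MiB base level, {mips:.1f} MiB with mips")
```

Each step up in texture resolution quadruples the memory cost, and the full mip chain adds roughly a third on top of the base level.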
(40/53)Appropriate texture filtering can make the visual fidelity higher as a result. If you notice how much RAM textures take up, you realize how much SFS will save. Shoutout to @Gavavva for the drawing.
(41/53)Thanks to his drawing, we can highlight a key point of SFS. The red area is facing forward on screen. Everything outside of that is a wasteful area of the texture to load with our memory hierarchy here.
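The idea of keeping only the needed part of a texture in RAM can be sketched as follows. This is a conceptual Python illustration, not the actual Sampler Feedback Streaming API: the tile size, the texel format, and the rectangular "visible" region are all hypothetical.

```python
def visible_tiles(x0, y0, x1, y1, tile_side):
    """Tiles touched by the texel rectangle [x0, x1) x [y0, y1)."""
    return {
        (tx, ty)
        for ty in range(y0 // tile_side, (y1 - 1) // tile_side + 1)
        for tx in range(x0 // tile_side, (x1 - 1) // tile_side + 1)
    }


def resident_bytes(tiles, tile_side, bytes_per_texel=4):
    """Memory needed if only the given tiles are kept in RAM."""
    return len(tiles) * tile_side * tile_side * bytes_per_texel


if __name__ == "__main__":
    SIDE = 4096  # a "4K" texture as defined in this thread
    TILE = 256   # hypothetical tile size in texels
    # Hypothetical on-screen region, standing in for the red area in the drawing.
    tiles = visible_tiles(1024, 1024, 2048, 2048, TILE)
    full = SIDE * SIDE * 4
    needed = resident_bytes(tiles, TILE)
    print(f"{len(tiles)} tiles, {needed / full:.1%} of the full texture")
```

In this made-up case only 16 of the 256 tiles are needed, so the rest of the texture never has to be loaded.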
(42/53)Now, with all these clear improvements, why is the XSS not running X1X Backwards Compatibility? It comes down to how X1X games are coded. Malloc or memory allocation for the X1X will be different compared to XSS.
(43/53)The XSS has 2GB less RAM, so unchanged code will simply break BC support. However, if a dev patches a game they could likely have higher detail than the X1X. Honestly this is the same scenario for the PS5/XSX.
(44/53)Now onto the final part. Will the XSS hold back games? Let us do an analysis. The simple answer would be to lower resolution on the XSS. However, we may run into RAM issues compared to the PS5/XSX.
(45/53)We are going to shrink 2 areas: Textures and Render Targets. Textures make a lot of sense. The lower resolution of the XSS will make higher quality textures not pop as much. This would be a minimal downgrade in visual fidelity.
(46/53)Now what about render targets? Well these are largely particle effects, shadow quality, and post processing. Lowering the resolution of these effects will conserve even more RAM. At worst case devs can also drop LOD settings.
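The scaling approach can be sketched as a memory budget that shrinks only the categories that scale with resolution. In this Python example the category sizes and the available budget are made-up numbers, and the proportional shrink is just one possible policy, not how any particular engine does it.

```python
def plan_fits(plan_mb, available_mb):
    """True if the sum of all categories fits in the available memory."""
    return sum(plan_mb.values()) <= available_mb


def scale_plan(plan_mb, scalable, available_mb):
    """Shrink only the scalable categories by a common factor so the plan fits."""
    fixed = sum(size for name, size in plan_mb.items() if name not in scalable)
    flexible = sum(size for name, size in plan_mb.items() if name in scalable)
    if fixed > available_mb:
        raise ValueError("fixed categories alone exceed the available memory")
    if flexible == 0:
        return dict(plan_mb)
    factor = min(1.0, (available_mb - fixed) / flexible)
    return {
        name: size * factor if name in scalable else size
        for name, size in plan_mb.items()
    }


if __name__ == "__main__":
    # Hypothetical budget in MB for a larger console.
    plan = {"textures": 3000, "render_targets": 800, "streaming_pool": 600, "other": 600}
    scaled = scale_plan(plan, {"textures", "render_targets"}, available_mb=4000)
    for name, size in scaled.items():
        print(f"{name}: {size:.0f} MB")
```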
(47/53)The only reason a dev would be held back would be due to them not having the resources to build any scalable graphics. And if a dev does not have that then it is highly unlikely they are pushing the visual envelope.
(48/53)At this point I would like to do a PSA: programmers have a lot of stress in their lives and this problem extends into game dev culture. If a dev disagrees with your opinion, attacking them with toxicity is not a good argument.
(49/53)Attack someone's ideas, but never attack the people. If you think they are biased, being a toxic fanboy is not going to make them like your brand more. I have seen a bunch of devs attacked this week. Everybody just chill.
(50/53)Also performing analysis on computer hardware is hard enough as it is. Most typical computer science programs will take 1 to 2 courses on HW. Asking a programmer to do this could be quite difficult.
(51/53)Final lesson: here is Bloom's Taxonomy of Learning. As you move up the pyramid, each level gets harder to master. Trust me, I would rate my Synthesis and Evaluation skills at a 5/10 and 4/10 respectively for Architectural Design.
(52/53)It took me a long time to master analysis. At this point I issue an apology. I was telling people I was targeting a Sunday release and fell behind in edits until today. I will do better with the next one on power consumption and the PS5.
(53/53)As always I will answer any questions people have. Thanks for the read and go chill and play your favorite game if things get heated.
