A modern gaming CPU such as the 5950X has an achievable memory bandwidth of about 35 GB/s. At 144 fps that's 35 GB/s / 144 fps = 243 MB/frame. That's how much unique memory you can access in 1 frame.
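The arithmetic above can be written as a small helper for trying other frame rates. This is only a sketch: the function names `mbPerFrame` and `printFrameBudgets` are made up for illustration, and it uses 1 GB = 1000 MB, which matches the 243 MB figure.

```cpp
#include <cstdio>

// Upper bound on unique memory (in MB) that can be read from DRAM in one
// frame, assuming the whole frame is bandwidth bound (the optimistic case).
// Uses 1 GB = 1000 MB, which matches the 243 MB/frame figure in the text.
constexpr double mbPerFrame(double bandwidthGBs, double fps)
{
    return bandwidthGBs * 1000.0 / fps;
}

// Hypothetical helper: prints the per-frame budget for a few frame rates.
inline void printFrameBudgets(double bandwidthGBs)
{
    const double rates[] = {60.0, 120.0, 144.0, 240.0};
    for (double fps : rates)
    {
        std::printf("%5.0f fps: %6.1f MB/frame\n", fps, mbPerFrame(bandwidthGBs, fps));
    }
}
```

Calling `printFrameBudgets(35.0)` lists the budget at each rate. The 144 fps row is the 243 MB/frame from the text, and the budget shrinks as the frame rate goes up.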

The 5800X3D has 96 MB of LLC. Almost the whole working set fits in the cache. Thread...
That 243 MB/frame figure is highly optimistic. It assumes that everything you run on the CPU is bandwidth bound, and it assumes you never access the same data twice. It's common to produce data first and consume it later in the frame, so you access the same cache lines twice.
So it's likely that we already have games that fit their whole working set in the 96 MB cache of the 5800X3D, assuming the game runs at 144 fps of course. Since games are highly temporally coherent, the working set changes by only a few percent between frames.
I think this is the reason why the 5800X3D is so fast in many games. The LLC is finally large enough to contain the whole working set, especially for games designed to run at 144 fps or higher. All data accessed in the previous frame is already in the cache, improving the latency A LOT.
Since 120 Hz and 144 Hz seem to be the new normal for gaming, it's interesting to see whether the CPU vendors will continue on this path, aiming to provide caches large enough to contain the whole working set so that all of last frame's data is a cache hit.
Some games double buffer their data, some games mutate the current data directly. Mutating one set of data results in a smaller working set, but is trickier to parallelize.
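A minimal sketch of the two approaches, to make the working set difference concrete. The `Particle` type and all names here are hypothetical and not taken from any particular engine.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-entity state, for illustration only.
struct Particle
{
    float pos;
    float vel;
};

// In-place mutation: one array is both read and written.
// Working set per frame is roughly count * sizeof(Particle).
inline void updateInPlace(std::vector<Particle>& particles, float dt)
{
    for (Particle& p : particles)
    {
        p.pos += p.vel * dt;
    }
}

// Double buffering: read last frame's state from `front`, write the new
// state to `back`, then swap. Readers never see half-updated data, which
// makes parallel updates easier, but two arrays are touched every frame.
struct DoubleBuffered
{
    std::vector<Particle> front; // latest complete state
    std::vector<Particle> back;  // write target for the next state

    explicit DoubleBuffered(std::size_t count) : front(count), back(count) {}

    void update(float dt)
    {
        for (std::size_t i = 0; i < front.size(); ++i)
        {
            back[i].vel = front[i].vel;
            back[i].pos = front[i].pos + front[i].vel * dt;
        }
        std::swap(front, back); // O(1): swaps the vectors' internal buffers
    }

    // Bytes touched per frame: both buffers, so twice the in-place version.
    std::size_t bytesTouchedPerFrame() const
    {
        return (front.size() + back.size()) * sizeof(Particle);
    }
};
```

Both versions produce the same result. The double buffered one touches twice the memory per frame, and that is the part that counts against the cache.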
Frame temp allocators (bump allocators with frame lifetime) can also increase the working set, as there's a conservative assumption that all temp data lives until the end of the frame, and those memory regions are not reused during the frame.
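A minimal sketch of such a frame allocator, assuming a single thread and a fixed-size backing buffer. The class name `FrameAllocator` and its methods are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump allocator with frame lifetime (single threaded sketch).
// Nothing is freed individually, so memory handed out early in the frame
// is never reused later in the same frame, even if it is already dead.
class FrameAllocator
{
public:
    explicit FrameAllocator(std::size_t capacityBytes) : buffer_(capacityBytes) {}

    // `align` must be a power of two. Returns nullptr when out of space.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t))
    {
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);

        if (start > buffer_.size() || size > buffer_.size() - start)
        {
            return nullptr;
        }
        offset_ = start + size; // bump: the only bookkeeping needed
        return buffer_.data() + start;
    }

    // Call once at the end of the frame. The next frame starts again from
    // the beginning of the buffer, so it tends to touch the same addresses.
    void resetFrame() { offset_ = 0; }

    // Bytes consumed so far this frame, including alignment padding.
    std::size_t usedBytes() const { return offset_; }

private:
    std::vector<std::uint8_t> buffer_;
    std::size_t offset_ = 0;
};
```

Within a frame, `usedBytes()` only grows, and that is the working set cost described above. After `resetFrame()`, the next frame's allocations land on the same addresses again, so the temp memory at least stays temporally coherent between frames.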
If CPUs with massive LLCs become common, we might need to reconsider the way we implement our data processing. Minimizing the working set and ensuring the working set is temporally coherent (next frame) becomes crucial for best performance.

