More from @SebAaltonen

Sebastian Aaltonen

@SebAaltonen

Apr 23

First draft of a job system API.

LaunchJob() schedules a job. A job becomes eligible for execution once every job in the given array of existing jobs has finished. The scheduler only runs jobs that don't have resource conflicts: each resource allows 1x ReadWrite or multiple ReadOnly accesses...

Resources are 64 bit void pointers. Job body is declared as a lambda function. It can capture data from the scope. It's common to capture the resources and some other data by value. Can also capture pointer to temp allocated storage (linear frame allocator).

If you want to run an ECS system on top of this, each component table in the ECS system would be a resource (64 bit pointer to an object). ReadWrite access ensures that only one job can run at once modifying the same component table. Multiple ReadOnly accesses run concurrently.
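A minimal sketch of the conflict rule described above, not the actual API from the thread: `Access`, `ResourceUse` and `conflicts` are hypothetical names. Resources are treated as opaque pointers, and two jobs conflict when they touch the same resource and at least one of them asks for ReadWrite.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names for illustration; the real API is not shown in the thread preview.
enum class Access { ReadOnly, ReadWrite };

struct ResourceUse {
    const void* resource;  // opaque handle, e.g. pointer to an ECS component table
    Access access;
};

// Two jobs conflict if they share a resource and at least one access is ReadWrite.
inline bool conflicts(const std::vector<ResourceUse>& a,
                      const std::vector<ResourceUse>& b) {
    for (const ResourceUse& ua : a)
        for (const ResourceUse& ub : b)
            if (ua.resource == ub.resource &&
                (ua.access == Access::ReadWrite || ub.access == Access::ReadWrite))
                return true;
    return false;
}

// Usage: two readers of the same table may overlap, a writer may not.
inline void conflictExample() {
    int positions = 0, velocities = 0;  // stand-ins for component tables
    std::vector<ResourceUse> readerA{ResourceUse{&positions, Access::ReadOnly}};
    std::vector<ResourceUse> readerB{ResourceUse{&positions, Access::ReadOnly}};
    std::vector<ResourceUse> writer{ResourceUse{&positions, Access::ReadWrite},
                                    ResourceUse{&velocities, Access::ReadOnly}};
    assert(!conflicts(readerA, readerB));
    assert(conflicts(readerA, writer));
}
```

A real scheduler would run this check against the jobs currently in flight before starting a new one. The nested loop is only meant to show the rule, not an efficient implementation.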

Sebastian Aaltonen

@SebAaltonen

Apr 23

Let's have a thought experiment: All future gaming CPUs will be like 5800X3D. CPU vendors will ensure that game working sets fit in LLC. You pay main memory bandwidth only when temporal coherence is not perfect.

https://twitter.com/SebAaltonen/status/1517801311009390592?s=20&t=lHj4uuFcDQ0Pf8loi6KMHQ

What would this mean for OOP and DoD?

Let's start with the most discussed OOP performance flaws: Pointer chains and partial cache lines. Pointer targets are in LLC since they were also accessed last frame. Partial cache line reads are not as bad either, since the remaining data will be read later this frame (still in cache).

You still pay extra cost for pointer access and partial cache line reads. LLC is slower than L1$, but the extra cost is now much more manageable. Processing 1 object at a time also requires more setup and can't be vectorized efficiently. These problems remain.
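To illustrate the "one object at a time" cost that remains, here is a rough sketch contrasting a pointer-chasing OOP update with a DoD-style loop over contiguous arrays. The `Particle` and `ParticleTable` types are hypothetical. Both loops compute the same result, but only the second is a plain contiguous loop that a compiler can readily auto-vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// OOP style: each object lives behind its own pointer.
struct Particle {
    float x = 0.0f;
    float vx = 0.0f;
    void update(float dt) { x += vx * dt; }
};

inline void updateObjects(std::vector<std::unique_ptr<Particle>>& objects, float dt) {
    for (auto& p : objects) p->update(dt);  // one pointer dereference per object
}

// DoD style: one contiguous array per field.
struct ParticleTable {
    std::vector<float> x;
    std::vector<float> vx;
};

inline void updateTable(ParticleTable& t, float dt) {
    const std::size_t n = t.x.size();
    for (std::size_t i = 0; i < n; ++i) t.x[i] += t.vx[i] * dt;  // contiguous loads and stores
}

// Both layouts must produce the same positions.
inline void layoutsAgree() {
    std::vector<std::unique_ptr<Particle>> objects;
    ParticleTable table;
    for (int i = 0; i < 8; ++i) {
        objects.push_back(std::make_unique<Particle>());
        objects.back()->vx = static_cast<float>(i);
        table.x.push_back(0.0f);
        table.vx.push_back(static_cast<float>(i));
    }
    updateObjects(objects, 0.5f);
    updateTable(table, 0.5f);
    for (std::size_t i = 0; i < table.x.size(); ++i)
        assert(objects[i]->x == table.x[i]);
}
```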

Sebastian Aaltonen

@SebAaltonen

Apr 21

Subpass test: RGBA8 lit buffer + 3x RGBA8 G-buffers (filled with RGB) + "lighting shader" (=shows different G-buffer for each 16x16 region) + forward transparency on top. All in the 1 renderpass (using Vulkan subpasses).

60 fps (low end Androids) and all devices are still cool.

The Xiaomi Redmi phone in the middle doesn't seem to have FPS counter in developer menu. Also I can't find a way to increase display shut down timer on that device.

Will add G-buffer decals next (blended to 3x MRT) and more transparencies to add overdraw. Hopefully it's still 60 fps at that point.

Then I'll add loops to the triangle draw calls to simulate increased load from increased draw call counts (G-buffer, decals, transparencies).

Sebastian Aaltonen

@SebAaltonen

Apr 18

I would really like to have VRS on mobile, because it allows you to render at multiple resolutions in the same tile buffer, without having to resolve data to main memory.

On Xbox 360 we experimented with rendering particles using 4xMSAA on top of 1 sample buffer...

We just aliased a 4xMSAA target on top of the 1 sample target in the EDRAM and rendered particles this way. Same for Z buffer. It kind of worked. But the pattern was unfortunately messy: 4xMSAA pixels didn't form nice 2x2 quads in screen space. Z test also worked :)

4xMSAA particles were faster than half res particles, and had pixel perfect Z test. Didn't need an extra resolve + combine for half res particles. Nowadays with VRS you can finally do the same in a proper way.

Sebastian Aaltonen

@SebAaltonen

Apr 18

Decima Engine (Horizon Forbidden West) is using the same XOR trick I described a few months ago in my V-buffer SV_PrimitiveID Twitter threads. We of course found it independently.

PC implementation uses SM 6.1 GetAttributeAtVertex.

github.com/microsoft/Dire…

This technique is good because it works on hardware that doesn't guarantee leading vertex order. XOR is order independent.

I measured that the performance of this technique was identical to baseline (no primitive ID), so it's as fast as it gets. Unfortunately it requires SM 6.1.

The original thread:

https://twitter.com/SebAaltonen/status/1480890520884944900?s=20&t=RB8vqV_8U9GJ7suEsTkJuQ

After that I posted several threads with performance analysis of various V-buffer SV_PrimitiveID approaches. Leading vertex is the best fit for HW that guarantees leading vertex order. Otherwise you want to use the XOR trick (SM 6.1).
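The order-independence claim can be checked with a tiny CPU-side sketch. It assumes, hypothetically, that each vertex carries an integer payload chosen so that XORing a triangle's three payloads yields its primitive ID. How the original threads or Decima assign those payloads is not shown here, and `payloadForFreeVertex` is an illustrative helper only. In a pixel shader the three payloads would be fetched with GetAttributeAtVertex.

```cpp
#include <cassert>
#include <cstdint>

// Combine three per-vertex payloads into a primitive ID.
// XOR is commutative and associative, so vertex order does not matter.
inline std::uint32_t primitiveIdFromPayloads(std::uint32_t v0, std::uint32_t v1,
                                             std::uint32_t v2) {
    return v0 ^ v1 ^ v2;
}

// Hypothetical helper: given the payloads of two vertices, pick the payload of
// the remaining vertex so that the triangle's XOR equals primitiveId.
inline std::uint32_t payloadForFreeVertex(std::uint32_t primitiveId, std::uint32_t a,
                                          std::uint32_t b) {
    return primitiveId ^ a ^ b;
}

inline void xorOrderExample() {
    const std::uint32_t a = 0x1234u, b = 0xBEEFu;
    const std::uint32_t c = payloadForFreeVertex(42u, a, b);
    // Any rotation or permutation of the vertices yields the same ID.
    assert(primitiveIdFromPayloads(a, b, c) == 42u);
    assert(primitiveIdFromPayloads(c, a, b) == 42u);
    assert(primitiveIdFromPayloads(b, c, a) == 42u);
}
```

A leading vertex scheme would instead read the ID from one specific vertex, which is why it depends on the hardware guaranteeing which vertex that is.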

Sebastian Aaltonen

@SebAaltonen

Apr 17

Let's talk about storing vertices in optimal memory footprint.

I have seen people using full fat 32 bit floats in their vertex streams, but that's a big waste, especially on mobiles. All our 60 fps Xbox 360 games used a 24 byte layout with a mixture of 16 bit and 10 bit data.

The most common optimization is to avoid storing the bitangent and reconstruct it with a cross product. However it's worth noting that UV mirroring causes the bitangent sign to flip. You need to store the mirror bit somewhere. RGB10A2 offers you a nice 2 bit alpha for the sign.
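A minimal sketch of the reconstruction, assuming the bitangent is `cross(normal, tangent)` scaled by a +1/-1 mirror sign. The operand order and handedness vary between engines, so treat that convention and the names `Vec3` and `reconstructBitangent` as assumptions.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// mirrorSign is +1 or -1, decoded from the stored mirror bit.
inline Vec3 reconstructBitangent(Vec3 normal, Vec3 tangent, float mirrorSign) {
    Vec3 b = cross(normal, tangent);
    return {b.x * mirrorSign, b.y * mirrorSign, b.z * mirrorSign};
}

inline void bitangentExample() {
    const Vec3 n{0.0f, 0.0f, 1.0f};
    const Vec3 t{1.0f, 0.0f, 0.0f};
    assert(reconstructBitangent(n, t, 1.0f).y == 1.0f);    // regular UVs
    assert(reconstructBitangent(n, t, -1.0f).y == -1.0f);  // mirrored UVs flip the sign
}
```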

10 bits per channel is enough for the normal/tangent/bitangent. Some visually impressive last gen AAA games have shipped with rgb888 normal in the G-buffer. If that's enough for your mobile game, then RGB10 is enough in your vertex for all of these 3.
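As a sketch of what such a packed attribute could look like on the CPU side: three components in [-1, 1] are quantized to 10 bits each and the 2-bit alpha carries the mirror flag. The bit layout (x in the low bits, alpha in the top two) and the choice of alpha bit 0 for the mirror flag are assumptions for illustration, as are all function names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map [-1, 1] to a 10-bit unsigned value (0..1023).
inline std::uint32_t toUnorm10(float v) {
    const float clamped = std::fmin(std::fmax(v, -1.0f), 1.0f);
    return static_cast<std::uint32_t>(std::lround((clamped * 0.5f + 0.5f) * 1023.0f));
}

// Inverse mapping; only the low 10 bits of the argument are used.
inline float fromUnorm10(std::uint32_t bits) {
    return (static_cast<float>(bits & 1023u) / 1023.0f) * 2.0f - 1.0f;
}

// Assumed layout: x in bits 0-9, y in 10-19, z in 20-29, 2-bit alpha in 30-31.
inline std::uint32_t packRGB10A2(float x, float y, float z, std::uint32_t alpha2) {
    return toUnorm10(x) | (toUnorm10(y) << 10) | (toUnorm10(z) << 20) |
           ((alpha2 & 3u) << 30);
}

// Assumption: alpha bit 0 set means the UVs are mirrored.
inline bool mirrored(std::uint32_t packed) { return ((packed >> 30) & 1u) != 0u; }

inline void packExample() {
    const std::uint32_t p = packRGB10A2(0.0f, 0.0f, 1.0f, 1u);
    assert(mirrored(p));
    assert(fromUnorm10(p >> 20) == 1.0f);  // z survives the round trip exactly
}
```

With this mapping the quantization step is 2/1023, so a decoded component is off by at most about 0.001 before renormalization.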
