Sebastian Aaltonen
Jul 19, 2023 · 18 tweets
Thread comparing V-buffer and hardware TBDR approaches.

Some of my TBDR/POSH thoughts here:
When I was designing a V-buffer style renderer in 2015, I was a bit concerned about having to run the vertex shader 3 times per pixel. People might say this is fine if you are targeting 1:1 pixel:triangle density like Nanite does, but that's cutting corners...
If you look at a regular triangle grid (like terrain or a highly tessellated object surface), you have N*N vertices and (N-1) * (N-1) * 2 triangles. Shading each vertex once and sharing the results costs N^2. Shading 3 vertices per triangle costs ~6x more. That's a significant cost.
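
Restating that arithmetic:

```latex
\underbrace{N^2}_{\text{indexed, shared shading}}
\quad\text{vs.}\quad
\underbrace{3 \cdot 2(N-1)^2 = 6(N-1)^2 \approx 6N^2}_{\text{non-indexed, 3 shades per triangle}}
\qquad\text{for large } N
```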
A common V-buffer implementation is thus equivalent to non-indexed geometry, even on 1:1 pixel:triangle geometry. So you pay the 6x overhead. Having a roughly constant number of triangles on screen is of course a massive algorithmic win, but we still don't want the 6x overhead.
A hardware TBDR implementation thus runs a POSH (position-only shader) first to bin the triangles to tiles, and then does per-tile index deduplication before running the full attribute (vertex) shader. IMR index buffer hardware is similar, so this is nothing dramatically new.
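
A minimal CPU-side sketch of what per-tile index deduplication accomplishes (all names are mine, not any specific GPU's): remap a tile's triangle indices onto a compact list of unique vertices, so each vertex is shaded once per tile.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Remap one tile's triangle index list to a compact set of unique vertices.
// uniqueVertices: indices into the mesh vertex buffer (shade each one once).
// localIndices:   3 per triangle, now pointing into uniqueVertices.
struct TileGeometry {
    std::vector<uint32_t> uniqueVertices;
    std::vector<uint32_t> localIndices;
};

TileGeometry DeduplicateTileIndices(const std::vector<uint32_t>& tileTriangleIndices) {
    TileGeometry out;
    std::unordered_map<uint32_t, uint32_t> remap; // mesh index -> local slot
    out.localIndices.reserve(tileTriangleIndices.size());
    for (uint32_t meshIndex : tileTriangleIndices) {
        auto [it, inserted] =
            remap.try_emplace(meshIndex, uint32_t(out.uniqueVertices.size()));
        if (inserted)
            out.uniqueVertices.push_back(meshIndex);
        out.localIndices.push_back(it->second);
    }
    return out;
}
```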
But there's a cost to fetching the vertex positions (and skinning matrices, etc.) and running the vertex transform twice. The index/vertex buffer abstraction is not perfect for TBDR. The same is true for mesh shaders: you want to deduplicate offline and split the mesh into small meshlets.
When you have preprocessed your mesh into meshlets and you know tight local bounds for each meshlet, you can do fine-grained per-meshlet viewport, backface and occlusion culling first. In this pass you don't have to access per-vertex data at all, which is a big saving.
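
A sketch of the kind of bounds-only culling this enables, assuming precomputed per-meshlet bounding spheres and normal cones (meshoptimizer-style cone cutoff; the struct layout is illustrative):

```cpp
#include <array>
#include <cmath>

struct Float3 { float x, y, z; };
struct Plane  { Float3 n; float d; };   // n.x*x + n.y*y + n.z*z + d = 0

// Bounds computed offline per meshlet; no vertex data is touched at cull time.
struct MeshletBounds {
    Float3 center;     // bounding sphere center
    float  radius;
    Float3 coneAxis;   // normal cone axis
    float  coneCutoff; // meshoptimizer-style cutoff
};

static float Dot(Float3 a, Float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Viewport (frustum) cull: sphere vs. 6 inward-facing planes.
bool IsOutsideFrustum(const MeshletBounds& m, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes)
        if (Dot(m.center, p.n) + p.d < -m.radius)
            return true; // fully outside this plane
    return false;
}

// Backface cull: every triangle in the meshlet faces away from the camera.
bool IsBackfacing(const MeshletBounds& m, Float3 cameraPos) {
    Float3 v = { m.center.x - cameraPos.x,
                 m.center.y - cameraPos.y,
                 m.center.z - cameraPos.z };
    return Dot(v, m.coneAxis) >= m.coneCutoff * std::sqrt(Dot(v, v)) + m.radius;
}
```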
If we assume that our clusters are small and spatially local, we can simply run the vertex shader on all vertices in each visible meshlet and rasterize them. With mesh shaders you can do this in a single pass without a memory round trip, but this only works in a forward shading setup.
The simplest approach with a V-buffer is to run the visible clusters' vertex shaders in a compute shader and write the results to memory. This is a bit similar to ARM's older mobile architectures. But you only write visible meshlets, not all the geometry, which is a big improvement.
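
In spirit, that compute pass looks like this (scalar C++ standing in for a compute shader, one lane per vertex; types and names are mine):

```cpp
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };
struct Mat4   { float m[16]; };          // row-major 4x4 matrix

static Float4 Transform(const Mat4& M, Float3 p) {
    return { M.m[0]*p.x  + M.m[1]*p.y  + M.m[2]*p.z  + M.m[3],
             M.m[4]*p.x  + M.m[5]*p.y  + M.m[6]*p.z  + M.m[7],
             M.m[8]*p.x  + M.m[9]*p.y  + M.m[10]*p.z + M.m[11],
             M.m[12]*p.x + M.m[13]*p.y + M.m[14]*p.z + M.m[15] };
}

struct Meshlet { uint32_t vertexOffset, vertexCount; };

// Transform every vertex of every *visible* meshlet exactly once and write
// the clip-space results to memory; the raster pass then only reads outClipPos.
void TransformVisibleMeshlets(const std::vector<Meshlet>& visible,
                              const std::vector<Float3>& positions,
                              const Mat4& viewProj,
                              std::vector<Float4>& outClipPos) {
    for (const Meshlet& m : visible)                  // one workgroup per meshlet
        for (uint32_t v = 0; v < m.vertexCount; ++v)  // one lane per vertex
            outClipPos[m.vertexOffset + v] =
                Transform(viewProj, positions[m.vertexOffset + v]);
}
```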
The way to do this without any memory round trip is to bin the clusters to screen-space tiles, just like a TBDR architecture bins triangles to screen-space tiles. But there's an order of magnitude less overhead, since clusters are coarser.
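
The binning step itself is simple: project each visible cluster's bounds to a conservative screen rect and append the cluster to every tile it overlaps. A sketch (tile size and names are my assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int kTileSize = 64; // coarse tiles, since clusters are coarse too

struct ScreenRect { int minX, minY, maxX, maxY; }; // conservative pixel bounds

// Append one cluster to every screen tile its projected bounds overlap.
// tileBins holds one cluster list per tile, laid out row by row.
void BinClusterToTiles(uint32_t clusterId, ScreenRect r,
                       std::vector<std::vector<uint32_t>>& tileBins,
                       int tilesX, int tilesY) {
    int tx0 = std::max(r.minX / kTileSize, 0);
    int ty0 = std::max(r.minY / kTileSize, 0);
    int tx1 = std::min(r.maxX / kTileSize, tilesX - 1);
    int ty1 = std::min(r.maxY / kTileSize, tilesY - 1);
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            tileBins[ty * tilesX + tx].push_back(clusterId);
}
```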
Now when you are shading your tile, you basically run a mesh shader for each visible cluster: one mesh wave per cluster, generating one chunk of triangles for the rasterizer.
This works fine. But even if you have 1:1 triangle:pixel dense geometry, many of your clusters will span multiple screen-space tiles and be transformed 2-4 times, depending on tile size of course. If you have a tiny local tile cache (similar to groupshared memory), the overhead is big.
So you would want to do per-triangle culling in the mesh shader. This is already possible with the triangle bit mask. But you still pay extra overhead for the vertex processing, which is hard to improve without repacking the vertex waves.
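
Sketched as scalar code, the mask such a mesh shader would compute per tile (backface + tile scissor; on the GPU each lane would test one triangle):

```cpp
#include <algorithm>
#include <cstdint>

struct Float2 { float x, y; };

// Signed area in screen space; <= 0 means degenerate or backfacing (CCW front).
static float SignedArea(Float2 a, Float2 b, Float2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Visibility bit mask for up to 64 triangles of one meshlet in one tile.
// screenPos: projected vertex positions, indices: 3 per triangle.
uint64_t BuildTriangleMask(const Float2* screenPos, const uint8_t* indices,
                           int triangleCount, Float2 tileMin, Float2 tileMax) {
    uint64_t mask = 0;
    for (int t = 0; t < triangleCount; ++t) {
        Float2 a = screenPos[indices[t * 3 + 0]];
        Float2 b = screenPos[indices[t * 3 + 1]];
        Float2 c = screenPos[indices[t * 3 + 2]];
        if (SignedArea(a, b, c) <= 0.0f)
            continue;                                  // backfacing
        float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
        float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
        if (maxX < tileMin.x || minX > tileMax.x ||
            maxY < tileMin.y || minY > tileMax.y)
            continue;                                  // misses this tile entirely
        mask |= 1ull << t;
    }
    return mask;
}
```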
If you wanted a hardware solution for TBDR GPUs, you would likely want an index list per meshlet, deduplicate the indices per tile, and run compacted vertex waves per tile. A bit like hardware index buffering. This has no memory round trip and no 2x vertex shading.
In 2015 I also thought about a couple of different software compaction schemes that I could run on my tile threads first, then run vertex work on those threads, and then pixel work. But load balancing is iffy when you only have a fixed number of threads for all the steps.
And that's why we ended up with the UV-buffer approach in the GPU-driven renderer I presented in 2015. I just didn't want to amplify the vertex workload by 3x+ (the equivalent of non-indexed geometry per pixel). I wanted a fixed cost with a very good cache hit rate.
But the UV-buffer has severe limitations and is not general-purpose enough for generic engines. Nanite shows that a V-buffer is shippable today on high end, if you are willing to lean on a temporal upscaler to reduce the extra per-pixel overhead.

advances.realtimerendering.com/s2015/aaltonen…
But on mobile you want a more optimal solution, especially on low/mid tier devices that are optimized for fast uniform buffer usage patterns instead of firing dozens of raw memory loads per pixel to load three vertices and their attributes.


More from @SebAaltonen

May 7
When you split a function into N different small functions, the reader also suffers multiple "instruction cache" misses (similar to a CPU executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.
Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice there. But often functions like this are a code smell. Why do you need those branches? Why is the function doing so many unrelated things? Maybe it's too generic? Refactor?
There's a rule of thumb that you write separate code for each call site until you have repeated yourself 3 times, and then merge them together. But people often forget the opposite: you have to split a function when the call sites' requirements diverge. Don't add more branches!
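
A toy illustration of that last point (names are mine): when two call sites start needing different behavior, split the function instead of adding a flag.

```cpp
struct Entity { /* ... */ };

// Before: one function grows a branch per call site.
void SerializeEntity(Entity& e, bool forNetwork) {
    if (forNetwork) { /* quantize, delta-encode, ... */ }
    else            { /* full precision for save files, ... */ }
}

// After: each call site gets straight-line code that reads like a book.
void SerializeEntityForNetwork(Entity& e) { /* quantize, delta-encode, ... */ }
void SerializeEntityForSave(Entity& e)    { /* full precision, ... */ }
```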
Jan 23
WebGPU CPU->GPU update paths are designed to be super hard to use. Mapping is async and you should not wait for it. Thus you can't map->write->render in the same frame.

wgpuQueueWriteBuffer runs on the CPU timeline. You need to wait for a callback to know the buffer is no longer in use.

Thread...
Waiting for a callback is not recommended on the web, and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data the GPU is still using.
You are not allowed to map or wgpuQueueWriteBuffer even a different region of a buffer used by any GPU frame in flight. You need an entirely different buffer.
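
A minimal sketch of that callback-driven staging ring, assuming the older webgpu.h signature of wgpuQueueOnSubmittedWorkDone (newer headers pass a callback-info struct); buffer creation and ring growth are omitted:

```cpp
#include <webgpu/webgpu.h>
#include <cstddef>

// One staging slot per frame in flight. A slot is reusable only after the
// work-done callback for the frame that used it has fired; if the ring runs
// dry, the caller must create a new buffer instead of blocking.
constexpr int kMaxSlots = 8; // assumption: more than the browser keeps in flight

struct StagingSlot {
    WGPUBuffer buffer  = nullptr; // created elsewhere; one buffer per slot
    bool       gpuDone = true;
};
static StagingSlot gSlots[kMaxSlots];

static void OnWorkDone(WGPUQueueWorkDoneStatus /*status*/, void* userdata) {
    static_cast<StagingSlot*>(userdata)->gpuDone = true; // safe to reuse now
}

// Upload this frame's data. Note: we never touch a buffer (not even a
// different region of it) that any in-flight GPU frame may still read.
bool UploadThisFrame(WGPUQueue queue, int frameIndex,
                     const void* data, size_t size) {
    StagingSlot& slot = gSlots[frameIndex % kMaxSlots];
    if (!slot.gpuDone)
        return false; // ring exhausted: grow it (create a new buffer) instead
    slot.gpuDone = false;
    wgpuQueueWriteBuffer(queue, slot.buffer, 0, data, size);
    wgpuQueueOnSubmittedWorkDone(queue, OnWorkDone, &slot);
    return true;
}
```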
Dec 17, 2024
I would really love to write a blog post about this whole debate. This is too complex a topic to discuss on Twitter.

Ubisoft was among the first to develop GPU-driven rendering and temporal upscaling. AC: Unity, Rainbow Six Siege, For Honor, etc. Our SIGGRAPH 2015 talk, etc...
Originally, TAA was seen as an improvement over screen-space post-process AA techniques as it provided subpixel information. It wasn't just a fancy blur algo. Today, people render so much noise. Noise increases the neighborhood bbox/variance, which increases ghosting.
Today people don't do 1:1 TAA anymore. The TAA shader does upscaling too, so you have maybe 1:4 samples compared to before, and a much noisier signal. Plus people feed this reconstructed image to a frame interpolator, which hallucinates even more data. All of this combined is the problem.
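
To make the bbox/ghosting link concrete: a typical TAA resolve clamps the reprojected history sample to the min/max box of the current frame's 3x3 neighborhood. Noisier input widens that box, so more stale history survives the clamp. A scalar sketch (illustrative, not any specific engine's resolve):

```cpp
#include <algorithm>

struct Color { float r, g, b; };

// Clamp the history color to the neighborhood min/max box. A noisy current
// frame widens the box, letting more stale history (ghosts) pass through.
Color ClampHistory(Color history, const Color nbr[9]) {
    Color lo = nbr[0], hi = nbr[0];
    for (int i = 1; i < 9; ++i) {
        lo.r = std::min(lo.r, nbr[i].r); hi.r = std::max(hi.r, nbr[i].r);
        lo.g = std::min(lo.g, nbr[i].g); hi.g = std::max(hi.g, nbr[i].g);
        lo.b = std::min(lo.b, nbr[i].b); hi.b = std::max(hi.b, nbr[i].b);
    }
    return { std::clamp(history.r, lo.r, hi.r),
             std::clamp(history.g, lo.g, hi.g),
             std::clamp(history.b, lo.b, hi.b) };
}
```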
Nov 22, 2024
AMD UDNA will be interesting.

The CDNA3 architecture is still based on GCN's 4-cycle wave64 scheduling. RDNA schedules every cycle and exposes instruction latency. The scheduler runs/blocks instructions concurrently and dynamically. RDNA is much closer to Nvidia GPUs.

Thread...
CDNA has wide matrix cores and other improvements for wide compute workloads, which AMD wants to bring to UDNA. It also has multi-chip scaling.

Rumors say that RDNA4 will finally have matrix cores in the consumer space. It seems AMD is integrating matrix cores into the RDNA lineup early.
My expectation is that the UDNA compute unit will be an RDNA4 descendant instead of a CDNA3 descendant. They definitely need 1-cycle low-latency scheduling in the consumer space, and Nvidia does well with it in the AI space too. I don't see them going back to a GCN-style design for UDNA.
Nov 20, 2024
I am impressed by our new WebGPU WASM page load time. The whole engine loads in just a few hundred milliseconds. And games load pretty much instantly too.
This is how fast things should load. We have spent a massive effort optimizing the data structures. I also rewrote the renderer.

You can't get these kinds of load times with lots of tiny memory allocs, shared pointers, hash map lookups, frequent mutexes and all the other slow stuff.
Wasm size = 29.2MB. It's not yet optimized with wasm-opt and not yet compressed. It will be smaller.

Games are not included in the WASM binary. They are loaded separately from our cloud server.
Oct 5, 2024
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core model; 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while 17% faster in Linux server workloads. Why? What can we do to make games scale?

Thread...
History lesson: the Xbox One and PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters, each with its own LLC. Communication between the clusters went through main memory, so you wanted to minimize data sharing between them to minimize the memory overhead.
6 cores were available to games, with 2 taken by the OS in the second cluster. So a game had 4+2 cores. Many games used the 4-core cluster to run a thread pool with a work-stealing job system. The second cluster's cores ran independent tasks such as audio mixing and background data streaming.
