Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Sebastian Aaltonen

@SebAaltonen

Jul 19, 2023 • 18 tweets • 4 min read • Read on X

https://x.com/SebAaltonen/status/1680586056352120833?s=20

Thread comparing V-buffer and hardware TBDR approaches.

Some of my TBDR/POSH thoughts here:

https://x.com/SebAaltonen/status/1680586056352120833?s=20

When I was designing V-buffer style renderer in 2015 I was a bit concerned about having to run vertex shader 3 times per pixel. People might say that this is fine if you are targeting 1:1 pixel:triangle density like Nanite does, but that's cutting corners...

If you look at a generic triangle grid (like terrain on highly tessellated object surface), you have N*N vertices and (N-1) * (N-1) * 2 triangles. Shading vertices once and sharing the results costs N^2. Shading 3x vertices per triangle costs 6x more. That's a significant cost.

Common V-buffer implementation thus is equal to non-indexed geometry even on 1:1 pixel:triangle geometry. So you pay the 6x overhead. The algorithmic gain of having roughly constant amount of triangles in screen is of course a massive gain, but still we don't want 6x overhead.

A hardware TBDR implementation thus runs POSH (position shader) first to bin the triangles to tiles and then does per-tile index deduplication before running the full attribute (vertex) shader. IMR index buffer hardware is similar, so this is nothing dramatically new.

But there's a cost of fetching the vertex positions (and skinning matrices, etc) and running the vertex transform twice. The index/vertex buffer abstraction is not perfect for TBDR. The same is true for mesh shaders. You want to deduplicate offline and split to small meshlets.

When you have preprocessed your mesh to meshlets and you know tight local bounds to each meshlet, you can do fine grained per-meshlet viewport, backface and occlusion culling first. In this pass you don't have to access per-vertex data at all, which is a big saving.

If we assume that our clusters are small and area local, we can simply run vertex shader to all vertices in each visible meshlet and raster them. With mesh shaders you can single pass this without memory roundtrip, but this works only in forward shading setup.

Simplest approach with V-buffer is to run visible cluster vertex shaders in compute shader and write them to memory. This is a bit similar to ARMs older mobile architectures. But you only write visible meshlets, not all the geometry, which is a big improvement.

The way to do this without any memory roundtrip is to bin the clusters to screen space tiles, just like TBDR architecture bins triangles to screen space tiles. But there's order of magnitude less overhead, since clusters are more coarse.

Now when you are shading your tile, you basically run mesh shader to each visible cluster. One mesh wave per cluster generating one chunk of triangles for the rasterizer.

This works fine. But even if you have 1:1 triangle:pixel dense geometry, many of your clusters will span multiple screen space tiles and be transformed 2-4 times. Depends on tile size of course. If you have tiny local tile cache (similar to groupshared mem) then overhead is big.

So you would want to do per-triangle culling in the mesh shader. This is already possible with the triangle bit mask. But you still pay extra overhead for the vertex processing. This is hard to improve without having to repack the vertex waves.

If you wanted to do a hardware solution for TBDR GPUs, you would likely want to have an index list per meshlet and deduplicate those per tile and run compacted vertex waves per tile. A bit like HW index buffering. This has no memory roundtrip and no 2x vertex shading.

In 2015 I also thought about a couple of different software compaction schemes that I could run on my tile threads first, and then run vertex work on those threads and then pixel work. But load balancing is iffy when you only have fixed amount of threads for all the steps.

And that's why we ended up with the UV-buffer approach in the GPU-driven renderer I presented in 2015. I just didn't want to amplify the vertex workload by 3x+ (equivalent of non-indexed geometry per pixel). I wanted fixed cost with very good cache hit rate.

But the UV-buffer has severe limitations and is not general purpose enough for generic engines. Nanite shows that V-buffer is shippable today on high end, if you want to lean on temporal upscaler to reduce the extra per-pixel overhead.

advances.realtimerendering.com/s2015/aaltonen…

But on mobile you want more optimal solution. Especially on low/mid tier devices that are optimized for fast uniform buffer usage patterns intead of firing dozens of raw mem loads per pixel to load three vertices and their attributes per pixel.

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @SebAaltonen

Sebastian Aaltonen

@SebAaltonen

May 30

The past decades have been a wonderful time for gamers+devs. The biggest chips, using the latest nodes and trillions worth of R&D, were all targeted at gaming. Now, those chips are needed by professionals (AI). We'll never see a big die GPU at a reasonable price point anymore :(

The fun lasted for a very long time, but it's over in both CPU and GPU side. The biggest CPU and GPU dies are no longer designed for gamers. Top end Threadripper costs over 10k$ today. Top end Nvidia B200 costs over 30k$. Few generations ago top tier HW was targeting gamers :(

AMD no longer produces big-die GPUs for gamers. Nvidia has a low-volume 2500$+ Halo product. But it's much smaller than Nvidia's B200 GPU, which has two glued dies, each slightly bigger than RTX 5090. Chiplet GPUs like Threadripper are coming. Gaming GPUs limited to few chiplets?

Read 4 tweets

Sebastian Aaltonen

@SebAaltonen

May 18

https://twitter.com/kerckhove_ts/status/1923365950876729726

Unit tests have lots of advantages, but cons are ignored:
- Code must be split to testable parts. Often requiring more interfaces, which add code bloat and complexity.
- Each call site is a dependency. Test case = +1 dependency. Added inertia to refactor and throw away code.
...

https://twitter.com/kerckhove_ts/status/1923365950876729726

- Bloated unit test suites taking several hours to execute. Slows down devs and causes merge conflicts as pushes are delayed.
- Unstable tests randomly failing pushes.
- Unit test maintenance and optimization needed to keep tests manageable. Otherwise developer velocity hurts.

It's crucial to make your unit tests fast. Don't load files from disk and definitely don't do network requests. Embed data (bin->hdr tool for example). If your whole test suite runs in <10 seconds, then you are golden. But writing good optimized tests like this takes effort.

Read 9 tweets

Sebastian Aaltonen

@SebAaltonen

May 7

https://twitter.com/Jonathan_Blow/status/1919847600527716561

When you split a function to N different small functions, the reader also suffers multiple "instruction cache" misses (similar to CPU when executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.

https://twitter.com/Jonathan_Blow/status/1919847600527716561

Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice here. But often functions like this are a code smell. Why do you need those branches? Why is the function doing too many unrelated things? Maybe too generic? Refactor?

There's a rule of thumb that you write separate code for each call site until you have repeated yourself 3 times. Then you merge these together. But people often forget the opposite: You have to split a function if the call site requirements change. Don't add more branches!

Read 5 tweets

Sebastian Aaltonen

@SebAaltonen

Jan 23

WebGPU CPU->GPU update paths are designed to be super hard to use. Map is async and you should not wait for it. Thus you can't map->write->render in the same frame.

wgpuQueueWriteBuffer runs in CPU timeline. You need to wait for callback to know buffer is not in use.

Thread...

Waiting for callback is not recommended in web and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data used by GPU.

You are not allowed to map or wgpuQueueWriteBuffer a different region of a buffer used by any GPU frame in flight. You need entirely different buffer.

Read 10 tweets

Sebastian Aaltonen

@SebAaltonen

Dec 17, 2024

https://twitter.com/ThreatInteract/status/1867084725988516101

I would really love to write a blog post about this whole debate. This is a too complex topic to discuss on Twitter.

Ubisoft was among the first to develop GPU-driven rendering and temporal upscaling. AC: Unity, Rainbow Six Siege, For Honor, etc. Our SIGGRAPH 2015 talk, etc...

https://twitter.com/ThreatInteract/status/1867084725988516101

Originally, TAA was seen as an improvement over screen space post-process AA techniques as it provided subpixel information. It wasn't just a fancy blur algo. Today, people render so much noise. Noise increases neighborhood bbox/variance. Which increases ghosting.

Today people don't do 1:1 TAA anymore. TAA shader does upscaling too. Thus you have maybe 1:4 samples compared to before. And much noisier signal. Plus people feed this reconstructed image to frame interpolator, which hallucinates even more data. This all combined is the problem.

Read 6 tweets

Sebastian Aaltonen

@SebAaltonen

Nov 22, 2024

AMD UDNA will be interesting.

CDNA3 architecture is still based on GCN 4 cycle wave64 scheduling. RDNA schedules every cycle and exposes instruction latency. Scheduler runs/blocks instructions concurrently/dynamically. RDNA is much closer to Nvidia GPUs.

Thread...

CDNA has wide matrix cores and other wide compute workload improvements, which AMD wants to bring to UDNA. It also has multi-chip scaling.

Rumors tell that RDNA4 will finally have matrix cores in consumer space. Seems that AMD is integrating matrix cores early to RDNA lineup.

My expectation is that UDNA compute unit will be RDNA4 descendant instead of CDNA3 descendant. They definitely need 1 cycle low latency scheduling in consumer space, and Nvidia does well with it in AI space too. I don't see them going back to GCN-style design for UDNA.

Read 8 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Sebastian Aaltonen

Try unrolling a thread yourself!

More from @SebAaltonen

Sebastian Aaltonen

Sebastian Aaltonen

Sebastian Aaltonen

Sebastian Aaltonen

Sebastian Aaltonen

Sebastian Aaltonen

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!