Sebastian Aaltonen
Jul 19, 2023
Thread comparing V-buffer and hardware TBDR approaches.

Some of my TBDR/POSH thoughts here:
When I was designing a V-buffer style renderer in 2015, I was a bit concerned about having to run the vertex shader 3 times per pixel. People might say that this is fine if you are targeting 1:1 pixel:triangle density like Nanite does, but that's cutting corners...
If you look at a generic triangle grid (like a terrain or a highly tessellated object surface), you have N*N vertices and (N-1)*(N-1)*2 triangles. Shading each vertex once and sharing the results costs N^2. Shading 3 vertices per triangle costs ~6x more. That's a significant cost.
A common V-buffer implementation is thus equivalent to non-indexed geometry, even on 1:1 pixel:triangle geometry, so you pay the 6x overhead. The algorithmic gain of having a roughly constant number of triangles on screen is of course massive, but we still don't want the 6x overhead.
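A quick back-of-the-envelope check of that ratio (the grid size here is purely illustrative):

```cpp
#include <cstdio>

int main() {
    const long long n = 1024;                        // grid resolution (illustrative)
    const long long vertices   = n * n;              // shared vertices, shaded once
    const long long triangles  = (n - 1) * (n - 1) * 2;
    const long long nonIndexed = triangles * 3;      // 3 vertex invocations per triangle
    std::printf("indexed: %lld, non-indexed: %lld, ratio: %.2fx\n",
                vertices, nonIndexed,
                double(nonIndexed) / double(vertices));  // approaches 6x as n grows
    return 0;
}
```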
A hardware TBDR implementation thus runs a POSH (position shader) first to bin the triangles to tiles, and then does per-tile index deduplication before running the full attribute (vertex) shader. IMR index buffer hardware is similar, so this is nothing dramatically new.
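A software sketch of that per-tile deduplication step (the hardware does this in fixed function; all names here are illustrative):

```cpp
#include <cstdint>
#include <vector>
#include <unordered_map>

// Walk the indices of the triangles binned to one tile, shade each unique
// vertex once, and emit remapped indices into the compacted vertex list.
void dedupTileIndices(const std::vector<uint32_t>& tileIndices,   // 3 per triangle
                      std::vector<uint32_t>& uniqueVertices,      // vertices to shade
                      std::vector<uint32_t>& remappedIndices)     // rewritten indices
{
    std::unordered_map<uint32_t, uint32_t> remap;
    remappedIndices.reserve(tileIndices.size());
    for (uint32_t idx : tileIndices) {
        auto [it, inserted] = remap.try_emplace(idx, uint32_t(remap.size()));
        if (inserted) uniqueVertices.push_back(idx);  // first time seen: shade once
        remappedIndices.push_back(it->second);
    }
}
```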
But there's a cost to fetching the vertex positions (and skinning matrices, etc.) and running the vertex transform twice. The index/vertex buffer abstraction is not perfect for TBDR. The same is true for mesh shaders: you want to deduplicate offline and split the mesh into small meshlets.
When you have preprocessed your mesh into meshlets and know tight local bounds for each meshlet, you can do fine-grained per-meshlet viewport, backface and occlusion culling first. In this pass you don't have to access per-vertex data at all, which is a big saving.
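A minimal sketch of such a per-meshlet test, assuming precomputed bounding spheres and normal cones in the meshlet data. The struct and names are illustrative, and the cutoff convention follows one common formulation of the normal-cone backface test:

```cpp
#include <cmath>

struct float3 { float x, y, z; };

// Hypothetical per-meshlet culling data, precomputed offline.
struct MeshletBounds {
    float3 center;      // bounding sphere center
    float  radius;      // bounding sphere radius
    float3 coneAxis;    // average triangle facing direction
    float  coneCutoff;  // cosine-based cutoff of the normal cone
};

static float  dot3(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static float3 sub3(float3 a, float3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static float  len3(float3 a)           { return std::sqrt(dot3(a, a)); }

// The whole meshlet faces away from the camera if the view direction lies
// outside the normal cone. No per-vertex data is touched.
bool meshletBackfaceCulled(const MeshletBounds& m, float3 cameraPos) {
    float3 toSphere = sub3(m.center, cameraPos);
    float  dist     = len3(toSphere);
    return dot3(toSphere, m.coneAxis) >= m.coneCutoff * dist + m.radius;
}
```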
If we assume that our clusters are small and spatially local, we can simply run the vertex shader on all vertices of each visible meshlet and rasterize them. With mesh shaders you can do this in a single pass without a memory round trip, but that works only in a forward shading setup.
The simplest approach with a V-buffer is to run the visible clusters' vertex shaders in a compute shader and write the results to memory. This is a bit similar to ARM's older mobile architectures, except you only write the visible meshlets, not all the geometry, which is a big improvement.
The way to do this without any memory round trip is to bin the clusters to screen space tiles, just like a TBDR architecture bins triangles to screen space tiles. But there's an order of magnitude less overhead, since clusters are much coarser.
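A CPU-flavored sketch of that coarse binning, assuming each visible cluster already has a screen-space bounding rect. The tile size and all names are assumptions; a GPU version would append with atomics instead of std::vector:

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

constexpr int kTileSize = 64;   // pixels per tile side (assumption)

struct ClusterRect { int minX, minY, maxX, maxY; };  // screen-space pixel rect

// Scan each cluster's rect over the tile grid and append the cluster index
// to every tile it overlaps.
void binClusters(const std::vector<ClusterRect>& clusters,
                 int screenW, int screenH,
                 std::vector<std::vector<uint32_t>>& tileLists) {
    const int tilesX = (screenW + kTileSize - 1) / kTileSize;
    const int tilesY = (screenH + kTileSize - 1) / kTileSize;
    tileLists.assign(size_t(tilesX) * tilesY, {});
    for (uint32_t c = 0; c < clusters.size(); ++c) {
        const ClusterRect& r = clusters[c];
        int tx0 = std::max(r.minX / kTileSize, 0);
        int ty0 = std::max(r.minY / kTileSize, 0);
        int tx1 = std::min(r.maxX / kTileSize, tilesX - 1);
        int ty1 = std::min(r.maxY / kTileSize, tilesY - 1);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                tileLists[size_t(ty) * tilesX + tx].push_back(c);
    }
}
```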
Now when you are shading your tile, you basically run a mesh shader for each visible cluster: one mesh wave per cluster, generating one chunk of triangles for the rasterizer.
This works fine. But even with 1:1 triangle:pixel dense geometry, many of your clusters will span multiple screen space tiles and be transformed 2-4 times, depending on tile size. If you have a tiny local tile cache (similar to groupshared memory), the overhead is big.
So you would want to do per-triangle culling in the mesh shader. This is already possible with the triangle bit mask. But you still pay extra overhead for the vertex processing, which is hard to improve without repacking the vertex waves.
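A sketch of consuming such a triangle bit mask inside the tile pass, assuming clusters of at most 64 triangles (the emit callback is a placeholder). Note this skips dead triangles but, as said above, does nothing for the vertex work:

```cpp
#include <cstdint>
#include <bit>   // C++20: std::countr_zero

// Bit i is set if triangle i of the cluster overlaps this tile and survived
// backface culling.
template <typename EmitFn>
void emitSurvivingTriangles(uint64_t triangleMask, EmitFn emit) {
    while (triangleMask != 0) {
        int tri = std::countr_zero(triangleMask);  // index of lowest set bit
        emit(tri);                                 // raster/shade triangle 'tri'
        triangleMask &= triangleMask - 1;          // clear lowest set bit
    }
}
```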
If you wanted a hardware solution for TBDR GPUs, you would likely want an index list per meshlet, deduplicate those per tile, and run compacted vertex waves per tile. A bit like hardware index buffering. This has no memory round trip and no 2x vertex shading.
In 2015 I also thought about a couple of different software compaction schemes that I could run on my tile threads first, then run vertex work on those threads, and then pixel work. But load balancing is iffy when you only have a fixed number of threads for all the steps.
And that's why we ended up with the UV-buffer approach in the GPU-driven renderer I presented in 2015. I just didn't want to amplify the vertex workload by 3x+ (the per-pixel equivalent of non-indexed geometry). I wanted a fixed cost with a very good cache hit rate.
But the UV-buffer has severe limitations and is not general purpose enough for generic engines. Nanite shows that a V-buffer is shippable today on high-end hardware, if you are willing to lean on a temporal upscaler to reduce the extra per-pixel overhead.

advances.realtimerendering.com/s2015/aaltonen…
But on mobile you want a more optimal solution, especially on low/mid tier devices that are optimized for fast uniform buffer usage patterns instead of firing dozens of raw memory loads per pixel to fetch three vertices and their attributes.


More from @SebAaltonen

Nov 2
Wouldn't this be a lovely hosted server for a hobby proto-MMO project? 48-core Threadripper, 256GB RAM, 4TB SSD, 1Gbit/s unlimited.

Should be able to handle 10,000 players just fine. That's a start. 1Gbit/s = 100MB/s, so 10KB/s send+receive for each player = great!
I was talking about 100,000 players before, but that's an aspirational goal for a real MMO game with paid customers. 10,000 players is a fine starting point for prototyping. It will be difficult to even get that many players, even if it's a free web game (no download).
10k players' data replicated to 10k players = 100M player states sent per second. At 100MB/s send bandwidth this means 1 byte per player state on average. That's more than enough with a great compressor. Netflix's video compressor uses ~0.1 bits per pixel.
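The arithmetic, spelled out (numbers from the tweets above; once-per-second full replication is the stated framing):

```cpp
#include <cstdio>

int main() {
    const double players       = 10'000.0;
    const double bandwidth     = 100e6;              // ~1 Gbit/s ~= 100 MB/s
    const double statesPerSec  = players * players;  // everyone sees everyone: 1e8
    const double bytesPerState = bandwidth / statesPerSec;
    std::printf("budget per replicated player state: %.2f bytes/s\n", bytesPerState);
    // ~1 byte per player state per second: viable only with aggressive
    // delta compression and prediction, as the thread argues.
    return 0;
}
```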
Nov 1
It's depressing that software engineering mostly wastes the hardware advances to make programming "easier" and "cheaper" = sloppy code. Every 2 decades we get ~1000x faster hardware (Moore's law).

I'd like to see real improvements, like 1000x more players in MP:
If people still wrote code as optimally as Carmack, I and others did in the late 90s, we could achieve things that people today think are not even possible. Those things are achievable if we really want them. And that's why I think I need to do this hobby project too.
We wrote a real-time MP game for the Nokia N-Gage: in-order 100MHz CPU, no FPU, no GPU, 16MB RAM, 2G GPRS modem with 1 second latency between players. We had rollback netcode (one of the first). We just have to think outside the box to make it happen. Why is nobody doing that anymore?
Nov 1
I've been thinking about a 100,000 player MMO recently (1 server, 1 world) with fully distributed physics (a bit like parallel GPGPU physics). It needs a very good predictive data compressor; ideas can be borrowed from video compressors. 4K = 8 million pixels. I have only 100k...
100k players sending their state to the server is not a problem. That's O(n). The big problem is updating every other player's state to every player. That's O(n^2), and at 100k players that's 100k*100k = 10G. The server obviously can't send 10G player state updates at an acceptable rate.
There must be a fixed budget per player, otherwise the server will choke. This is similar to the fixed bandwidth rate in video compressors: if there's too much hard-to-compress new information, the quality automatically drops.
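One possible way to enforce such a fixed budget, sketched under the video-encoder analogy. The priority heuristic and all names are illustrative, not a described implementation:

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

// Candidate updates for one client are sorted by priority (e.g. proximity,
// recent change magnitude) and packed until the byte budget runs out.
// Whatever doesn't fit degrades gracefully to a later tick, like rate
// control in a video encoder.
struct PendingUpdate {
    uint32_t playerId;
    uint32_t sizeBytes;   // compressed size estimate
    float    priority;    // higher = more important to this client
};

void packTick(std::vector<PendingUpdate>& pending, uint32_t budgetBytes,
              std::vector<uint32_t>& outPlayerIds) {
    std::sort(pending.begin(), pending.end(),
              [](const PendingUpdate& a, const PendingUpdate& b) {
                  return a.priority > b.priority;
              });
    uint32_t used = 0;
    for (const PendingUpdate& u : pending) {
        if (used + u.sizeBytes > budgetBytes) continue;  // retry next tick
        used += u.sizeBytes;
        outPlayerIds.push_back(u.playerId);
    }
}
```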
Oct 23
AI-generated C is the real deal. C coders wrote fast & simple code: no high-frequency heap allocs, no abstractions slowing the compiler down. There's lots of good C example code around. AI workflows need a language with fast iteration time. Why waste compile time and perf on modern languages?
If you generate C++ with AI, it will use smart pointers and short-lived temp std::vectors and std::strings, like all slow C++ code bases do. Lots of tiny heap allocs. Idiomatic Rust is slightly better, but still has a lot more heap allocs than C. It's just too easy to allocate.
And why would you even think about generating Python with AI? Why would you choose a 100x slower language if the AI is writing it instead of you? The same applies to JavaScript and other common internet languages. Just generate C and compile to WASM. Nothing runs faster.
Oct 18
Let's discuss why I think a 4x4x4 tree is better than a 2x2x2 (oct) tree for voxel storage.

It all boils down to link overhead and memory access patterns. L1$ hit rate is the most important thing for GPU performance nowadays.

Thread...
2x2x2 = uint8 mask. That's one byte. Link = uint32 = 4 bytes. Total = 5 bytes.

4x4x4 = uint64 mask. That's 8 bytes. Total (with link) = 12 bytes.

A 4x4x4 tree is half as deep as a 2x2x2 tree, so you get up/down twice as fast.
The voxel mask (in non-leaf nodes) tells us which children are present. You can do a popcnt (a full-rate instruction on GPUs) to count the children. Children are allocated next to each other, so a child's sub-address can be calculated with a binary prefix sum (= AND with a mask of the preceding bits + popcnt).
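That sub-address calculation as code (a sketch; the slot numbering and names are assumptions):

```cpp
#include <cstdint>
#include <bit>   // C++20: std::popcount

// Children are allocated contiguously, so a child's offset is the number of
// set mask bits that precede it. 'childSlot' is the 0..63 position of the
// child inside the 4x4x4 mask.
uint32_t childOffset(uint64_t childMask, uint32_t childSlot) {
    uint64_t precedingBits = childMask & ((uint64_t(1) << childSlot) - 1);
    return uint32_t(std::popcount(precedingBits));  // full-rate popcnt on GPUs too
}

// The child node address is then: basePointer + childOffset(mask, slot).
```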
Oct 18
People often think voxels take a lot of storage. Let's compare them on a smooth terrain: height map vs voxels.

Height map: 16-bit 8192x8192 = 128MB.

Voxels: a 4x4x4 brick = uint64 = 8 bytes. We need 2048x2048 bricks to cover the 8k^2 terrain surface = 32MB. SVO/DAG upper levels add <10%.
The above estimate is optimistic. If we have rough terrain, we end up having two bricks on top of each other in most places. Thus we have 64MB worth of leaf bricks. The SVO/DAG upper levels don't grow much (as we use shared child pointers). Total is <70MB. Still a win.
Each brick has a uint64 voxel mask (4x4x4) and a 32-bit shared child data pointer (which can address 16GB of voxel data due to 4-byte alignment). A standard brick is 12 bytes. Leaf bricks are just 8 bytes: they don't have the child pointer (the postfix layout doesn't cripple SIMD coherence).
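A sketch of the implied node layout plus the storage arithmetic. Note that a naive C++ struct pads to 16 bytes; the 12-byte figure assumes tight packing, e.g. separate mask and pointer streams:

```cpp
#include <cstdint>
#include <cstdio>

// Interior node: 4x4x4 occupancy mask + 32-bit child pointer. Leaf bricks
// drop the pointer and are just the 8-byte mask. Store the two fields in
// separate streams to hit 12 bytes (this struct alone pads to 16).
struct Brick {
    uint64_t occupancyMask;   // one bit per voxel in the 4x4x4 brick
    uint32_t firstChild;      // shared child data pointer (interior nodes only)
};

int main() {
    // Height map: 8192 x 8192 x 16-bit samples.
    const double heightMapMB = 8192.0 * 8192.0 * 2.0 / (1 << 20);
    // Voxel surface: 2048 x 2048 leaf bricks of 8 bytes each.
    const double voxelMB = 2048.0 * 2048.0 * 8.0 / (1 << 20);
    std::printf("height map: %.0f MB, voxel leaf bricks: %.0f MB\n",
                heightMapMB, voxelMB);  // 128 MB vs 32 MB
    return 0;
}
```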