Sebastian Aaltonen Profile picture
Oct 18 17 tweets 4 min read Read on X
Let's discuss why I think 4x4x4 tree is better than 2x2x2 (oct) tree for voxel storage.

It all boils down to link overhead and memory access patterns. L1$ hit rate is the most important thing for GPU performance nowadays.

Thread...
2x2x2 = uint8. That's one byte. Link = uint32 = 4 bytes. Total = 5 bytes.

4x4x4 = uint64. That's 8 bytes. Total (with link) = 12 bytes.

4x4x4 tree is half as deep as 2x2x2 tree. You get up/down twice as fast.
Voxel mask (in non-leaf nodes) tells us which children are present. You can do a popcnt (full rate instruction on GPU) to count the children. Children are allocated next to each other. Children sub-address can be calculated with binary prefix sum (= AND prev bits + popcnt).
In a 4x4x4 tree, our uint64 gives the ray at least 4 steps (DDA) worth of data using a single memory load. In best case 12 steps. These extra steps allow the ray to get further from a surface faster. When you exit the 4x4x4 brick, you are further away...
In a 2x2x2 tree, you go two steps and then go one level up, and then another step or two and another level up. 4x4x4 tree doesn't need the middle step at all. Middle step = different memory location = potential cache miss. We want to avoid this.
In Claybook we had 8-bit SDF mip chain. World gen generated every mip, but my ray-tracing algo only used every other mip as that was faster. The reason was better cache utilization. Every mip level is stored separately. Extra mips = bigger working set.

gdcvault.com/play/1025030/A…
This is the same reason why 4x4x4 tracing ends up being faster. It's touches less different cache lines, and transitions faster to near/far trace (fast empty space skip <-> near surface root finding).
2nd reason why 4x4x4 tree is better is the amount of metadata compared to payload. For each 64 bits of voxels, we have 32 bits of link (offset). That's 66% payload / 33% link. 2x2x2 tree has 20% payload / 80% link. In data structure design, you always want to minimize metadata.
3rd reason is cache line utilization. 5 byte node is just 3.9% of a 128 byte cache line. 12 byte node = 9.3%. Since children are allocated next to each other (shared link), a full set of children = 40 bytes in octtree and 768 bytes in 4x4x4 tree. 40 bytes = 31% of a cache line.
Thus even if we have all 8 children present, the octree still doesn't utilize the cache lines fully. Partially used cache lines will be evicted from the tiny L1$, and you pay for data you didn't use. In realistic case we have less than 8 children, so this issue is even worse.
For the 4x4x4 tree, the children often span multiple cache lines (up to 6), since they are continuously allocated. This is great for cache access pattern, especially in cases of nearby voxels representing a continuous surface, which is a common pattern for real content.
Why not 8x8x8 or 16x16x16 trees? If you already allocate children next to each other 4x4x4 is good enough as it puts children in the same cache line implicitly. 4x4x4 fits nicely in a single uint64, so it's efficient to load and process in a single register. 8x8x8 = 512 bits.
Since voxel octrees often encode just the surface, a more narrow region is more storage efficient. Bigger nodes increase the ratio of payload:metadata, but the ratio is already pretty good in 4x4x4 tree. Most nodes are leaf nodes and leaf nodes don't need the link.
The last reason to avoid 2x2x2 tree is that GPUs are not optimized for 1-byte memory loads. Leaf nodes without metadata are just 1-byte. 64-bit loads are much more optimal. You basically get 64-bits for the same price as 8-bits.
Of course we have to ask: Why not a single cache line aligned nodes. That's iffy because, A) are storing sparse data (surface near band is not a tile), B) GPU SIMD branch/loop coherency issues (homogenerous tiles are important). Allocating children contiguously is almost as good.
And most important last: Do not mix material/texture data in your ray-tracing data structure. Keep it minimal. Find the ray hits first. Then do material/texture stuff. You don't want to load material/texture data to L1$ for voxels that you miss.
If you don't want to store extra links for material/texture data, then you can just do 1:1 mapping (two base pointers) or AoSoA-style alternating arrays with cache line aware strides. There's lots of ways to achieve material/texture data separation efficiently.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Sebastian Aaltonen

Sebastian Aaltonen Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @SebAaltonen

Oct 18
People often think voxels take a lot of storage. Let's compare a smooth terrain. Height map vs voxels.

Height map: 16bit 8192x8192 = 128MB

Voxels: 4x4x4 brick = uin64 = 8 bytes. We need 2048x2048 bricks to cover the 8k^2 terrain surface = 32MB. SVO/DAG upper levels add <10%
The above estimate is optimistic. If we have a rough terrain, we end up having two bricks on top of each other in most places. Thus we have 64MB worth of leaf bricks. SVO/DAG upper levels don't increase much (as we use shared child pointers). Total is <70MB. Still a win.
Each brick has uint64 voxel mask (4x4x4) and 32 bit shared child data pointer (can address 16GB of voxel data due to 4 byte alignment). A standard brick is 12 bytes. Leaf bricks are just 8 bytes, they don't have the child pointer (postfix doesn't cripple SIMD coherence).
Read 4 tweets
Oct 11
"It's much faster, performance and to load things"

Zuck's reasoning started with performance. Performance matters. Google maps won because of performance, Nokia lost because of Symbian OS not being designed for real-time systems (touchscreen needs it). Unity has to wake up.
Performance was also the main reason Unity's Weta Digital acquisition failed. You can't just buy a movie company and believe that real time game engine + movies = real time movies. A massive amount of optimization and architecture refactoring work is required to make it happen.
This was also the main reason I left Unity. I got tired of pushing performance to middle/top level managers who didn't seem to understand that we NEED to prioritize high performance tech such as DOTS and GPU-driven rendering if we want to make real-time movies happen.
Read 10 tweets
Sep 30
When you design data structures, always think in cache lines (64B or 128B). You don't want to have tiny nodes scattered around the memory. Often it's better to have wider nodes (preferably 1 cache line each) and shallower structures. Less pointer/offset indirections.
When implementing spatial data structures, you also need to think about spatial locality. If you have early outs, then consider embedding early out conditions as a bitfield to the spatial data structure directly, instead of fetching objects before the early out.
Consider embedding small objects directly. Example: Claybook's GPGPU fluid sim got 2x faster when I embedded particles inside the SPH grid. Previously the particle pool was accessed with 32-bit indices. New design improved GPU L1$ hit rate to 90% (from 60%) = 4x less misses.
Read 13 tweets
Sep 28
What do you guys do when you get a sudden urge to build a 100,000 player MMO with fully destructible world? You do the math that it could run on a single 128 core Threadripper server (everybody in the same world) with your crazy netcode ideas and super well optimized C code...
Before Claybook we had a multiplayer prototype with 1GB SDF world state modified with commands (deterministic static world state, undeterministic dynamic object state). Tiny network traffic. Could easily scale this idea to 1TB world (2TB RAM on that Threadripper)...
HypeHype has interesting ghost multiplayer, where you don't directly interact with other players. They are ghosts for you, but you can interact with events. This reduces latency concerns. This idea can be taken further.
Read 17 tweets
Sep 26
Finland was a key tech player 20 years ago: We invented SSH and IRC protocols. Nokia was EUs most expensive company, selling more phones yearly than Apple and Samsung sell today combined. We invented the OS that runs most internet servers today. Nokia failed and Linux is free...
Finland has some new successes: Wolt being the biggest EU food delivery service, Oura being the first health ring and Silo AI being one of EUs biggest AI companies. Wolt got sold to Doordash ($3.5B), Silo AI got sold to AMD ($665M). Oura is still a $11B Finnish company.
We also had a Finnish Facebook before Facebook. Irc Galleria was used by all young adults when I was studying at university. Most girls I knew used it, which should have told investors a lot. But they never made it international.
Read 6 tweets
Sep 19
I have realized that there's not that many people out there who understand the big picture of modern GPU hardware and all the APIs: Vulkan 1.4 with latest extensions, DX12 SM 6.6, Metal 4, OpenCL and CUDA. What is the hardware capable of? What should a modern API look like?
My "No Graphics API" blog post will discuss all of this. My conclusion is that Metal 4.0 is actually closest to the goal. It has flaws too. DX12 SM 6.6 doesn't have those particular flaws, but has a lot of other flaws. Vulkan has all the flaws combined, with useful extensions :)
Of course WebGPU doubled down on Vulkan's design mistakes. Bind groups are immutable and there's no escape hatches for dynamic bindings. No persistently mapped GPU memory. And a brand new shader language without 64-bit pointer support.
Read 20 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(