Finished watching all the Nanite presentations and reading the RenderDoc analysis blog posts people have written.

It's time to do a thread comparing Nanite to our (Ubisoft/RedLynx) 2015 GPU-driven rendering tech. The base pieces are the same, but Nanite adds some awesome new/old innovations.
Our joint GPU-driven rendering SIGGRAPH presentation with the Assassin's Creed team.

advances.realtimerendering.com/s2015/aaltonen…
Mesh Clustering: Meshes are split into small clusters based on locality. Back then we used strip clustering, but switched to index buffer generation. Everybody nowadays uses index buffer generation (or mesh shaders). Our cluster size was 64; Nanite uses 128.
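A minimal offline clustering sketch in C++. The 64/128 triangle budgets come from the thread above; everything else, including the naive consecutive-triangle split, is illustrative only (a real tool would partition triangles by spatial locality first):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct float3 { float x, y, z; };

struct Cluster {
    uint32_t firstIndex;     // offset into the shared index buffer
    uint32_t triangleCount;  // <= kClusterTriangles
    float3   boundsCenter;   // bounding sphere used later for GPU culling
    float    boundsRadius;
};

constexpr uint32_t kClusterTriangles = 128; // 64 in the 2015 tech, 128 in Nanite

// Naive splitter: consecutive triangles go into the same cluster. A real tool
// would first sort/partition triangles by spatial locality.
std::vector<Cluster> BuildClusters(const std::vector<float3>& positions,
                                   const std::vector<uint32_t>& indices)
{
    std::vector<Cluster> clusters;
    const uint32_t triangleCount = uint32_t(indices.size() / 3);

    for (uint32_t first = 0; first < triangleCount; first += kClusterTriangles) {
        Cluster c{};
        c.firstIndex    = first * 3;
        c.triangleCount = std::min(kClusterTriangles, triangleCount - first);

        // Conservative bounding sphere: AABB center, radius to the AABB corner.
        float3 mn = positions[indices[c.firstIndex]];
        float3 mx = mn;
        for (uint32_t i = 1; i < c.triangleCount * 3; ++i) {
            const float3& p = positions[indices[c.firstIndex + i]];
            mn = { std::min(mn.x, p.x), std::min(mn.y, p.y), std::min(mn.z, p.z) };
            mx = { std::max(mx.x, p.x), std::max(mx.y, p.y), std::max(mx.z, p.z) };
        }
        c.boundsCenter = { (mn.x + mx.x) * 0.5f, (mn.y + mx.y) * 0.5f, (mn.z + mx.z) * 0.5f };
        const float dx = mx.x - c.boundsCenter.x;
        const float dy = mx.y - c.boundsCenter.y;
        const float dz = mx.z - c.boundsCenter.z;
        c.boundsRadius = std::sqrt(dx * dx + dy * dy + dz * dz);
        clusters.push_back(c);
    }
    return clusters;
}
```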
We had a GPU-driven LOD solution: choose a set of clusters for each instance based on distance. Artists had to manually author the LODs, and the tech demanded very good LODs. Nanite is much better here: it has an innovative per-cluster seamless LOD solution and automatic LOD authoring.
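Roughly what instance-granularity LOD selection could look like. The struct layout and thresholds below are hypothetical; in the 2015 tech this ran as a GPU pass over all instances, shown here as plain C++ for clarity:

```cpp
#include <cstdint>

// Hypothetical per-instance LOD data (hand-authored LODs, artist switch distances).
struct InstanceLods {
    uint32_t clusterOffset[4]; // first cluster of each hand-authored LOD
    uint32_t clusterCount[4];
    float    lodDistance[4];   // artist-authored switch distances
    uint32_t lodCount;
};

uint32_t SelectLod(const InstanceLods& lods, float distanceToCamera)
{
    for (uint32_t i = 0; i + 1 < lods.lodCount; ++i)
        if (distanceToCamera < lods.lodDistance[i])
            return i; // clusters [clusterOffset[i], clusterOffset[i] + clusterCount[i]) get drawn
    return lods.lodCount - 1; // coarsest LOD
}
```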
This is our two-phase occlusion culling solution: use previous-frame visibility as a starting point for the first pass, then fill in the missing clusters in the second pass. RenderDoc captures show that Nanite uses the same algorithm. It's a good algorithm; Media Molecule's Dreams also uses it.
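A CPU-side sketch of the two-phase flow. The real passes are GPU compute/draw work; the names and data layout here are made up, and the HiZ depth-pyramid test is passed in as a placeholder:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct VisibilityState {
    std::vector<uint8_t> visibleLastFrame; // one flag per cluster
};

void TwoPhaseCull(VisibilityState& state,
                  const std::function<bool(uint32_t cluster)>& occludedByHiZ,
                  std::vector<uint32_t>& pass1Draws,
                  std::vector<uint32_t>& pass2Draws)
{
    const uint32_t clusterCount = uint32_t(state.visibleLastFrame.size());

    // Pass 1: draw what was visible last frame. This produces most of the depth
    // buffer, from which the depth pyramid (HiZ) is then built.
    pass1Draws.clear();
    for (uint32_t c = 0; c < clusterCount; ++c)
        if (state.visibleLastFrame[c])
            pass1Draws.push_back(c);

    // ... the GPU builds the depth pyramid from pass 1's depth here ...

    // Pass 2: test every cluster against the pyramid; draw only the clusters
    // that are visible now but were not already drawn in pass 1 (the newly
    // disoccluded ones). The result becomes next frame's starting point.
    pass2Draws.clear();
    for (uint32_t c = 0; c < clusterCount; ++c) {
        const bool visibleNow = !occludedByHiZ(c);
        if (visibleNow && !state.visibleLastFrame[c])
            pass2Draws.push_back(c);
        state.visibleLastFrame[c] = visibleNow;
    }
}
```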
As Brian Karis said in the Nanite presentations, precise occlusion culling is crucial for high-density kit-bashed content. Perfect LOD + fine-grained culling = fixed triangle raster cost. Our game had a UGC focus (non-professional content); Epic wants to make artists' lives easier.
This screenshot shows culling efficiency when rendering a scene in the center of a 250K-object "asteroid field". As you can see, techniques like this are perfect for partial occlusion, such as object clouds and vegetation. Visibility gets cut as soon as all pixels are filled.
Virtual shadow mapping is important for high-density rendering. It provides close to 1:1 pixel:texel mapping for all areas of the scene, and rejects excess work that would never land on visible pixels. 3.5x faster and much better quality.

Virtual shadow mapping is what makes Nanite's shadows look good!
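A sketch of the 1:1 idea: pick the virtual shadow map mip whose texel footprint matches the world-space footprint of a screen pixel, and only fill the pages that visible pixels request. All constants below are made up:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Made-up constants; the point is only the pixel-footprint-to-mip mapping.
constexpr float    kVirtualShadowMapSize = 16384.0f; // virtual texels at mip 0
constexpr float    kShadowMapWorldExtent = 512.0f;   // world units covered by the map
constexpr uint32_t kMipCount             = 8;

// Pick the mip whose shadow texel is no larger than the world-space footprint
// of one screen pixel, so visible surfaces get at least 1:1 texel:pixel detail.
uint32_t SelectShadowMip(float pixelWorldFootprint)
{
    const float texelWorldSizeMip0 = kShadowMapWorldExtent / kVirtualShadowMapSize;
    const float mip = std::log2(std::max(pixelWorldFootprint / texelWorldSizeMip0, 1.0f));
    return std::min(uint32_t(mip), kMipCount - 1);
}
// Only the shadow map pages that visible pixels touch at their selected mip get
// marked and rendered; the rest of the virtual shadow map is never filled.
```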
We implemented a deferred texturing pipeline with a UV buffer. Nanite chose the V-buffer approach:
jcgt.org/published/0002…

Both techniques defer the material/texture pixel cost to make overdraw super efficient. The trade-offs, however, are different.
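To make the two per-pixel payloads concrete (illustrative layouts, not the shipped formats): the UV buffer stores what the material pass needs to fetch textures directly, while the V-buffer stores only ids and re-derives everything else:

```cpp
#include <cstdint>

// Deferred texturing / UV buffer (the 2015 tech): virtual texture UVs, tangent
// frame and material id are written in the geometry pass.
struct UvBufferPixel {
    uint32_t packedVirtualUv;    // virtual texture UV (gradients stored elsewhere)
    uint32_t packedTangentFrame;
    uint16_t materialId;
    uint16_t pad;
};

// Visibility buffer (Nanite): store only ids; the material pass re-loads the
// triangle, re-interpolates attributes and computes derivatives analytically.
struct VisibilityBufferPixel {
    uint32_t packedClusterAndTriangle; // see the packing sketch further down
};
```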
Both employ a single cheap geometry pass (no Z-prepass). The MSAA trick (rasterizing the buffer at reduced resolution into an MSAA target and treating each sample as a full-resolution pixel) can be combined with either technique and is 100% lossless for the V-buffer. But Epic writes small triangles with a compute shader (software raster), and you can't write to an MSAA target from a compute shader :(
The MSAA trick is also quite hardware specific. On previous-generation consoles you had to read the FMASK directly to decode it quickly. Modern AMD RDNA and Nvidia GPUs can load directly from the MSAA buffer. Hardware compatibility still makes it iffy for a generic engine like UE5.
Both our tech and Nanite lean heavily on virtual texturing. With VT you have a roughly 1:1 memory:pixel footprint for all your textures, independent of scene complexity. This is super important for scenes like this, and makes artists' lives much easier.
We used a deferred texturing UV buffer with virtual texturing to implement single-draw-call rendering. Nanite uses a V-buffer and a tiled material classification pass. Split/Second (Black Rock / Sumo Digital) pioneered this technique on Xbox 360. They were WAY ahead of their time.
Nanite renders a material-id depth buffer and uses hi-Z and early-Z to cull the full-screen material passes. They render a grid and NaN-cull its vertices based on the classification pass. This is similar to Split/Second. The Xbox 360 XPS is replaced with a compute shader, of course :)
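A CPU-side sketch of the classification idea (tile size and mask layout are assumptions, not the shipped implementation):

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kTileSize = 64; // illustrative tile size

// For each screen tile, record which materials appear in it, so the
// per-material full-screen pass can skip every tile that doesn't contain it.
// (Assumes < 64 materials purely to keep the sketch to one 64-bit mask per tile.)
std::vector<uint64_t> ClassifyTiles(const std::vector<uint16_t>& materialIdPerPixel,
                                    uint32_t width, uint32_t height)
{
    const uint32_t tilesX = (width + kTileSize - 1) / kTileSize;
    const uint32_t tilesY = (height + kTileSize - 1) / kTileSize;
    std::vector<uint64_t> tileMasks(size_t(tilesX) * tilesY, 0);

    for (uint32_t y = 0; y < height; ++y)
        for (uint32_t x = 0; x < width; ++x) {
            const uint16_t mat = materialIdPerPixel[size_t(y) * width + x];
            tileMasks[(y / kTileSize) * tilesX + (x / kTileSize)] |= 1ull << (mat % 64);
        }
    return tileMasks;
}

// On the GPU, the grid's vertex shader makes the same per-cell decision and
// outputs NaN positions for cells that don't contain the material; the
// material-id depth buffer + hi-Z/early-Z rejects the remaining wrong pixels.
```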
Intel's original V-buffer wasn't feasible because it didn't fit in 32 bits: each instance had a vastly different triangle count. When you combine clustering and early culling, you get a super tight numbering scheme. Nanite uses this kind of V-buffer.

forum.beyond3d.com/threads/modern…
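The packing this tight numbering enables looks roughly like this. The 7-bit triangle index follows from 128-triangle clusters; the exact bit split is an assumption, not a dump of the shipped format:

```cpp
#include <cstdint>

// 7 bits address one of the 128 triangles in a cluster; the remaining bits
// index into the post-culling visible-cluster list, which is compact because
// only surviving clusters get numbered.
constexpr uint32_t kTriangleBits = 7;
constexpr uint32_t kTriangleMask = (1u << kTriangleBits) - 1;

uint32_t PackVisibility(uint32_t visibleClusterIndex, uint32_t triangleIndex)
{
    return (visibleClusterIndex << kTriangleBits) | (triangleIndex & kTriangleMask);
}

void UnpackVisibility(uint32_t packed, uint32_t& visibleClusterIndex, uint32_t& triangleIndex)
{
    triangleIndex       = packed & kTriangleMask;
    visibleClusterIndex = packed >> kTriangleBits;
}
```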
The biggest achievement in Nanite is the way they combine all of this tech: great ideas collected in a single, well-optimized product. Their automatic cluster LOD is a massive improvement. Their software triangle raster beats hardware by 3x for small triangles.
My analysis of Nanite's perceived "weaknesses":
Now let's analyze the weaknesses of our 2015 GPU-driven tech.

The biggest one by far was the LOD solution. We had GPU-driven LOD, but only at instance granularity and with no tooling support for artists. Artists had to manually author all LODs by hand. We didn't even have offline tools for this.
I remember our artists exporting a million-instance rock cloud from Houdini and adding it on top of our terrain. It was a pain to get running well. They only had time to author 3 LODs for the rocks; that's nowhere near enough for results similar to Nanite.
Our LOD selection was based on the object center point, and our artists preferred very small hand-authored objects. And they instanced a lot of them. There were no Megascans etc. available back then. The problem with small instances is that you hit 1 cluster/object pretty soon.
There's no way to LOD below 1 cluster. You need to render all of those 64 triangles (or 128 in Nanite) even if the object covers just a couple of pixels on screen. This is less of a problem for Nanite, as the demos use larger Megascans assets that cover more space.
Also, our tech was running on AMD GCN2-based consoles, and GCN2 is notoriously bad at geometry processing (VS max occupancy = 2). Nanite runs on RDNA2 and has a software rasterizer for small triangles. This is a massive advantage for 1-cluster "min LOD" instances.
Nanite has per-cluster LOD with lots of autogenerated LOD levels. They have a more performant SW rasterizer for small triangles and a better, modern GPU. Additionally, their content uses larger scanned meshes covering more space instead of a massive amount of small instances.
Simple mesh instances don't really LOD well, not even with their advanced cluster LOD. So if your content is a massively instanced soup of these, you will not get 1:1 triangle:pixel. SW raster helps here too, of course; HW raster is VERY bad at subpixel triangles.
I am interested in knowing what kind of improvements Brian was talking about when he said they are planning to improve performance in scenes with a massive amount of instances. Are they just making the fixed costs smaller, or doing some sort of instance merging?
