Thread by @stefan_3d on Thread Reader App

Back of the envelope calculation:
RTX 2080Ti: 10GRay/s @ 616GB/s mem bandwidth = 61 bytes/Ray
1 triangle, 3x 32 bit float vertices padded to 16 bytes each: 48 bytes (36 bytes if tightly packed as float3)
61 - 48 = 13 bytes left for BVH traversal

That assumes an ideal BVH that requires only 1 ray/triangle intersection test per ray.
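A minimal sketch of the budget arithmetic above, using only the figures quoted in the thread. The function names are made up for illustration, and the 16-byte vertex stride is an assumption chosen because it reproduces the 48-byte triangle; a tightly packed float3 layout would be 12 bytes per vertex.

```python
# Figures quoted in the thread.
RAYS_PER_SECOND = 10e9        # 10 GRay/s
MEM_BANDWIDTH_BYTES = 616e9   # 616 GB/s


def bytes_per_ray(bandwidth_bytes_per_s, rays_per_s):
    """Global memory bytes available for each ray at the given ray rate."""
    return bandwidth_bytes_per_s / rays_per_s


def triangle_bytes(vertex_stride_bytes=16, vertices=3):
    """Size of one triangle; the 16-byte stride (float3 padded) is an assumption."""
    return vertex_stride_bytes * vertices


budget = bytes_per_ray(MEM_BANDWIDTH_BYTES, RAYS_PER_SECOND)
tri = triangle_bytes()
print(f"budget per ray:      {budget:.1f} bytes")
print(f"one triangle:        {tri} bytes")
print(f"left for traversal:  {budget - tri:.1f} bytes")
print(f"left if tight float3: {budget - triangle_bytes(12):.1f} bytes")
```

Either way, the remainder is far smaller than a single 80-byte BVH node.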

Compressed wide BVH (research.nvidia.com/sites/default/…) requires 80 bytes per BVH node. A balanced BVH8 over 1 million triangles is 7 levels deep, so we're looking at 80 bytes * 7 = 560 bytes of processed data per ray. Times 10 GRays/s = 5.6 TB/s of bandwidth just for BVH traversal.
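The traversal estimate can be reproduced with a few lines. This is a rough sketch under simplifying assumptions that are not from the paper: one triangle per leaf, exactly one node fetched per level, and no reuse between rays. The helper names are hypothetical.

```python
NODE_BYTES = 80          # per-node size quoted for the compressed wide BVH
RAYS_PER_SECOND = 10e9   # 10 GRays/s


def bvh_depth(num_triangles, branching=8):
    """Smallest depth d with branching**d >= num_triangles (integer math only)."""
    depth, capacity = 0, 1
    while capacity < num_triangles:
        capacity *= branching
        depth += 1
    return depth


def traversal_bytes_per_ray(num_triangles, node_bytes=NODE_BYTES, branching=8):
    """Bytes touched per ray, assuming one node is read per level."""
    return node_bytes * bvh_depth(num_triangles, branching)


depth = bvh_depth(1_000_000)
per_ray = traversal_bytes_per_ray(1_000_000)
print(f"depth: {depth} levels")
print(f"traversal data: {per_ray} bytes/ray")
print(f"bandwidth: {per_ray * RAYS_PER_SECOND / 1e12:.1f} TB/s")
```

Real rays usually visit more than one node per level, so this is closer to a lower bound than an average.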

#Volta #V100 has 12-14TB/s shared memory bandwidth (arxiv.org/pdf/1804.06826…), so 10GRays/s are plausible if most of the data fits in L1 cache/shared mem.
V100 has 80 SMs with 128KB L1/shared mem each, a total of 10MB. 10MB isn't enough to fit a 7-level-deep BVH8.
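To put numbers on the capacity claim, here is a hedged sketch comparing the summed L1/shared memory with a minimal BVH8 over 1 million triangles. It assumes one triangle per leaf, 80-byte nodes, and 48-byte triangles; the node count is only a lower bound (a tree with n leaves and at most k children per node needs at least ceil((n-1)/(k-1)) internal nodes). All function names are illustrative.

```python
KIB = 1024


def l1_total_bytes(num_sms, kib_per_sm):
    """L1/shared memory summed over all SMs."""
    return num_sms * kib_per_sm * KIB


def min_internal_nodes(num_leaves, branching):
    """Lower bound on internal nodes: ceil((n - 1) / (k - 1)), in integer math."""
    return -(-(num_leaves - 1) // (branching - 1))


def scene_bytes(num_triangles, node_bytes=80, triangle_bytes=48, branching=8):
    """BVH nodes plus triangle data, assuming one triangle per leaf."""
    nodes = min_internal_nodes(num_triangles, branching)
    return nodes * node_bytes + num_triangles * triangle_bytes


l1 = l1_total_bytes(80, 128)
nodes_only = min_internal_nodes(1_000_000, 8) * 80
scene = scene_bytes(1_000_000)
print(f"total L1/shared: {l1 / 1e6:.1f} MB")
print(f"BVH nodes only:  {nodes_only / 1e6:.1f} MB")
print(f"nodes + tris:    {scene / 1e6:.1f} MB")
```

Summing L1 across SMs is itself generous, since each SM caches independently and cannot read another SM's L1.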

Global mem bandwidth is already exhausted with triangle data, and as the die size of #Turing is claimed to be smaller than #Volta at the same process, I don't expect there to be much room for more L1 mem.

I may very well have some errors in my calculations, so please point them out if you see them.
If I'm right though, either @nvidia has some tricks up their sleeve that I don't know of, or 10 GRays/s are only possible with small data sets and coherent rays.

And real-world performance will be limited by memory bandwidth very quickly, leaving a lot of the ray/triangle intersection hardware idle.
But I'm coming from a film standpoint, maybe for game ray tracing, mesh sizes in the 10k poly range are plenty.

Now ignoring cache size, incoherent global mem access penalty etc. and blindly calculating with a 95% cache hit rate:
research.nvidia.com/sites/default/… claims 1-6kB of mem traffic per ray in Figure 3 - let's say 2kB avg. At 5% cache miss, that's about 100 bytes/ray.

That's still significantly more than the 61 bytes/ray that our global memory budget allows under ideal conditions.
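The cache-miss estimate can be turned around to ask what ray rate the memory bus could sustain. A small sketch, assuming 2 kB means 2000 bytes and that every miss goes straight to global memory; the names are hypothetical and the result is only as good as the 95% hit-rate guess.

```python
MEM_BANDWIDTH_BYTES = 616e9   # 616 GB/s, as quoted earlier in the thread


def dram_bytes_per_ray(traffic_bytes_per_ray, miss_rate):
    """Bytes per ray that miss the cache and reach global memory."""
    return traffic_bytes_per_ray * miss_rate


def sustainable_rays_per_second(bandwidth_bytes_per_s, bytes_per_ray):
    """Ray rate at which global memory bandwidth is fully used."""
    return bandwidth_bytes_per_s / bytes_per_ray


miss_bytes = dram_bytes_per_ray(2000, 0.05)
rate = sustainable_rays_per_second(MEM_BANDWIDTH_BYTES, miss_bytes)
print(f"global memory traffic: {miss_bytes:.0f} bytes/ray")
print(f"bandwidth-limited rate: {rate / 1e9:.2f} GRays/s")
```

At about 100 bytes/ray, the bus tops out at roughly 6 GRays/s before any triangle fetches or shading traffic are counted.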

And obviously, at this point no shading has happened yet. If memory bandwidth is the bottleneck for RTX, then ray differentials and mip mapping are a must for optimal performance.

Don't get me wrong, I'm thrilled about the wide availability of ray tracing hardware, and hope that soon we'll see this from other hardware vendors. Can't wait to have hardware under my fingers.
