Let me explain the PCI-E resizable BAR: docs.microsoft.com/en-us/windows-…

Why is it better to get full GPU memory visible from CPU side compared to a small 256 MB region?

Thread...
Traditionally, people allocate an upload heap, which is CPU system memory visible to the GPU.

The CPU writes data there, and the GPU can directly read the data over PCI-E bus. Recently I measured 28 GB/s GPU read bandwidth from CPU system memory over PCI-E 4.0.
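Here's a rough sketch (Rust + the ash Vulkan bindings, since my prototype is Rust/Vulkan) of what an upload heap boils down to: pick a HOST_VISIBLE memory type and map it so the CPU can write through the pointer. The device, memory properties and buffer are assumed to come from your existing setup code; this is an illustration, not production allocator code.

```rust
use ash::vk;

/// Minimal sketch: pick a HOST_VISIBLE | HOST_COHERENT memory type and map it,
/// i.e. the Vulkan equivalent of a D3D12 upload heap. Assumes `device`,
/// `mem_props` and `buffer` come from existing setup code.
unsafe fn create_upload_allocation(
    device: &ash::Device,
    mem_props: &vk::PhysicalDeviceMemoryProperties,
    buffer: vk::Buffer,
) -> Result<(vk::DeviceMemory, *mut u8), vk::Result> {
    let reqs = device.get_buffer_memory_requirements(buffer);
    let wanted = vk::MemoryPropertyFlags::HOST_VISIBLE | vk::MemoryPropertyFlags::HOST_COHERENT;

    // Find a memory type that is both allowed by the buffer and CPU visible.
    let type_index = (0..mem_props.memory_type_count)
        .find(|&i| {
            (reqs.memory_type_bits & (1u32 << i)) != 0
                && mem_props.memory_types[i as usize].property_flags.contains(wanted)
        })
        .ok_or(vk::Result::ERROR_FEATURE_NOT_PRESENT)?;

    let memory = device.allocate_memory(
        &vk::MemoryAllocateInfo {
            allocation_size: reqs.size,
            memory_type_index: type_index,
            ..Default::default()
        },
        None,
    )?;
    device.bind_buffer_memory(buffer, memory, 0)?;

    // The CPU writes through this pointer; the GPU reads the data over PCI-E.
    let ptr = device.map_memory(memory, 0, reqs.size, vk::MemoryMapFlags::empty())? as *mut u8;
    Ok((memory, ptr))
}
```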
The two most common use cases are:

1. Dynamic data: CPU writes to upload heap. GPU reads it from there directly in pixel/vertex/compute shader. Examples: constant buffers, dynamic vertex data...

2. Static data: CPU writes to upload heap. GPU timeline copy to GPU resource.
Reading from CPU memory on the GPU side has limited bandwidth: around 28 GB/s over PCI-E 4.0 and 14 GB/s over PCI-E 3.0. Thus for all persistent GPU data, you want to copy it from the upload heap (in CPU memory) to GPU memory (200-1000 GB/s). Only keep single-use data in the upload heap.
Why don't you copy single-use data to GPU memory too? The copy operation is restricted to the same PCI-E bus bandwidth, so the copy would likely take about as long as reading the data directly in the shader. There's no gain here. Only copy data you are using repeatedly...
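For the static data path, the sketch below records that copy on the GPU timeline from the staging/upload buffer into a DEVICE_LOCAL buffer. Assumptions: the command buffer is already in the recording state and both buffers were created and bound elsewhere.

```rust
use ash::vk;

/// Record a GPU-timeline copy from a staging (upload heap) buffer into a
/// DEVICE_LOCAL buffer. `cmd` is assumed to be a command buffer in the
/// recording state; both buffers are assumed to be created/bound elsewhere.
unsafe fn record_static_data_upload(
    device: &ash::Device,
    cmd: vk::CommandBuffer,
    staging: vk::Buffer,
    gpu_local: vk::Buffer,
    size: vk::DeviceSize,
) {
    let region = vk::BufferCopy { src_offset: 0, dst_offset: 0, size };
    // The copy crosses PCI-E once; every later read hits fast GPU memory.
    device.cmd_copy_buffer(cmd, staging, gpu_local, &[region]);

    // Make the copied data visible to subsequent shader reads.
    let barrier = vk::BufferMemoryBarrier {
        src_access_mask: vk::AccessFlags::TRANSFER_WRITE,
        dst_access_mask: vk::AccessFlags::SHADER_READ,
        src_queue_family_index: vk::QUEUE_FAMILY_IGNORED,
        dst_queue_family_index: vk::QUEUE_FAMILY_IGNORED,
        buffer: gpu_local,
        offset: 0,
        size,
        ..Default::default()
    };
    device.cmd_pipeline_barrier(
        cmd,
        vk::PipelineStageFlags::TRANSFER,
        vk::PipelineStageFlags::ALL_COMMANDS,
        vk::DependencyFlags::empty(),
        &[],
        &[barrier],
        &[],
    );
}
```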
GPUs have long exposed a 256 MB pool of CPU-visible memory. While this pool is great for small dynamic data, it doesn't remove the extra copy we must do for static/reusable data: 256 MB is way too small for big static data such as textures.
The biggest gain from RBAR is thus that we can access the whole GPU memory from the CPU side, meaning we can write persistent data directly from the CPU to GPU memory without an additional copy. The data written by the CPU is immediately optimal for the GPU to access!
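In Vulkan terms this means preferring a memory type that is both DEVICE_LOCAL and HOST_VISIBLE. Without RBAR such a type is limited to the 256 MB window; with RBAR it can cover the whole GPU heap. A rough sketch follows; the heap-size check is just a heuristic I'm using here, not an official RBAR query, and a real allocator still needs the staging-copy fallback.

```rust
use ash::vk;

/// Sketch: prefer a memory type that is both DEVICE_LOCAL and HOST_VISIBLE.
/// With Resizable BAR enabled, such a type can cover the whole GPU heap
/// instead of the traditional 256 MB window, so the CPU can write persistent
/// data straight into GPU memory with no extra copy.
fn find_rebar_memory_type(
    mem_props: &vk::PhysicalDeviceMemoryProperties,
    memory_type_bits: u32,
) -> Option<u32> {
    let wanted = vk::MemoryPropertyFlags::DEVICE_LOCAL
        | vk::MemoryPropertyFlags::HOST_VISIBLE
        | vk::MemoryPropertyFlags::HOST_COHERENT;

    (0..mem_props.memory_type_count).find(|&i| {
        let t = mem_props.memory_types[i as usize];
        let heap = mem_props.memory_heaps[t.heap_index as usize];
        (memory_type_bits & (1u32 << i)) != 0
            && t.property_flags.contains(wanted)
            // Heuristic: a CPU-visible device heap much larger than 256 MB
            // suggests Resizable BAR is active. Fall back to the upload heap
            // + copy path when no such type exists.
            && heap.size > 256 * 1024 * 1024
    })
}
```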
In some corner cases we also see benefits for dynamic data. If the dynamic data is read once and is cache friendly, there's no difference. However, if the dynamic data access pattern is not optimal, the GPU might need to load the same cache line multiple times over PCI-E.
RBAR helps for non-cache-friendly dynamic data (such as skinning matrices). Since you can fit all your dynamic data in GPU memory (without having to do gymnastics with the 256 MB pool), you guarantee that accessing that data is fast even when the memory access pattern is not optimal.
I recently made the mistake of putting my 1 million cubes' index buffer in GPU-visible CPU memory (a memory type similar to a common upload heap). I was ONLY reading the index buffer from CPU memory, and even on PCI-E 4.0 this bottlenecked the rendering. Uploading it to GPU memory made it 5x faster.
This result shows that AMD's claims of "Smart Access Memory" providing up to 11% gains are completely reasonable. There are cases where PCI-E memory access is the bottleneck. Even if the throughput is not the bottleneck, the added latency causes stalls, which add up.
The benefit is largest for frames where you upload a lot of new content and avoid the GPU copies. This doesn't affect the average frame rate that much; it reduces stalls/judder and improves the minimum frame rate.
However, if you can avoid direct upload heap reads in shaders, you will see some performance gains even in frames where no loading happens. 11% is not unreasonable here at all, especially if those loads are not cache friendly. Even a constant buffer load over PCI-E has a small impact.
I can measure this once Nvidia releases their new driver that exposes RBAR on Vulkan.
If you intend to write texture data directly to GPU memory using RBAR, you need to use standard swizzle.

docs.microsoft.com/en-us/windows/…

So far there's no Vulkan support for standard swizzle.

More from @SebAaltonen

13 Nov
With resizable BAR support getting more adoption, standard swizzle is becoming more important too:
docs.microsoft.com/en-us/windows/…

With standard swizzle, you can do the swizzle on CPU side (even store swizzled textures on disk) and write to GPU memory directly without a copy.
I am just wondering how good the standard swizzle support is nowadays. AFAIK only Intel supported this feature in the beginning. What's the current situation? Is it supported on Nvidia and AMD? If yes, is it fast?
If the optimal tiling layout is 5%+ faster than standard swizzle, then there's no point in using it; just pay the GPU copy cost for better runtime performance. But if the cost is tiny, then simply use RBAR memory for everything :)
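To illustrate the idea (this is NOT the exact D3D12 standard swizzle bit pattern, just a simple Morton/Z-order swizzle), here's what swizzling on the CPU looks like: the tiled address is computed up front so the texel data can be written straight into GPU memory (e.g. through an RBAR mapping) in its final layout.

```rust
/// Illustration only: a Morton (Z-order) swizzle, not the D3D12 standard
/// swizzle pattern. It shows the general idea: compute the tiled address on
/// the CPU so texel data can be written directly into GPU memory pre-tiled.
fn morton_encode_2d(x: u32, y: u32) -> u32 {
    fn spread_bits(mut v: u32) -> u32 {
        // Spread the low 16 bits of v so they occupy every other bit.
        v &= 0x0000_ffff;
        v = (v | (v << 8)) & 0x00ff_00ff;
        v = (v | (v << 4)) & 0x0f0f_0f0f;
        v = (v | (v << 2)) & 0x3333_3333;
        v = (v | (v << 1)) & 0x5555_5555;
        v
    }
    spread_bits(x) | (spread_bits(y) << 1)
}

/// Swizzle a tightly packed RGBA8 image into Morton order on the CPU.
/// `width` and `height` are assumed equal and a power of two for simplicity.
fn swizzle_rgba8(src: &[u32], width: u32, height: u32) -> Vec<u32> {
    let mut dst = vec![0u32; (width * height) as usize];
    for y in 0..height {
        for x in 0..width {
            let linear = (y * width + x) as usize;
            let swizzled = morton_encode_2d(x, y) as usize;
            dst[swizzled] = src[linear];
        }
    }
    dst
}
```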
12 Nov
Going to implement GPU-driven occlusion culling in my SDF cube renderer next. This is an important part of the final sparse SDF renderer.

Even in a random sparse cloud (1M cubes) the occlusion eventually wins. No background color is seen.

Thread...

I am going to implement a depth pyramid based approach first.

Would also like to test the new Nvidia extension that eliminates all pixels of a triangle after the first passed one. This way you don't even need a depth pyramid. Just write to visible bitfield in pixel shader.
New GPU culling algorithm:
1. Render last frame's visible list
2. Generate depth pyramid from the Z buffer
3. Do a 2x2 sample test for each instance using gather (refer to my SIGGRAPH 2015 presentation; a CPU-side sketch of this test follows below)
4. Write newly visible instances also to buffer B
5. Render visible list B
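Here's a CPU-side sketch of the step 3 test. The real thing runs as a compute shader using gather; the depth convention (0 = near, 1 = far) and the max-depth pyramid layout here are my assumptions for illustration.

```rust
/// CPU reference of the 2x2 depth-pyramid test. Assumes standard depth
/// (0 = near, 1 = far) and a pyramid where each mip texel stores the MAX
/// (farthest) depth of the region it covers.
struct DepthPyramid {
    /// mips[0] is full resolution; a texel of mip n covers 2^n x 2^n pixels.
    mips: Vec<Vec<f32>>,
    width: u32,
    height: u32,
}

impl DepthPyramid {
    fn sample_max(&self, mip: u32, x: u32, y: u32) -> f32 {
        let w = (self.width >> mip).max(1);
        let h = (self.height >> mip).max(1);
        let x = x.min(w - 1);
        let y = y.min(h - 1);
        self.mips[mip as usize][(y * w + x) as usize]
    }
}

/// Returns true if the instance is certainly occluded. `rect_min`/`rect_max`
/// are the screen-space pixel bounds of the instance and `nearest_depth` is
/// the depth of its closest point (both assumed computed by the caller).
fn is_occluded(
    pyramid: &DepthPyramid,
    rect_min: [u32; 2],
    rect_max: [u32; 2],
    nearest_depth: f32,
) -> bool {
    let size = (rect_max[0] - rect_min[0]).max(rect_max[1] - rect_min[1]).max(1);
    // Smallest mip where the whole rect is guaranteed to fit into 2x2 texels.
    let mip = (32 - (size - 1).leading_zeros()).min(pyramid.mips.len() as u32 - 1);

    let x0 = rect_min[0] >> mip;
    let y0 = rect_min[1] >> mip;

    // 2x2 "gather" of the farthest depth covering the rect.
    let farthest = pyramid
        .sample_max(mip, x0, y0)
        .max(pyramid.sample_max(mip, x0 + 1, y0))
        .max(pyramid.sample_max(mip, x0, y0 + 1))
        .max(pyramid.sample_max(mip, x0 + 1, y0 + 1));

    // If even the instance's closest point is behind the farthest depth of
    // the covered region, nothing of the instance can be visible.
    nearest_depth > farthest
}
```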
12 Nov
What should I implement next in my Rust Vulkan prototype?
It has plenty of occlusion potential, even though it's a sparse asteroid field of 1 million instances. Should be able to cull 90%+ easily...
I need the occlusion culling for efficient rendering of the sparse volume. Otherwise the brick raster results in overdraw. However the backfaces of SDF bricks terminate the root finding immediately as the ray starts from the inside. Could early out normal calc too...
12 Nov
Most people missed this presentation by @LottesTimothy at GDC 2019:

gpuopen.com/gdc-presentati…

It was about image kernels and their memory access patterns. Filled with GCN architecture specifics, but the most noteworthy detail was the LDS sliding window algorithm.

Thread...
Blur kernels are very popular, and the most annoying part of writing one is avoiding fetching the neighborhood again and again. Tiny changes in execution order can have a massive effect on cache utilization. The problem is especially tricky in separable X/Y Gaussian blurs.
A naive separable Gaussian blur fetches a long strip along the X axis, and every pixel does the same. Pixels at rows Y and Y+n share zero input pixels with each other; pixels along the X axis share inputs. But if the kernel is wide enough, it's hard to keep all of that data reliably in caches.
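Here's a CPU sketch of the sliding-window idea: one row tile stands in for a workgroup and a small local buffer stands in for LDS, so each input pixel is fetched from memory once per tile and every tap reads the local copy. This just shows the access pattern; it is not Timothy Lottes' exact GCN/LDS implementation.

```rust
/// CPU sketch of a windowed separable horizontal blur. The "workgroup"
/// preloads its input strip (tile + apron) into a local buffer once, then all
/// kernel taps read from that buffer instead of re-fetching memory.
fn blur_row_tile(
    src_row: &[f32],     // one full image row
    dst_row: &mut [f32], // output row (same length)
    tile_start: usize,   // first pixel produced by this tile
    tile_width: usize,   // pixels produced by this "workgroup"
    weights: &[f32],     // symmetric kernel, weights[0] is the center tap
) {
    let radius = weights.len() - 1;
    let width = src_row.len();

    // Load tile_width + 2*radius input pixels once (clamped at the borders),
    // like a workgroup preloading its strip into LDS.
    let local: Vec<f32> = (0..tile_width + 2 * radius)
        .map(|i| {
            let x = (tile_start + i).saturating_sub(radius).min(width - 1);
            src_row[x]
        })
        .collect();

    // Every tap now reads the local buffer; no redundant memory fetches.
    for i in 0..tile_width {
        let center = i + radius;
        let mut sum = local[center] * weights[0];
        for (r, &w) in weights.iter().enumerate().skip(1) {
            sum += (local[center - r] + local[center + r]) * w;
        }
        if tile_start + i < width {
            dst_row[tile_start + i] = sum;
        }
    }
}
```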
11 Nov
1 million SDF cubes (950 MB SDF volume). Unoptimized baseline benchmark. Before the sparse data structure.

Running time is around 150% when captured in NSight (9.8 ms without it).

Some analysis below...
PCI-E 4.0 bandwidth usage is now 6.9%. This was a bottleneck when my index buffer was in system memory; I measured 28 GB/s of PCI-E 4.0 bandwidth with that setup. Impressive numbers, but moving this buffer to GPU memory made the simple cube test (no sphere trace) 5x faster...
L1$ hit rate hovers at around 90%, and L2$ hit rate is around 60%. The first version didn't have any volume texture mip levels and showed an L1$ hit rate of 30% and an L2$ hit rate of 20%.

The performance got 3x faster on RTX 3090 after that change. I can't see any visual difference.
11 Nov
Spent a day refactoring the GPU memory code of my Vulkan prototype: added AMD's VMA and introduced new helper functions. Works perfectly on the Intel GPU...

Nvidia GPU:
error: process didn't exit successfully: `target\debug\rust_test.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)
No Vulkan validation layer warnings/errors. Need to start debugging what's wrong in my code :)
It was my bug... I wrote past the mapped GPU memory region. A simple copy-paste error, of course :D
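A hypothetical guard that would have caught it: keep the mapped pointer and its size together and bounds-check every write, so overrunning the mapped region fails loudly instead of corrupting memory. (This wrapper is my own illustration, not part of VMA or my prototype.)

```rust
/// Hypothetical guard: a mapped GPU memory region with bounds-checked writes.
struct MappedRegion {
    ptr: *mut u8,
    size: usize,
}

impl MappedRegion {
    /// Copy `data` into the mapped region at `offset`, panicking on overflow
    /// instead of trampling whatever lives after the allocation.
    fn write_bytes(&self, offset: usize, data: &[u8]) {
        assert!(
            offset.checked_add(data.len()).map_or(false, |end| end <= self.size),
            "write of {} bytes at offset {} overruns mapped region of {} bytes",
            data.len(),
            offset,
            self.size
        );
        unsafe {
            std::ptr::copy_nonoverlapping(data.as_ptr(), self.ptr.add(offset), data.len());
        }
    }
}
```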