What is the best way to provide the GPU with data inputs for each draw call, on every API: Vulkan, Metal and WebGL 2.0 (GLES 3.0)? With some DX12 tidbits in the mix...
Let's start with the basics. Each draw call needs some constant parameters that apply to the whole draw call. Uniform buffers (constant buffers) are the most common way to pass this data. DX12 has root constants too, and Vulkan has push constants...
Let's first separate the discussion about small per-draw temp "parameter" data and big, possibly persistent data such as light arrays, skinning matrix arrays, etc. Both of these inputs can use uniform buffers, but they are conceptually different.
In GLES 3.0+ the most efficient way to pass dynamic per-draw data (uniforms) is to bump allocate all uniforms into a big UBO and offset-bind that big UBO per draw. This way there are no map/unmap calls per draw, which is a big perf advantage.
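A minimal sketch of this big-UBO pattern on GLES 3.0 (the struct, sizes and helper names are my assumptions, not the author's code). All per-draw uniforms are written through one mapped pointer; the draws later just offset-bind into the same buffer, so there are no map/unmap calls per draw:

```cpp
#include <GLES3/gl3.h>
#include <cstdint>
#include <cstring>

struct DrawUniforms { float world[16]; float color[4]; };

constexpr GLsizeiptr kBigUBOSize = 64 * 1024 * 1024;
GLuint   gBigUBO     = 0;        // created once with glBufferData(..., nullptr, ...)
uint8_t* gMapped     = nullptr;
size_t   gBumpOffset = 0;
GLint    gAlignment  = 256;      // query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT in real code

void beginFrame()
{
    glBindBuffer(GL_UNIFORM_BUFFER, gBigUBO);
    // Unsynchronized map: the caller must guarantee (fences / N-buffering)
    // that the GPU is no longer reading the region being overwritten.
    gMapped = (uint8_t*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, kBigUBOSize,
                                         GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    gBumpOffset = 0;
}

// Called while building the frame: copies uniforms, returns the bind offset.
size_t writeDrawUniforms(const DrawUniforms& u)
{
    size_t offset = (gBumpOffset + gAlignment - 1) & ~(size_t(gAlignment) - 1);
    memcpy(gMapped + offset, &u, sizeof(u));
    gBumpOffset = offset + sizeof(u);
    return offset;
}

// Called during playback, after glUnmapBuffer(GL_UNIFORM_BUFFER):
void bindDrawUniforms(size_t offset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/0, gBigUBO,
                      (GLintptr)offset, sizeof(DrawUniforms));
}
```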
WebGL 2.0 doesn't have map/unmap, so client code can't directly write through a pointer into a big bump-allocated temp buffer. Instead we allocate a CPU buffer and call glBufferData once for the big buffer to upload it to the GPU. You offset-bind like in GLES 3.0.
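The WebGL 2.0 variant, sketched in GLES-style C++ (in JS the calls would be gl.bufferData etc.), reusing the names from the sketch above. There is no mapping, so we bump allocate into a CPU array and upload the used range once per frame before playing the draws back:

```cpp
#include <vector>

std::vector<uint8_t> gCpuStaging(kBigUBOSize);

size_t writeDrawUniformsWebGL(const DrawUniforms& u)
{
    size_t offset = (gBumpOffset + gAlignment - 1) & ~(size_t(gAlignment) - 1);
    memcpy(gCpuStaging.data() + offset, &u, sizeof(u));
    gBumpOffset = offset + sizeof(u);
    return offset;
}

void uploadFrameUniforms()   // one upload per frame, not per draw
{
    glBindBuffer(GL_UNIFORM_BUFFER, gBigUBO);
    glBufferData(GL_UNIFORM_BUFFER, (GLsizeiptr)gBumpOffset,
                 gCpuStaging.data(), GL_STREAM_DRAW);
}
```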
Metal has efficient offset binding too. You can change the offset of a bound buffer with the setVertexBufferOffset / setFragmentBufferOffset APIs.
Metal buffers are persistently mapped, so you store a CPU pointer to GPU-visible data once and use it every frame to write directly from user-land code.
The draw loop then only needs a setVertexBufferOffset / setFragmentBufferOffset call per draw to point at the new draw call's uniforms.
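A minimal metal-cpp sketch of this (the struct, binding index and 256-byte alignment are my assumptions). The buffer's contents() pointer is always valid, so user-land code writes uniforms straight into GPU-visible memory, and the encoder only needs an offset update per draw:

```cpp
#include <Metal/Metal.hpp>   // metal-cpp
#include <cstring>

struct DrawUniforms { float world[16]; float color[4]; };

void recordDraws(MTL::RenderCommandEncoder* enc, MTL::Buffer* bigBuffer,
                 const DrawUniforms* draws, size_t drawCount)
{
    uint8_t* cpu = (uint8_t*)bigBuffer->contents();   // persistent, no map/unmap
    enc->setVertexBuffer(bigBuffer, 0, /*index*/1);   // bind the big buffer once

    size_t offset = 0;
    for (size_t i = 0; i < drawCount; ++i) {
        offset = (offset + 255) & ~size_t(255);       // conservative alignment assumption
        memcpy(cpu + offset, &draws[i], sizeof(DrawUniforms));
        enc->setVertexBufferOffset(offset, /*index*/1);   // the only per-draw binding call
        enc->drawPrimitives(MTL::PrimitiveTypeTriangle, NS::UInteger(0), NS::UInteger(3));
        offset += sizeof(DrawUniforms);
    }
}
```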
This kind of persistent mapping or map/unmap pointer allows you to emit draw calls from multiple threads and write their data to that GPU visible pointer directly.
Metal has other ways to write constants too. You can embed them in the command buffer, or embed them in an argument buffer directly (no need to access through a pointer)...
Metal's setVertex/FragmentBytes has the flaw that it requires an extra copy. With a persistently mapped, bump-allocated temp buffer, you can write to GPU-visible memory directly without a copy, so that's preferable. Just change the binding offset per draw.
Embedding uniforms directly in an argument buffer requires you to write the argument buffer dynamically. You either need to double buffer the argument buffers (to avoid a CPU<->GPU race) or bump allocate them in temp memory.
In Metal 2.0 you need an MTLArgumentEncoder to write to argument buffers, so it's a bit hard to write dynamic uniforms there directly from user-land code. It is basically just raw memory, but we can't access it that way. Thus we get an extra copy due to the abstraction.
Metal 3.0 apparently makes argument buffers act much more like raw memory, so we might be able to write them directly from user-land code too. But Metal 3.0 doesn't work on iPhone 6s / iPad Air 2, so we can't commit to it yet. Only for the newest devices.
Vulkan is next: Vulkan also supports persistently mapped buffers. Write all per-draw data (bump allocated into a big buffer) directly to the CPU-visible GPU pointer from multiple threads. Vulkan descriptor sets support a dynamic buffer binding type, meaning that you can change the offset.
In Vulkan there are no independent buffer bindings; everything is inside descriptor sets. And the Android min-spec limit is 4 descriptor sets. Also, Vulkan requires you to rebind the whole descriptor set in order to change the dynamic buffer offsets...
Our approach is to separate the offset-bound dynamic data into its own descriptor set. This set has only a couple of offset-bound buffers. This means that only 3 descriptor sets are left for persistent data (at different bind frequencies). More about that later.
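A sketch of such a dedicated dynamic-data descriptor set (the set index, stage flags and helper names are my assumptions). The set is created once and points at the big persistently mapped buffer; per draw we only rebind it with a new dynamic offset:

```cpp
#include <vulkan/vulkan.h>

VkDescriptorSetLayout createDynamicSetLayout(VkDevice device)
{
    VkDescriptorSetLayoutBinding binding{};
    binding.binding         = 0;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
    binding.descriptorCount = 1;
    binding.stageFlags      = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;

    VkDescriptorSetLayoutCreateInfo info{VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO};
    info.bindingCount = 1;
    info.pBindings    = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
    return layout;
}

// Per draw: Vulkan requires rebinding the whole set to change the offset.
void bindDrawUniforms(VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                      VkDescriptorSet dynamicSet, uint32_t dynamicOffset)
{
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                            /*firstSet*/0, 1, &dynamicSet, 1, &dynamicOffset);
}
```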
An alternative way in Vulkan is to use push constants. This is faster on some GPUs. Push constants behave similarly to DX12 root constants on these devices, and they are loaded directly by the GPU. But there's a size limitation.
On mobile GPUs push constants seem to be mostly emulated by a driver-side bump-allocated uniform buffer. Unlike Metal's setVertexBytes, Vulkan's vkCmdPushConstants allows partial updates of push constants, meaning the driver must keep shadow copies. We likely get an extra copy.
Also, similarly to Metal's setVertexBytes, push constants can't be written directly by user-land code (running in multiple threads) to a GPU-visible memory location. There must be an extra CPU-side copy.
Push constants (and root constants in DX12) are very good for writing small data, such as an object index. In our use case we could use a push constant to provide the dynamic per-draw buffer start offset. This is a good idea on PC and consoles, but not on current mobile GPUs.
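For example, a single uint32 push constant carrying the per-draw data offset could look like this (a sketch; the pipeline layout must declare a matching VkPushConstantRange, and the GLSL side would be `layout(push_constant) uniform Push { uint drawDataOffset; };`):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

void pushDrawDataOffset(VkCommandBuffer cmd, VkPipelineLayout layout, uint32_t drawDataOffset)
{
    vkCmdPushConstants(cmd, layout,
                       VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT,
                       /*offset*/0, sizeof(uint32_t), &drawDataOffset);
}
```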
Also, many Android mobile GPUs prefer uniform buffers (UBOs) over SSBOs. Uniform buffer bindings have a maximum size of 64KB on PC/consoles and 16KB on Android min-spec. Our 64MB temp-allocated UBO can't be visible all at once, so a push constant offset alone doesn't work.
We must use the offset binding API to change the buffer offset. This way we can make any 16KB (or 64KB) region of the big UBO visible per draw call. Push constants unfortunately can't do this. And push constants are emulated on mobile anyway, so that's a second strike against them.
But there's another way to set one parameter per draw. GLSL shaders have the gl_BaseInstance system input, which allows us to pass one uint32 parameter to the shader for free. The draw call has a baseInstance parameter: we provide the uint32 to the draw and the shader sees it directly.
BaseInstance has the same limitations as a uint32 push constant. If you can't use SSBOs (WebGL 2.0, and GLES 3.x due to the ARM Mali driver not supporting SSBO loads in vertex shaders), the UBO size limitation on mobile (16KB) means you can't use it alone as a generic data offset.
But, we are sub-allocating N draws to the same massive UBO using a bump allocator already. Subsequent draws are next to each other in memory. We can change the UBO offset binding every 16KB and use gl_BaseInstance to index inside the 16KB region...
This works fine and provides good performance (the best I measured), since the frequency of API calls to update draw data bindings is only once per 16KB. The problem is that gl_BaseInstance is not supported on all platforms...
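A Vulkan-flavored sketch of the 16KB-window trick (my assumptions: the shaderDrawParameters feature so GLSL sees gl_BaseInstance, and a per-draw stride padded to 256 bytes so it divides the window evenly). The shader would declare a fixed-size uniform array and read draws[gl_BaseInstance]:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

void drawRange(VkCommandBuffer cmd, VkPipelineLayout layout, VkDescriptorSet set,
               uint32_t drawCount, uint32_t indexCount)
{
    constexpr uint32_t kWindow = 16 * 1024;   // Android min-spec UBO binding size
    constexpr uint32_t kStride = 256;         // bump-allocation stride per draw
    uint32_t boundWindow = UINT32_MAX;

    for (uint32_t i = 0; i < drawCount; ++i) {
        uint32_t byteOffset = i * kStride;
        uint32_t window = byteOffset / kWindow;
        if (window != boundWindow) {          // binding API call only once per 16KB
            uint32_t dynOffset = window * kWindow;
            vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                                    0, 1, &set, 1, &dynOffset);
            boundWindow = window;
        }
        uint32_t indexInWindow = (byteOffset % kWindow) / kStride;
        // firstInstance becomes gl_BaseInstance in the shader.
        vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, indexInWindow);
    }
}
```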
Currently we have base instance support only in Metal (A9 and higher), Vulkan, GLES and OpenGL. DX11 and DX12 do not support it. WebGL 2.0 and WebGPU do not support it.
This concludes the per-draw (small + dynamic) inputs. But we are not done. We also must support dynamic lower-frequency inputs for each render pass, and persistent inputs, which we want to delta update on demand. I will write another thread for this topic.
In many old renderers, dynamic draw data was the only data. If you wanted to provide global data such as camera matrices, time, sun light properties, etc., you simply copied it to the same uniform buffer with all the per-draw data, adding a lot of extra CPU->GPU transfers.
For best performance, we want to separate this data from per-draw data and upload it once per render pass (or at some other lower frequency). This means that we have dynamic bump allocated data at two frequencies: draw and pass. Draw data is already discussed in the other thread.
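Continuing the GLES-style sketch from earlier, the two frequencies could be served from the same big UBO via two binding slots (slots and structs are my assumptions); pass data is bound once per render pass, and only the draw-frequency slot changes per draw:

```cpp
struct PassUniforms { float viewProj[16]; float sunDirTime[4]; };

void beginRenderPass(size_t passOffset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/0, gBigUBO,
                      (GLintptr)passOffset, sizeof(PassUniforms));
}

void bindPerDraw(size_t drawOffset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/1, gBigUBO,
                      (GLintptr)drawOffset, sizeof(DrawUniforms));
}
```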
Seems that I have to bite the bullet: build a WebGL 2.0 backend for my new renderer.
This will require some API changes. I hoped I wouldn't have to do it, but WebGPU is still not available. Red on every browser, on both mobile and desktop.
WebGL 2.0 is GLES 3.0 with one major difference: there's no map/unmap at all.
My new renderer exposes persistently mapped buffers, which is the most advanced way to write to buffers. Map/unmap requires you to unmap before you can use the buffer...
But WebGL 2.0 doesn't even have map/unmap. You have to call gl.bufferData every time you want to set buffer data from the CPU, and it replaces the existing data. There's no way to get a pointer to a GPU-visible buffer, even for a short duration.
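One way to keep the renderer's persistent-map interface alive on WebGL 2.0 is a hypothetical emulation layer (sketched in GLES-style C++): "mapping" returns a CPU shadow pointer that is always valid, and a per-frame flush pushes the written range to the GPU:

```cpp
#include <GLES3/gl3.h>
#include <cstdint>
#include <vector>

struct EmulatedMappedBuffer {
    GLuint handle = 0;
    std::vector<uint8_t> shadow;                 // CPU-side shadow copy

    uint8_t* map() { return shadow.data(); }     // behaves like a persistent map

    void flush(size_t offset, size_t size)       // once per frame, before draws
    {
        glBindBuffer(GL_UNIFORM_BUFFER, handle);
        glBufferSubData(GL_UNIFORM_BUFFER, (GLintptr)offset,
                        (GLsizeiptr)size, shadow.data() + offset);
    }
};
```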
It's sad that you can't ask for the maximum buffer alignment in Vulkan. I can ask alignment requirements for SSBO and UBO bind offsets, but not for vertex buffer offsets, for example.
Required alignment for vertex buffers is 256 bytes on Nvidia 2080...
I need to know the maximum buffer alignment requirement. Currently my hack for this is to create a buffer (with no bound memory) with all the usage flags set and query its memory requirements. I am assuming this gives the worst-case alignment (sketch after the list below).
Is there any better way to do it in Vulkan?
There are properties for these:
minMemoryMapAlignment
minStorageBufferOffsetAlignment
minTexelBufferOffsetAlignment
minUniformBufferOffsetAlignment
optimalBufferCopyOffsetAlignment
optimalBufferCopyRowPitchAlignment
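The hack described above, as a sketch: create a throwaway buffer with all the usage flags set, query its memory requirements, and treat the reported alignment as the worst case (an assumption, not a spec guarantee):

```cpp
#include <vulkan/vulkan.h>

VkDeviceSize queryWorstCaseBufferAlignment(VkDevice device)
{
    VkBufferCreateInfo info{VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
    info.size  = 65536;   // arbitrary; only the alignment matters
    info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT |
                 VK_BUFFER_USAGE_UNIFORM_TEXEL_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_TEXEL_BUFFER_BIT |
                 VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                 VK_BUFFER_USAGE_INDEX_BUFFER_BIT | VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
                 VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT;
    info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer probe = VK_NULL_HANDLE;
    vkCreateBuffer(device, &info, nullptr, &probe);   // no memory is ever bound

    VkMemoryRequirements req{};
    vkGetBufferMemoryRequirements(device, probe, &req);
    vkDestroyBuffer(device, probe, nullptr);
    return req.alignment;
}
```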
"i" is the standard variable name for the loop index. I will continue using it to keep my code easy to read.
But when loops get too long (you can't see the loop start on the same screen) you might want to consider extracting the loop body or using a more descriptive variable name.
Also, I don't really like i,j inner loops. If I have more than one level I prefer more descriptive names, unless the loop is just a few lines long and trivial to read. I also use x,y,z in loops that iterate over a spatial data structure such as a texture.
Having a lot of loop levels is a compelling reason to extract code too. Then you can use i in both loops, and properly name the outer i that you pass to the inner function in its signature.
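A hypothetical example of that extraction: "i" stays local to each function, and the outer index gets a descriptive name in the inner function's signature:

```cpp
constexpr int kRowCount    = 256;
constexpr int kColumnCount = 256;

void processTexel(int rowIndex, int columnIndex) { /* ... */ }

void processRow(int rowIndex)
{
    for (int i = 0; i < kColumnCount; ++i)
        processTexel(rowIndex, i);
}

void processTexture()
{
    for (int i = 0; i < kRowCount; ++i)
        processRow(i);
}
```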