What is the best way to provide the GPU with data inputs for each draw call, on every API: Vulkan, Metal and WebGL 2.0 (GLES 3.0)? With some DX12 tidbits in the mix...
Let's start with the basics. Each draw call needs some constant parameters that apply to the whole draw call. Uniform buffers (constant buffers) are the most common way to pass this data. DX12 has root constants too, and Vulkan has push constants...
Let's first separate the discussion about small per-draw temp "parameter" data and big, possibly persistent data such as light arrays, skinning matrix arrays, etc. Both of these inputs can use uniform buffers, but they are conceptually different.
In GLES 3.0+ the most efficient way to pass dynamic per-draw data (uniforms) is to bump allocate all uniforms into a big UBO and offset-bind that big UBO per draw. This way there are no map/unmap calls per draw, which is a big perf advantage.
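A minimal sketch of this big-UBO pattern on GLES 3.0 (the struct, sizes and helper names are my assumptions, not the author's code). All per-draw uniforms are written through one mapped pointer; the draws later just offset-bind into the same buffer, so there are no map/unmap calls per draw:

```cpp
#include <GLES3/gl3.h>
#include <cstdint>
#include <cstring>

struct DrawUniforms { float world[16]; float color[4]; };

constexpr GLsizeiptr kBigUBOSize = 64 * 1024 * 1024;
GLuint   gBigUBO     = 0;        // created once with glBufferData(..., nullptr, ...)
uint8_t* gMapped     = nullptr;
size_t   gBumpOffset = 0;
GLint    gAlignment  = 256;      // query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT in real code

void beginFrame()
{
    glBindBuffer(GL_UNIFORM_BUFFER, gBigUBO);
    // Unsynchronized map: the caller must guarantee (fences / N-buffering)
    // that the GPU is no longer reading the region being overwritten.
    gMapped = (uint8_t*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, kBigUBOSize,
                                         GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    gBumpOffset = 0;
}

// Called while building the frame: copies uniforms, returns the bind offset.
size_t writeDrawUniforms(const DrawUniforms& u)
{
    size_t offset = (gBumpOffset + gAlignment - 1) & ~(size_t(gAlignment) - 1);
    memcpy(gMapped + offset, &u, sizeof(u));
    gBumpOffset = offset + sizeof(u);
    return offset;
}

// Called during playback, after glUnmapBuffer(GL_UNIFORM_BUFFER):
void bindDrawUniforms(size_t offset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/0, gBigUBO,
                      (GLintptr)offset, sizeof(DrawUniforms));
}
```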
WebGL 2.0 doesn't have map/unmap, so client code can't directly write through a pointer into a big bump-allocated temp buffer. Instead we allocate a CPU buffer and call glBufferData once for the big buffer to upload it to the GPU. You offset-bind like in GLES 3.0.
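The WebGL 2.0 variant, sketched in GLES-style C++ (in JS the calls would be gl.bufferData etc.), reusing the names from the sketch above. There is no mapping, so we bump allocate into a CPU array and upload the used range once per frame before playing the draws back:

```cpp
#include <vector>

std::vector<uint8_t> gCpuStaging(kBigUBOSize);

size_t writeDrawUniformsWebGL(const DrawUniforms& u)
{
    size_t offset = (gBumpOffset + gAlignment - 1) & ~(size_t(gAlignment) - 1);
    memcpy(gCpuStaging.data() + offset, &u, sizeof(u));
    gBumpOffset = offset + sizeof(u);
    return offset;
}

void uploadFrameUniforms()   // one upload per frame, not per draw
{
    glBindBuffer(GL_UNIFORM_BUFFER, gBigUBO);
    glBufferData(GL_UNIFORM_BUFFER, (GLsizeiptr)gBumpOffset,
                 gCpuStaging.data(), GL_STREAM_DRAW);
}
```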
Metal has efficient offset binding too. You can change the offset of a bound buffer with the setVertexBufferOffset / setFragmentBufferOffset APIs.
Metal buffers are persistently mapped, so you store a CPU pointer to GPU-visible data once and use it every frame to write directly from user-land code.
The draw loop then only needs a setVertexBufferOffset / setFragmentBufferOffset call per draw to point at the new draw call's uniforms.
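A minimal metal-cpp sketch of this (the struct, binding index and 256-byte alignment are my assumptions). The buffer's contents() pointer is always valid, so user-land code writes uniforms straight into GPU-visible memory, and the encoder only needs an offset update per draw:

```cpp
#include <Metal/Metal.hpp>   // metal-cpp
#include <cstring>

struct DrawUniforms { float world[16]; float color[4]; };

void recordDraws(MTL::RenderCommandEncoder* enc, MTL::Buffer* bigBuffer,
                 const DrawUniforms* draws, size_t drawCount)
{
    uint8_t* cpu = (uint8_t*)bigBuffer->contents();   // persistent, no map/unmap
    enc->setVertexBuffer(bigBuffer, 0, /*index*/1);   // bind the big buffer once

    size_t offset = 0;
    for (size_t i = 0; i < drawCount; ++i) {
        offset = (offset + 255) & ~size_t(255);       // conservative alignment assumption
        memcpy(cpu + offset, &draws[i], sizeof(DrawUniforms));
        enc->setVertexBufferOffset(offset, /*index*/1);   // the only per-draw binding call
        enc->drawPrimitives(MTL::PrimitiveTypeTriangle, NS::UInteger(0), NS::UInteger(3));
        offset += sizeof(DrawUniforms);
    }
}
```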
This kind of persistent mapping or map/unmap pointer allows you to emit draw calls from multiple threads and write their data to that GPU visible pointer directly.
Metal has other ways to write constants too. You can embed them in the command buffer, or embed them in an argument buffer directly (no need to access through a pointer)...
Metal's setVertex/FragmentBytes has the flaw that it requires an extra copy. With a persistently mapped, bump-allocated temp buffer, you can write to GPU-visible memory directly without a copy, so that's preferable. Just change the binding offset per draw.
Embedding uniforms directly in an argument buffer requires you to write the argument buffer dynamically. You either need to double buffer the argument buffers (to avoid a CPU<->GPU race) or bump allocate them in temp memory.
In Metal 2.0 you need an MTLArgumentEncoder to write to argument buffers, so it's a bit hard to write dynamic uniforms there directly from user-land code. It is basically just raw memory, but we can't access it that way. Thus we get an extra copy due to the abstraction.
Metal 3.0 apparently makes argument buffers act much more like raw memory, so we might be able to write them directly from user-land code too. But Metal 3.0 doesn't work on iPhone 6s / iPad Air 2, so we can't commit to it yet. Only for the newest devices.
Vulkan is next: Vulkan also supports persistently mapped buffers. Write all per-draw data (bump allocated into a big buffer) directly to the CPU-visible GPU pointer from multiple threads. Vulkan descriptor sets support a dynamic buffer binding type, meaning that you can change the offset.
In Vulkan there are no independent buffer bindings; everything is inside descriptor sets. And the Android min-spec limit is 4 descriptor sets. Also, Vulkan requires you to rebind the whole descriptor set in order to change the dynamic buffer offsets...
Our approach is to separate the offset-bound dynamic data into its own descriptor set. This set has only a couple of offset-bound buffers. This means that only 3 descriptor sets are left for persistent data (at different bind frequencies). More about that later.
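A sketch of such a dedicated dynamic-data descriptor set (the set index, stage flags and helper names are my assumptions). The set is created once and points at the big persistently mapped buffer; per draw we only rebind it with a new dynamic offset:

```cpp
#include <vulkan/vulkan.h>

VkDescriptorSetLayout createDynamicSetLayout(VkDevice device)
{
    VkDescriptorSetLayoutBinding binding{};
    binding.binding         = 0;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
    binding.descriptorCount = 1;
    binding.stageFlags      = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;

    VkDescriptorSetLayoutCreateInfo info{VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO};
    info.bindingCount = 1;
    info.pBindings    = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
    return layout;
}

// Per draw: Vulkan requires rebinding the whole set to change the offset.
void bindDrawUniforms(VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                      VkDescriptorSet dynamicSet, uint32_t dynamicOffset)
{
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                            /*firstSet*/0, 1, &dynamicSet, 1, &dynamicOffset);
}
```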
An alternative way in Vulkan is to use push constants. This is faster on some GPUs. Push constants behave similarly to DX12 root constants on these devices, and they are loaded directly by the GPU. But there's a size limitation.
On mobile GPUs push constants seem to be mostly emulated by a driver-side bump-allocated uniform buffer. Unlike Metal's setVertexBytes, Vulkan's vkCmdPushConstants allows partial updates of push constants, meaning the driver must keep shadow copies. We likely get an extra copy.
Also, similarly to Metal's setVertexBytes, push constants can't be written directly by user-land code (running in multiple threads) to a GPU-visible memory location. There must be an extra CPU-side copy.
Push constants (and root constants in DX12) are very good for writing small data, such as an object index. In our use case we could use a push constant to provide the dynamic per-draw buffer start offset. This is a good idea on PC and consoles, but not on current mobile GPUs.
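For example, a single uint32 push constant carrying the per-draw data offset could look like this (a sketch; the pipeline layout must declare a matching VkPushConstantRange, and the GLSL side would be `layout(push_constant) uniform Push { uint drawDataOffset; };`):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

void pushDrawDataOffset(VkCommandBuffer cmd, VkPipelineLayout layout, uint32_t drawDataOffset)
{
    vkCmdPushConstants(cmd, layout,
                       VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT,
                       /*offset*/0, sizeof(uint32_t), &drawDataOffset);
}
```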
Also, many Android mobile GPUs prefer uniform buffers (UBOs) over SSBOs. Uniform buffer bindings have a maximum size of 64KB on PC/consoles and 16KB on Android min-spec. Our 64MB temp-allocated UBO can't be visible all at once, so a push constant offset alone doesn't work.
We must use the offset binding API to change the buffer offset. This way we can make any 16KB (or 64KB) region of the big UBO visible per draw call. Push constants unfortunately can't do this. And push constants are emulated on mobile anyway, so that's a second strike against them.
But there's another way to set one parameter per draw. GLSL shaders have the gl_BaseInstance system input, which allows us to pass one uint32 parameter to the shader for free. The draw call has a baseInstance parameter: we provide the uint32 to the draw and the shader sees it directly.
BaseInstance has the same limitations as a uint32 push constant. If you can't use SSBOs (WebGL 2.0, and GLES 3.x due to the ARM Mali driver not supporting SSBO loads in vertex shaders), the UBO size limitation on mobile (16KB) means you can't use it alone as a generic data offset.
But, we are sub-allocating N draws to the same massive UBO using a bump allocator already. Subsequent draws are next to each other in memory. We can change the UBO offset binding every 16KB and use gl_BaseInstance to index inside the 16KB region...
This works fine and provides good performance (the best I measured), since the frequency of API calls to update draw data bindings is only once per 16KB. The problem is that gl_BaseInstance is not supported on all platforms...
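A Vulkan-flavored sketch of the 16KB-window trick (my assumptions: the shaderDrawParameters feature so GLSL sees gl_BaseInstance, and a per-draw stride padded to 256 bytes so it divides the window evenly). The shader would declare a fixed-size uniform array and read draws[gl_BaseInstance]:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

void drawRange(VkCommandBuffer cmd, VkPipelineLayout layout, VkDescriptorSet set,
               uint32_t drawCount, uint32_t indexCount)
{
    constexpr uint32_t kWindow = 16 * 1024;   // Android min-spec UBO binding size
    constexpr uint32_t kStride = 256;         // bump-allocation stride per draw
    uint32_t boundWindow = UINT32_MAX;

    for (uint32_t i = 0; i < drawCount; ++i) {
        uint32_t byteOffset = i * kStride;
        uint32_t window = byteOffset / kWindow;
        if (window != boundWindow) {          // binding API call only once per 16KB
            uint32_t dynOffset = window * kWindow;
            vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                                    0, 1, &set, 1, &dynOffset);
            boundWindow = window;
        }
        uint32_t indexInWindow = (byteOffset % kWindow) / kStride;
        // firstInstance becomes gl_BaseInstance in the shader.
        vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, indexInWindow);
    }
}
```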
Currently we have base instance support only in Metal (A9 and higher), Vulkan, GLES and OpenGL. DX11 and DX12 do not support it. WebGL 2.0 and WebGPU do not support it.
This concludes the per-draw (small + dynamic) inputs. But we are not done. We also must support dynamic lower-frequency inputs for each render pass, and persistent inputs, which we want to delta update on demand. I will write another thread for this topic.
In many old renderers, dynamic draw data was the only data. If you wanted to provide global data such as camera matrices, time, sun light properties, etc., you simply copied it to the same uniform buffer with all the per-draw data, adding a lot of extra CPU->GPU transfers.
For best performance, we want to separate this data from per-draw data and upload it once per render pass (or at some other lower frequency). This means that we have dynamic bump allocated data at two frequencies: draw and pass. Draw data is already discussed in the other thread.
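Continuing the GLES-style sketch from earlier, the two frequencies could be served from the same big UBO via two binding slots (slots and structs are my assumptions); pass data is bound once per render pass, and only the draw-frequency slot changes per draw:

```cpp
struct PassUniforms { float viewProj[16]; float sunDirTime[4]; };

void beginRenderPass(size_t passOffset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/0, gBigUBO,
                      (GLintptr)passOffset, sizeof(PassUniforms));
}

void bindPerDraw(size_t drawOffset)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/1, gBigUBO,
                      (GLintptr)drawOffset, sizeof(DrawUniforms));
}
```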
Seems that I have to bite the bullet: build a WebGL 2.0 backend for my new renderer.
This will require some API changes. I hoped I wouldn't have to do it, but WebGPU is still not available. Red on every browser, on both mobile and desktop.
WebGL 2.0 is GLES 3.0 with one major difference: there's no map/unmap at all.
My new renderer exposes persistently mapped buffers, which is the most advanced way to write to buffers. Map/unmap requires you to unmap before you can use the buffer...
But WebGL 2.0 doesn't even have map/unmap. You have to call gl.bufferData every time you want to set buffer data from the CPU, and it replaces the existing data. There's no way to get a pointer to a GPU-visible buffer, even for a short duration.
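One way to keep the renderer's persistent-map interface alive on WebGL 2.0 is a hypothetical emulation layer (sketched in GLES-style C++): "mapping" returns a CPU shadow pointer that is always valid, and a per-frame flush pushes the written range to the GPU:

```cpp
#include <GLES3/gl3.h>
#include <cstdint>
#include <vector>

struct EmulatedMappedBuffer {
    GLuint handle = 0;
    std::vector<uint8_t> shadow;                 // CPU-side shadow copy

    uint8_t* map() { return shadow.data(); }     // behaves like a persistent map

    void flush(size_t offset, size_t size)       // once per frame, before draws
    {
        glBindBuffer(GL_UNIFORM_BUFFER, handle);
        glBufferSubData(GL_UNIFORM_BUFFER, (GLintptr)offset,
                        (GLsizeiptr)size, shadow.data() + offset);
    }
};
```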
It's sad that you can't ask for the maximum buffer alignment in Vulkan. I can ask alignment requirements for SSBO and UBO bind offsets, but not for vertex buffer offsets, for example.
Required alignment for vertex buffers is 256 bytes on Nvidia 2080...
I need to know the maximum buffer alignment requirement. Currently my hack for this is to create a buffer (with no bound memory) with all the usage flags set and query its memory requirements. I am assuming this gives the worst-case alignment (sketch after the list below).
Is there any better way to do it in Vulkan?
There are properties for these:
minMemoryMapAlignment
minStorageBufferOffsetAlignment
minTexelBufferOffsetAlignment
minUniformBufferOffsetAlignment
optimalBufferCopyOffsetAlignment
optimalBufferCopyRowPitchAlignment
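The hack described above, as a sketch: create a throwaway buffer with all the usage flags set, query its memory requirements, and treat the reported alignment as the worst case (an assumption, not a spec guarantee):

```cpp
#include <vulkan/vulkan.h>

VkDeviceSize queryWorstCaseBufferAlignment(VkDevice device)
{
    VkBufferCreateInfo info{VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
    info.size  = 65536;   // arbitrary; only the alignment matters
    info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT |
                 VK_BUFFER_USAGE_UNIFORM_TEXEL_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_TEXEL_BUFFER_BIT |
                 VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                 VK_BUFFER_USAGE_INDEX_BUFFER_BIT | VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
                 VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT;
    info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer probe = VK_NULL_HANDLE;
    vkCreateBuffer(device, &info, nullptr, &probe);   // no memory is ever bound

    VkMemoryRequirements req{};
    vkGetBufferMemoryRequirements(device, probe, &req);
    vkDestroyBuffer(device, probe, nullptr);
    return req.alignment;
}
```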
"i" is the standard variable name for the loop index. I will continue using it to keep my code easy to read.
But when loops get too long (you can't see the loop start on the same screen) you might want to consider extracting the loop body or using a more descriptive variable name.
Also, I don't really like i,j inner loops. If I have more than one level I prefer more descriptive names, unless the loop is just a few lines long and trivial to read. I also use x,y,z in loops that iterate over a spatial data structure such as a texture.
Having a lot of loop levels is a compelling reason to extract code too. Then you can use i in both loops, and properly name the outer i that you pass to the inner function in its signature.
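A hypothetical example of that extraction: "i" stays local to each function, and the outer index gets a descriptive name in the inner function's signature:

```cpp
constexpr int kRowCount    = 256;
constexpr int kColumnCount = 256;

void processTexel(int rowIndex, int columnIndex) { /* ... */ }

void processRow(int rowIndex)
{
    for (int i = 0; i < kColumnCount; ++i)
        processTexel(rowIndex, i);
}

void processTexture()
{
    for (int i = 0; i < kRowCount; ++i)
        processRow(i);
}
```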