Let's talk about fast draw calls on the bottom 50% of mobile devices. And let's ignore the bottom 10%, so that we can assume Vulkan 1.0 support (= compute shaders).
Why are we not doing GPU-driven rendering? Why isn't instancing always a win?
The first thing I want to talk about is memory loads on old GPUs. Before Nvidia Turing and AMD RDNA1, all buffer loads on these vendors went through the texture samplers. Texture samplers have high latency (~100 cycles). The only exceptions were uniform buffer and scalar loads.
On Pascal and below, unfiltered 32-bit-per-channel RGBA texture loads were half rate. If you load a 4x4 fp32 matrix from memory, you basically pay the equivalent TMU cost of sampling eight RGBA8 textures. And on top of this, you need 16 vector registers to store the load results.
On older Nvidia GPUs, you wanted to lean on uniform buffers, because each SM had a very fast uniform cache, which was almost as fast as registers.
On AMD GCN, if your memory access is uniform (known at compile time), the shader compiler emits a scalar load. Scalar loads use special scalar registers and a special (very fast) scalar cache. This way you don't waste precious vector registers on your fat 4x4 matrices.
AMD used to have similar constant buffer hardware too. In their first DX11 / compute shader capable GPU (Radeon 5870), accessing a constant buffer was an order of magnitude faster than accessing groupshared memory. I ran some benchmarks back then.
But constant buffer hardware with tiny caches has bad performance issues when the address is not uniform across all lanes. It runs an inner loop, and in the worst case the cache misses, causing big stalls.
On Xbox 360 our skinning shaders were indexing an array of matrices in a uniform buffer. We only had 32 bones, but we wasted over 1 millisecond per frame due to "constant waterfalling". There was a special loop counter that allowed fast indexing in very limited cases.
Ray-tracing was an important use case that forced Nvidia and AMD to improve their raw memory load paths. Ray-tracing can't lean on precalculated UBO addresses, and it can't lean on VS->PS attribute interpolation hardware (on-chip buffer) to deliver the triangle data.
90% of mobile GPUs currently on the market (people use their phones for 2-4 years) do not have similar raw buffer load optimizations. Their design is still similar to the Xbox 360's. Using UBOs for big data (such as transform matrices) is crucial for best performance.
The bottom 50% of mobile GPUs also have tiny register files. You don't want to waste vector registers. As a result, you want to load your transform matrices and material properties from a fixed address in a uniform buffer. This is the only way to get optimal performance on the low end.
You can still put all your scene data in a big GPU buffer and keep it persistently in GPU memory. UBOs themselves don't have a size limit; UBO bindings do. You can allocate a 100MB UBO, but you can only bind a 16KB window of it with a single binding.
Changing the buffer binding offset on the CPU side is fast and can be done in many ways. You can even change the base instance offset to provide a per-draw constant to the shader (Vulkan and Metal only). This way your indexing is not dynamic indexing and the codegen is fast.
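For illustration, here's a minimal sketch of that pattern using the Vulkan C API with a dynamic-offset UBO descriptor; the helper and its parameters are hypothetical, not the actual HypeHype renderer code:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// All per-object constants live in one big, persistent uniform buffer. The descriptor
// set was created with VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC and a small fixed
// range, so the shader always indexes a fixed window and only a 32-bit offset changes.
void drawObject(VkCommandBuffer cmd, VkPipelineLayout layout, VkDescriptorSet sceneSet,
                uint32_t objectIndex, uint32_t objectStride, // stride rounded up to minUniformBufferOffsetAlignment
                uint32_t indexCount, uint32_t firstIndex, int32_t vertexOffset)
{
    uint32_t dynamicOffset = objectIndex * objectStride;
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            0 /*firstSet*/, 1, &sceneSet, 1, &dynamicOffset);
    // Alternative mentioned above: pass the object index via firstInstance instead,
    // which the shader reads as a per-draw constant on Vulkan.
    vkCmdDrawIndexed(cmd, indexCount, 1, firstIndex, vertexOffset, 0 /*firstInstance*/);
}
```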
Using the instance index always forces the shader compiler to assume the index is dynamic, even if you render single-instance draw calls. This means you pay 16 vector registers per 4x4 matrix, and you must use slow memory loads (possibly going through the sampler).
Similarly, in GPU-driven rendering you must calculate the cluster index -> object index mapping in the shader. This is not compile-time uniform. Thus you get slow memory loads and waste vector registers loading those big 4x4 matrices.
Additionally, in GPU-driven rendering, especially when using a V-buffer, you must load the geometry and instance data manually in the pixel shader, three times per pixel (once per visible triangle vertex). This is a massive number of slow memory loads per pixel. It's the equivalent of sampling 20+ textures per pixel!
In addition to this, deferred texturing is slow on older hardware. You need analytic gradients with deferred texturing. The added ALU is not a big problem, but these GPUs often have 1/8-rate or even slower SampleGrad. Ray-tracing needs fast SampleGrad, but these are not RT GPUs.
Thus our solution for fast rendering on low/mid-tier Android phones (50% of the market today) is to lean on traditional draw calls. Make draw calls as fast as possible on both the CPU and GPU side. A well-optimized solution can push 10,000 draw calls at 60 fps on a $99 Android phone.
To clarify: I am talking about 10,000 draw calls, each with a unique mesh and a different material (different texture bindings). Changing the PSO at high frequency is of course out of the question. We optimize for low PSO count and bin by the PSO. Mobile has HW Z sort.
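As an aside, here's a minimal sketch of what binning by PSO can look like on the CPU side; the key layout and field widths are my assumptions, not HypeHype's actual scheme:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// 64-bit sort key with the PSO id in the top bits, so draws sharing a PSO end up
// adjacent after one sort and pipeline changes stay rare across 10,000 draw calls.
struct DrawRecord
{
    uint64_t sortKey;
    uint32_t drawIndex; // index into the per-frame draw list
};

inline uint64_t makeSortKey(uint32_t psoId, uint32_t materialId, uint32_t meshId)
{
    return (uint64_t(psoId) << 48) | (uint64_t(materialId & 0xffffff) << 24) | (meshId & 0xffffff);
}

inline void binDrawsByPso(std::vector<DrawRecord>& draws)
{
    std::sort(draws.begin(), draws.end(),
              [](const DrawRecord& a, const DrawRecord& b) { return a.sortKey < b.sortKey; });
}
```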
The other topic about mobile performance I wanted to talk about is related to TBDR: tile memory and framebuffer compression. Memory bandwidth is limited, and using it consumes a lot of energy (hot device, drained battery).
The Xbox 360 had very limited 22.4 GB/s memory bandwidth back in the day: DDR memory shared between the CPU and GPU. But there was fast EDRAM for render target / Z-buffer purposes. Mobile tile memory serves the same purpose.
A mobile GPU bins all triangles into tiles and then renders one tile at a time. All color render targets (including MRTs) and the Z-buffer of that tile live in fast on-chip memory. Overdraw is cheap, because you don't need memory roundtrips over and over again.
Blending is also cheap and mobile GPUs can do fully programmable blending. They can just read the existing tile memory pixel in the next shader. This can be used to implement on-chip G-buffer rendering + lighting for example. G-buffers don't hit the main memory at all.
Resolving tiles (or the EDRAM render target on Xbox 360) to main memory is a slow operation. And reading them back (using the texture sampler) from main memory is also slow. So you want to minimize the number of resolve operations (= render passes).
Mobile GPUs do have lossless framebuffer compression, which reduces the bandwidth usage for tile resolves. Newest mobile GPUs also have (slightly) lossy framebuffer compression, which further reduces the bandwidth cost.
The existence of tile memory (= cheap overdraw) and framebuffer compression means that you want to use traditional VS + PS render passes instead of full-screen compute passes, since compute pays the full bandwidth cost (no compression) and not all mobile GPUs have groupshared memory.
Also, if you write using atomics, there are additional performance concerns on mobile GPUs. Nvidia and AMD have been prioritizing compute performance for years on desktop/server. They used to have slow atomics too, but nowadays their atomics are fast enough for Nanite's SW raster.
As a result, the optimal way to use mobile HW is to lean on the power-optimized TBDR hardware and write code that makes the most of this design. On PC/console, you instead want to lean on fast atomics, fast raw memory loads, and flexible indirect drawing, and write your own GPU-driven renderer.
IMPORTANT: I am discussing the bottom 50% of Vulkan-capable mobile devices here, which is a super important market. The latest flagship GPUs have already started improving on these things, and when ray-tracing hits mobile GPUs, we will see further improvements.
But it will take at least 2-3 years until GPU-driven rendering is the most optimal way of rendering on the majority of phones. We will of course ship a GPU-driven renderer for mobile and web as soon as that happens. But even today the GE8320 is selling a lot (in $99 phones).
@NOTimothyLottes This is why mobile games today don't yet look as good as the best Xbox 360 and PS3 games. People wrote highly optimized code solely for those platforms, and for current mobile phones, they reuse old code written for desktops, which wasn't designed for the HW or the gfx API.
Splatting Gaussians instead of ray-marching. Reminds me of particle-based renderer experiments. Interesting to see whether gather or scatter algorithms win this round.
C++20 designated initializers, C++11 struct default values, and a custom span type (with support for initializer lists) are a good combination for graphics resource creation:
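Something along these lines; the types and fields below are a hypothetical sketch, not the actual API:

```cpp
#include <cstdint>
#include <cstdio>

enum class Format : uint8_t { RGBA8, BC7, D32 };

// C++11 default member initializers give every field a sensible default.
struct TextureDesc
{
    uint32_t    width     = 1;
    uint32_t    height    = 1;
    uint32_t    mipCount  = 1;
    Format      format    = Format::RGBA8;
    const char* debugName = "unnamed";
};

// The creation API just takes the descriptor struct; no builder classes needed.
void createTexture(const TextureDesc& desc)
{
    std::printf("create %s: %ux%u, %u mips\n",
                desc.debugName, desc.width, desc.height, desc.mipCount);
}

int main()
{
    // C++20 designated initializers: the call site overrides only what differs.
    createTexture({ .width = 1024, .height = 1024, .format = Format::BC7,
                    .debugName = "albedo" });
}
```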
Declaring default values with C++11 default member initializers is super clean. All the API code you need is this struct. No need to implement builders or other code bloat that you'd have to maintain.
The C++20 span type doesn't support initializer lists, so you have to create your own. This is because an initializer list's lifetime is very short: it's easy to end up using a dead list. I use "const &&" in the resource creation APIs to force a temporary object.
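A minimal sketch of such a span and the "const &&" trick (names and element types are placeholders):

```cpp
#include <cstddef>
#include <initializer_list>

// Non-owning span that also accepts an initializer list at the call site.
template <typename T>
struct Span
{
    const T* items = nullptr;
    size_t   count = 0;

    Span() = default;
    Span(const T* i, size_t c) : items(i), count(c) {}
    Span(std::initializer_list<T> list) : items(list.begin()), count(list.size()) {}
};

struct BindGroupDesc
{
    Span<int> textures; // placeholder element type
};

// "const &&" binds only to temporaries, so the initializer list (and the descriptor
// that points into it) can't accidentally be stored and used after it's dead.
void createBindGroup(const BindGroupDesc&& desc)
{
    // Consume desc.textures.items / desc.textures.count right here, before returning.
    (void)desc;
}

int main()
{
    createBindGroup({ .textures = { 1, 2, 3 } }); // OK: the whole expression is temporary

    // BindGroupDesc stored{ .textures = { 1, 2, 3 } }; // the backing list dies after this line
    // createBindGroup(stored);                         // and this won't compile: lvalue can't bind to const&&
}
```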
Managed to generate binding IDs for the generated GLSL shader for GLES3/WebGL2 using the SPIRV-Cross API.
GLES doesn't have descriptor sets, so I must generate a flat contiguous range of binding IDs per set and store each set's start index. The runtime binds N slots at a time (bind groups).
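A sketch of that remapping using the SPIRV-Cross C++ reflection API; the flattening scheme and helper are my own assumptions, only the SPIRV-Cross calls themselves are real:

```cpp
#include "spirv_glsl.hpp" // SPIRV-Cross
#include <cstdint>
#include <vector>

// GLES has no descriptor sets, so assign every resource a slot in one flat, contiguous
// binding range, grouped by set, and remember where each set's range starts.
std::vector<uint32_t> flattenSets(spirv_cross::CompilerGLSL& compiler, uint32_t numSets)
{
    spirv_cross::ShaderResources res = compiler.get_shader_resources();
    std::vector<uint32_t> setStart(numSets, 0);
    uint32_t nextSlot = 0;

    for (uint32_t set = 0; set < numSets; ++set)
    {
        setStart[set] = nextSlot;
        auto remap = [&](const auto& resources)
        {
            for (const auto& r : resources)
            {
                if (compiler.get_decoration(r.id, spv::DecorationDescriptorSet) != set)
                    continue;
                compiler.set_decoration(r.id, spv::DecorationBinding, nextSlot++);
                compiler.unset_decoration(r.id, spv::DecorationDescriptorSet);
            }
        };
        remap(res.uniform_buffers);
        remap(res.separate_images);
        remap(res.separate_samplers);
    }
    return setStart; // per-set start index; the runtime binds a whole set's range at once
}
```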
I also must dump a mapping table for the combined samplers in the shader. Our renderer has separate samplers and images.
Our textures and bind groups are both immutable, so I can just store 2x GLint per combined sampler. That's 64 bits per combined sampler. Easy to offset-allocate them all in a big buffer. Each bind group has a start offset and count.
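The mapping table itself can be as simple as this (a sketch of the layout described above; the names are mine):

```cpp
#include <cstdint>
#include <vector>

using GLint = int32_t; // stand-in so the sketch compiles without GL headers

// One entry per combined sampler: which texture slot and which sampler slot of the
// bind group it was fused from. 2x GLint = 64 bits, exactly as described above.
struct CombinedSampler
{
    GLint textureSlot;
    GLint samplerSlot;
};
static_assert(sizeof(CombinedSampler) == 8, "64 bits per combined sampler");

// All bind groups offset-allocate their entries from one big immutable array.
struct BindGroup
{
    uint32_t combinedStart; // offset into gCombinedSamplers
    uint32_t combinedCount;
};

std::vector<CombinedSampler> gCombinedSamplers;
```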
I was talking about the new DOTS hybrid renderer GPU persistent data model 2 years ago at SIGGRAPH. We calculated object inverse matrices in the data upload shader, because that was practically free. ALU is free in shaders that practically just copy data around.
On mobile, memory bandwidth is a big bottleneck, and using it wastes a lot of power. Thus I prefer to pack my data and unpack it in the shader. That's usually just a few extra ALU ops, but you get big bandwidth gains. Performance improves and perf/watt improves.
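For example, a tiny CPU-side packer; this particular 10:10:10 format is just an illustration, not necessarily what HypeHype uses:

```cpp
#include <cmath>
#include <cstdint>

// Quantize a unit vector into 10 bits per component: 4 bytes instead of 12 bytes of
// fp32, and the shader unpacks it with a handful of ALU ops.
inline uint32_t packUnitVector(float x, float y, float z)
{
    auto q = [](float v) { return uint32_t(std::lround((v * 0.5f + 0.5f) * 1023.0f)) & 0x3ffu; };
    return q(x) | (q(y) << 10) | (q(z) << 20); // top 2 bits left free (e.g. a sign bit)
}
```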
Let's design a fast screen-tile-based local light solution for mobile and WebGL 2.0 (no compute). A per-object light list sounded good until I realized that we have terrain. Even an infinite ground plane is awkward to light with a per-object light list.
Thread...
No SSBOs. Uniform buffers are limited to 16KB (a low-end Android limitation). Up to 256 lights visible at once. Use the same float4 position + half4 color + half4 direction + cos angle setup that handles both point lights and directional lights. 32B * 256 lights = 8KB light array.
In addition to the light array, we have a screen-space light visibility grid: uint4 (16 bytes) per element, as that's the minimum alignment for UBO array elements. If we use 64x64 pixel tiles, the light grid fits in a 16KB UBO at all mobile resolutions.
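In struct form, the layouts described above look roughly like this; the 32B light format and the uint4 grid element come from the thread, while the exact field meanings are my interpretation:

```cpp
#include <cstdint>

struct PackedLight
{
    float    position[4];  // float4 position                       -> 16 bytes
    uint16_t color[4];     // half4 color                           -> 8 bytes
    uint16_t direction[4]; // half4: direction + cos(cone angle)    -> 8 bytes
};
static_assert(sizeof(PackedLight) == 32, "32B * 256 lights = 8KB light array");

struct LightGridCell
{
    uint32_t lightBits[4]; // uint4 = 16 bytes, the minimum UBO array element alignment
};
static_assert(sizeof(LightGridCell) == 16, "one cell per 64x64 pixel screen tile");

// 16KB UBO / 16B per cell = 1024 cells. For example, a 2400x1080 screen needs
// ceil(2400/64) * ceil(1080/64) = 38 * 17 = 646 cells, so it fits comfortably.
```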
I have implemented practically all the possible local light rendering algorithms during my career, yet I am considering a trivial per-object light list for HypeHype.
Kitbashed content = lots of small objects. Granularity seems fine.
Thread...
Set up all the visible local light source data in a UBO array at the beginning of the render pass. For each object, a uint32 contains four packed 8-bit light indices. At the beginning of the light loop, do a binary AND to take the lowest 8 bits, then shift down by 8 bits (next light).
This is just a single extra uint per draw call. Setup cost is trivial. Assuming, of course, that it's fine to limit the light count to 4 per object. We can use multiple uints if we want 8 or more lights per object. Not a problem.
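The decode is just a couple of instructions; here it is in C++ form (the shader-side loop does the same math):

```cpp
#include <cstdint>

// 'packedIndices' is the single extra uint per draw call: four 8-bit light indices.
void forEachObjectLight(uint32_t packedIndices)
{
    for (int i = 0; i < 4; ++i)
    {
        uint32_t lightIndex = packedIndices & 0xffu; // binary AND takes the lowest 8 bits
        packedIndices >>= 8;                         // shift down 8 bits -> next light
        // ... evaluate lightArray[lightIndex] here ...
        (void)lightIndex;
    }
}
```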