Managed to generate binding ids for the generated GLSL shader for GLES3/WebGL2 using the SPIRV-Cross API.
GLES doesn't have descriptor sets, so I have to generate a flat contiguous range of binding ids per set and store each set's start index. The runtime binds N slots at a time (bind groups).
I also have to dump a mapping table for the combined samplers in the shader, since our renderer uses separate samplers and images.
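A rough sketch of how this can look with the SPIRV-Cross C++ API (the function name, the setStartIndex scheme, and the mapping struct are my own illustration, not the actual HypeHype code):

```cpp
#include <cstdint>
#include <vector>
#include <spirv_cross/spirv_glsl.hpp>

// Sketch: flatten (set, binding) pairs into contiguous GLES binding ids and dump
// a mapping table for the combined image/samplers SPIRV-Cross creates for GLSL.
struct CombinedSamplerMapping
{
    uint32_t imageSet, imageBinding;      // original separate image
    uint32_t samplerSet, samplerBinding;  // original separate sampler
    uint32_t glslBinding;                 // binding of the generated combined sampler
};

void remapForGLES(spirv_cross::CompilerGLSL& compiler,
                  const uint32_t* setStartIndex,  // flat start index per descriptor set
                  std::vector<CombinedSamplerMapping>& outMappings)
{
    // Our SPIR-V uses separate images and samplers; GLSL for GLES needs them combined.
    compiler.build_combined_image_samplers();

    uint32_t nextCombinedBinding = 0;
    for (const auto& combined : compiler.get_combined_image_samplers())
    {
        CombinedSamplerMapping m;
        m.imageSet       = compiler.get_decoration(combined.image_id, spv::DecorationDescriptorSet);
        m.imageBinding   = compiler.get_decoration(combined.image_id, spv::DecorationBinding);
        m.samplerSet     = compiler.get_decoration(combined.sampler_id, spv::DecorationDescriptorSet);
        m.samplerBinding = compiler.get_decoration(combined.sampler_id, spv::DecorationBinding);
        m.glslBinding    = nextCombinedBinding++;
        compiler.set_decoration(combined.combined_id, spv::DecorationBinding, m.glslBinding);
        outMappings.push_back(m);
    }

    // Flatten uniform buffer bindings: set N starts at setStartIndex[N].
    spirv_cross::ShaderResources resources = compiler.get_shader_resources();
    for (const auto& ubo : resources.uniform_buffers)
    {
        uint32_t set     = compiler.get_decoration(ubo.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(ubo.id, spv::DecorationBinding);
        compiler.set_decoration(ubo.id, spv::DecorationBinding, setStartIndex[set] + binding);
        compiler.unset_decoration(ubo.id, spv::DecorationDescriptorSet);
    }
}
```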
Our textures and bind groups are both immutable, so I can just store 2x GLint (64 bits) per combined sampler. It's easy to offset-allocate them all in one big buffer; a bind group just stores a start offset and a count.
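A possible layout for that (names are my own, just illustrating the idea):

```cpp
#include <cstdint>
#include <GLES3/gl3.h>

// Sketch: 2x GLint (64 bits) per combined sampler, offset-allocated into one shared
// array. Since textures and bind groups are immutable, these never need patching.
struct CombinedSamplerSlot
{
    GLint texture;  // GL texture name
    GLint sampler;  // GL sampler object name
};

struct BindGroupSamplers
{
    uint32_t startOffset;  // first CombinedSamplerSlot in the shared array
    uint32_t count;        // number of combined samplers in this bind group
};
```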
GLES doesn't have generic buffer bindings like modern APIs. Different buffer types have to be bound to separate targets, which is a bit annoying. We also need a bit of metadata in the buffer bindings to be able to bind them correctly.
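For example, each buffer binding has to carry its GL target (a hypothetical sketch, not the actual backend code):

```cpp
#include <cstdint>
#include <GLES3/gl3.h>

// Sketch: GLES buffers are bound to type-specific indexed targets, so the binding
// metadata includes the target enum in addition to the slot, buffer and range.
struct BufferBinding
{
    GLenum     target;  // e.g. GL_UNIFORM_BUFFER (SSBOs would need GLES 3.1+)
    GLuint     slot;    // binding index within that target's index space
    GLuint     buffer;  // GL buffer object name
    GLintptr   offset;
    GLsizeiptr size;
};

void bindBufferGroup(const BufferBinding* bindings, uint32_t count)
{
    // One GL call per buffer; this is the unavoidable per-resource GLES overhead.
    for (uint32_t i = 0; i < count; ++i)
    {
        const BufferBinding& b = bindings[i];
        glBindBufferRange(b.target, b.slot, b.buffer, b.offset, b.size);
    }
}
```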
I am confident I can store bind group bindings in a tight space and write optimal binding code. But it's not going to be as fast as Vulkan and Metal, since GLES requires a separate binding call per resource.
But it should be faster than traditional renderers requiring multiple software command buffer calls. We just write one uint32 to our software command buffer when we bind a group, even in GLES. And the group likely fits in one cache line. It's basically 100% GLES driver overhead.
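A rough idea of that encoding (illustrative, not the actual command stream format):

```cpp
#include <cstdint>
#include <vector>

// Sketch: binding a group appends one uint32 to the software command buffer.
// The GLES backend later expands it into the per-resource GL binding calls.
enum class Op : uint32_t { BindGroup = 1 /* , Draw, ... */ };

struct SoftwareCommandBuffer
{
    std::vector<uint32_t> words;

    void bindGroup(uint32_t groupHandle)
    {
        // 8-bit opcode + 24-bit bind group handle packed into a single word.
        words.push_back((uint32_t(Op::BindGroup) << 24) | (groupHandle & 0x00FFFFFFu));
    }
};
```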
NOTE: In the original image, there's the same binding index for the UBO and the first texture. This is intentional. GLES binding indices are type specific. Vulkan and Metal instead have shared binding slots for all resources.
Splatting gaussians instead of ray-marching. Reminds me of particle based renderer experiments. Interesting to see whether gather or scatter algos win this round.
C++20 designated initializers, C++11 struct default values, and a custom span type (with support for initializer lists) are a good combination for graphics resource creation:
Declaring default values with C++11 aggregate initialization syntax is super clean. All the API code you need is this struct. No need to implement builders or other code bloat that you need to maintain.
The C++20 span type doesn't support initializer lists, so you have to create your own. This is because initializer list lifetime is very short, and it's easy to accidentally use a dead list. I use "const &&" in the resource creation APIs to force a temporary object.
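The original tweet showed the code as an image; a minimal sketch of the same pattern (all names here are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>

enum class Format : uint8_t { RGBA8, RGBA16F };
struct TextureHandle { uint32_t index = 0; };

// Minimal span that accepts initializer lists (std::span doesn't).
template <typename T>
struct Span
{
    const T* data = nullptr;
    size_t   size = 0;

    Span() = default;
    Span(std::initializer_list<T> list) : data(list.begin()), size(list.size()) {}
};

// All defaults declared in the struct (C++11 default member initializers).
// No builders needed: the struct is the whole creation API.
struct TextureDesc
{
    uint32_t      width    = 1;
    uint32_t      height   = 1;
    uint32_t      mipCount = 1;
    Format        format   = Format::RGBA8;
    Span<uint8_t> initialData;  // backing initializer list lives only for the full expression
};

// "const &&" only binds to temporaries, so the descriptor (and the short-lived
// initializer lists inside it) must be consumed inside this call.
TextureHandle createTexture(const TextureDesc&& desc);

// Usage with C++20 designated initializers; omitted fields keep their defaults:
// TextureHandle tex = createTexture({ .width = 256, .height = 256, .format = Format::RGBA8 });
```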
I was talking about the new DOTS hybrid renderer GPU persistent data model 2 years ago at SIGGRAPH. We calculated object inverse matrices in the data upload shader, because that was practically free. ALU is free in shaders that practically just copy data around.
On mobile, memory bandwidth is a big bottleneck, and using it wastes a lot of power. Thus I prefer to pack my data and unpack it in the shader. That's usually just a few extra ALU instructions, but you get big bandwidth gains. Performance improves and perf/watt improves.
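A tiny example of the idea (my own illustration): store a per-instance color as one RGBA8 uint instead of a float4, and spend a few ALU ops in the shader to unpack it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack a float4 color into 4 bytes on the CPU: 4 bytes uploaded instead of 16.
uint32_t packUnorm4x8(float r, float g, float b, float a)
{
    auto toByte = [](float v) {
        return uint32_t(std::lround(std::clamp(v, 0.0f, 1.0f) * 255.0f));
    };
    return toByte(r) | (toByte(g) << 8) | (toByte(b) << 16) | (toByte(a) << 24);
}

// GLSL ES 3.0 side (for reference), a few ALU instructions:
//   vec4 color = vec4(c & 255u, (c >> 8u) & 255u, (c >> 16u) & 255u, c >> 24u) / 255.0;
```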
Let's design a fast screen tile based local light solution for mobile and WebGL 2.0 (no compute). Per-object light lists sounded good until I realized that we have a terrain. Even the infinite ground plane is awkward to light with a per-object light list.
Thread...
No SSBOs. Uniform buffers are limited to 16KB (low end Android limitation). Up to 256 lights visible at once. Use the same float4 position + half4 color + half4 direction + cos angle setup that handles both point lights and directional lights. 32B * 256 lights = 8KB light array.
In addition to the light array we have a screen space light visibility grid. uint4 (16 bytes) per element as that's the minimum alignment for UBO arrays. If we use 64x64 tiles we fit the light grid to a 16KB UBO on all mobile resolutions.
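Laid out as structs (the 32-byte light format is from the tweet; the exact field packing and the use of the position w component are my own guesses):

```cpp
#include <cstdint>

// One light = 32 bytes: float4 position + half4 color + half3 direction + half cos(angle).
// The halves are stored as uint pairs and unpacked in the shader with unpackHalf2x16().
struct PackedLight
{
    float    positionX, positionY, positionZ, positionW;  // float4 (w use not specified in the tweet)
    uint32_t colorRG, colorBA;                             // half4 color
    uint32_t directionXY, directionZ_cosAngle;             // half3 direction + half cos(angle)
};
static_assert(sizeof(PackedLight) == 32, "light must stay 32 bytes");

// 256 lights * 32 B = 8 KB, well inside the 16 KB low-end UBO limit.
struct LightArrayUBO { PackedLight lights[256]; };

// Screen-space visibility grid: one uint4 (16 B) per 64x64-pixel tile, the minimum
// UBO array element alignment. Up to 1024 tiles * 16 B = 16 KB covers all mobile resolutions.
struct TileLightGridUBO { uint32_t tiles[1024][4]; };
```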
I have implemented practically all of the possible local light rendering algorithms during my career, yet I am considering a trivial per-object light list for HypeHype.
Kitbashed content = lots of small objects. Granularity seems fine.
Thread...
Set up all the visible local light source data into a UBO array at the beginning of the render pass. For each object, a uint32 contains four packed 8-bit light indices. At the beginning of the light loop, do a binary AND to take the lowest 8 bits, then shift down 8 bits (next light).
This is just a single extra uint per draw call, so the setup cost is trivial, assuming of course that it's fine to limit the light count to 4. We can use multiple uints if we want 8 or more lights per object. Not a problem.
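In C-like code the per-object loop looks roughly like this (illustrative; mirrors what the GLSL would do):

```cpp
#include <cstdint>

// Sketch of the per-object light loop: packedIndices carries up to four 8-bit
// indices into the UBO light array, consumed lowest byte first.
void shadeObjectLights(uint32_t packedIndices, uint32_t lightCount)
{
    for (uint32_t i = 0; i < lightCount; ++i)
    {
        uint32_t lightIndex = packedIndices & 0xFFu;  // lowest 8 bits = current light
        packedIndices >>= 8u;                         // shift down for the next light
        // ... fetch lights[lightIndex] from the UBO array and accumulate its contribution ...
        (void)lightIndex;
    }
}
```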
The high level agenda of my presentation currently looks like this. Each of the main topics has a lot of sub-topics, of course.
If you find anything missing that you would want to hear about, please reply in the thread.
Correction: The backend doesn't process or set up data.
The platform-specific backend code just passes handles and offsets around, so that the data provided directly by the user-land code is visible in the shaders. Zero copies, and no backend refactoring when the data layout changes.
IMPORTANT: The scope of this presentation is the low level gfx platform abstraction. The higher level rendering pipeline / algorithm code is out of scope. I will be talking about that later, of course. And that presentation is going to have a lot of pretty pixels too.