LaunchJob() schedules a job. The job becomes eligible for execution once the given array of existing jobs has finished. The scheduler only runs jobs that have no resource conflicts: for each resource, either 1x ReadWrite or multiple ReadOnly accesses...
Resources are 64 bit void pointers. The job body is declared as a lambda function that can capture data from the enclosing scope. It's common to capture the resources and some other data by value. You can also capture a pointer to temporarily allocated storage (linear frame allocator).
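A minimal sketch of what such an interface could look like, purely as an assumption on my part; the names (JobHandle, Resource, ResourceAccess, ReadOnly, ReadWrite, Span, LaunchJob) are placeholders, not the actual API.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>

using JobHandle = uint32_t;

// A resource is identified by a 64 bit pointer.
struct Resource { const void* ptr; };

struct ResourceAccess
{
    Resource resource;
    bool     readWrite;   // true = exclusive (1x), false = shared (Nx)
};

inline ResourceAccess ReadOnly(const void* p)  { return { { p }, false }; }
inline ResourceAccess ReadWrite(const void* p) { return { { p }, true  }; }

// Non-owning view, constructible from an initializer_list
// (see the technical tidbits later in the thread).
template<typename T>
struct Span
{
    const T* data  = nullptr;
    size_t   count = 0;
    Span() = default;
    Span(std::initializer_list<T> list) : data(list.begin()), count(list.size()) {}
};

// Copies the job description into the scheduler and returns immediately.
// The lambda runs later, once the dependencies have finished and the
// requested resource accesses don't conflict with running jobs.
template<typename Fn>
JobHandle LaunchJob(Span<JobHandle> dependencies,
                    Span<ResourceAccess> resources,
                    Fn&& body);
```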
If you want to run an ECS on top of this, each component table in the ECS would be a resource (a 64 bit pointer to an object). ReadWrite access ensures that only one job at a time can modify a given component table. Multiple ReadOnly accesses run concurrently.
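A hypothetical usage sketch on top of the interface above; ComponentTable, the component types and their operators are all made up for illustration.

```cpp
void ScheduleIntegration(ComponentTable<Velocity>* velocities,
                         ComponentTable<Position>* positions,
                         float dt)
{
    JobHandle integrate = LaunchJob(
        {},                                              // no job dependencies
        { ReadOnly(velocities), ReadWrite(positions) },  // table pointers as resources
        [=]()                                            // capture by value
        {
            // Exclusive writer of 'positions', shared reader of 'velocities'.
            for (size_t i = 0; i < positions->size(); ++i)
                positions->at(i) += velocities->at(i) * dt;
        });

    (void)integrate; // could be passed on as a dependency for later jobs
}
```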
There are no mutexes/semaphores in this system. You use resource accesses to control what can run concurrently. User land code has no naked sync primitives, never stalls waiting for a mutex, and never deadlocks.
Jobs can spawn other jobs. If you want to elevate your resource access, you spawn a continuation job with a different set of ReadOnly/ReadWrite resources. The scheduler runs that job once the resources are free to use.
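For example (continuing the hypothetical sketch above), a job holding only ReadOnly access could elevate to ReadWrite by spawning a continuation:

```cpp
void ScheduleScanAndFixup(ComponentTable<Position>* positions)
{
    LaunchJob({}, { ReadOnly(positions) }, [=]()
    {
        bool needsFixup = /* scan 'positions' read-only */ true;
        if (needsFixup)
        {
            // Continuation with elevated access: the scheduler runs it
            // once no other job is reading or writing 'positions'.
            LaunchJob({}, { ReadWrite(positions) }, [=]()
            {
                // Exclusive access to the table here.
            });
        }
    });
}
```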
This design doesn't need a main thread or render thread at all. It's actually preferable to avoid the main thread completely, as that solves all main thread bottlenecks elegantly. You could have a job that spawns the main jobs needed for a frame, and then it spawns itself again.
For convenience there would be a function to combine job dependencies. Like this:
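The original tweet's code image isn't included here; below is a hedged guess at what the combine helper and its use might look like, with CombineJobs and the job names as placeholders on top of the earlier sketch.

```cpp
// Completes when every job in the list has completed.
JobHandle CombineJobs(Span<JobHandle> jobs);

JobHandle animation = LaunchJob({}, {}, []{ /* ... */ });
JobHandle physics   = LaunchJob({}, {}, []{ /* ... */ });
JobHandle rendering = LaunchJob({ physics }, {}, []{ /* ... */ });

JobHandle frameFinished = CombineJobs({ animation, physics, rendering });
```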
The job that starts next frame and kicks all next frame jobs would depend on frameFinished.
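Tying it together (still the same hypothetical names): the frame kick job depends on frameFinished and schedules itself again, so no main thread is needed.

```cpp
void KickFrame(JobHandle previousFrameFinished)
{
    LaunchJob({ previousFrameFinished }, {}, []()
    {
        // Spawn the main jobs needed for this frame...
        JobHandle animation = LaunchJob({}, {}, []{ /* ... */ });
        JobHandle physics   = LaunchJob({}, {}, []{ /* ... */ });
        JobHandle rendering = LaunchJob({ physics }, {}, []{ /* ... */ });

        JobHandle frameFinished = CombineJobs({ animation, physics, rendering });

        // ...and spawn the kick job for the next frame.
        KickFrame(frameFinished);
    });
}
```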
The low level scheduler has one worker thread per core. Threads execute jobs from their own queue. Job stealing is lock-free: when a thread's own queue runs out of work, it steals half of the jobs from another thread (lazy binary splitting algorithm).
Resources and dependencies are resolved in the high level scheduler, which runs when a thread fails to steal work (nothing is left anywhere). The thread that finishes first runs the dependency solver and fills the work stealing queues of all threads so they can resume ASAP. This way we avoid a bubble.
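A rough control-flow sketch of that worker loop. All types and helpers here (Worker, Job, StealHalfFrom, PickVictim, HighLevelScheduler) are invented placeholders to illustrate the flow, not the real implementation.

```cpp
void WorkerLoop(Worker& self, Worker* workers, size_t workerCount,
                HighLevelScheduler& highLevel)
{
    for (;;)
    {
        // 1. Drain our own queue first.
        Job* job = self.queue.Pop();

        // 2. Out of work: steal half of a victim's remaining jobs
        //    (lazy binary splitting) into our own queue.
        if (!job)
            job = StealHalfFrom(PickVictim(workers, workerCount), self.queue);

        // 3. Nothing left to steal anywhere: this thread finished first,
        //    so it resolves dependencies/resources and refills the
        //    per-thread queues, avoiding a bubble.
        if (!job)
        {
            highLevel.ResolveAndDistribute(workers, workerCount);
            continue;
        }

        job->Run();
    }
}
```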
That's it basically. I don't think we need anything more complex than that. Implementation should be <1000 lines of standard C/C++ code.
Some technical tidbits:
The examples use initializer_lists to provide the dependencies and resources. This is enabled by using a slice/span abstraction in the interface. No memory allocations are required to provide this variable length data.
C/C++ guarantees that temporary objects created in a function call's argument list live until the end of the full expression, i.e. past the function's return. This means an initializer_list is safe to pass to a slice/span function parameter.
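Building on the Span sketch from earlier in the thread, a small example of why this works (Schedule is a placeholder name):

```cpp
// The braced list below materializes a temporary backing array on the
// caller's side; it stays alive until the full call expression ends,
// so the non-owning span can safely read it. No heap allocation happens.
void Schedule(Span<JobHandle> dependencies);

void Caller(JobHandle a, JobHandle b)
{
    Schedule({ a, b });
    // Note: the callee must copy what it needs (e.g. into the ring
    // buffer) and must not keep the span past the call.
}
```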
We don't use std::function to store the lambda, since std::function allocates memory (small size optimization only covers 64 bits IIRC). Instead we use a custom function implementation that doesn't allocate.
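A minimal sketch of such a non-allocating wrapper, assuming a fixed inline capacity; this is my illustration of the idea, not the actual implementation (copy/move support omitted for brevity).

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template<size_t Capacity = 64>
class InplaceJobFunction
{
public:
    template<typename Fn>
    InplaceJobFunction(Fn&& fn)
    {
        using Body = typename std::decay<Fn>::type;
        static_assert(sizeof(Body) <= Capacity, "captures must fit inline");
        static_assert(alignof(Body) <= alignof(std::max_align_t), "alignment");
        new (storage) Body(std::forward<Fn>(fn));       // construct in place
        invoke  = [](void* p) { (*static_cast<Body*>(p))(); };
        destroy = [](void* p) { static_cast<Body*>(p)->~Body(); };
    }
    ~InplaceJobFunction() { destroy(storage); }
    void operator()() { invoke(storage); }

private:
    alignas(std::max_align_t) unsigned char storage[Capacity];
    void (*invoke)(void*)  = nullptr;
    void (*destroy)(void*) = nullptr;
};

// Usage: the capture is stored inline, no heap allocation.
//   int x = 42;
//   InplaceJobFunction<> job([x]() { /* use x */ });
//   job();
```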
The dependency/resource arrays and the lambda captured variables are stored in a big ring buffer inside the scheduler. This is just a bump allocator (offset += size). Scheduling and executing jobs doesn't allocate any memory (no malloc/new).
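A sketch of that kind of ring buffer bump allocator, with details (alignment handling, wrap policy) assumed by me:

```cpp
#include <cstddef>
#include <cstdint>

class RingBumpAllocator
{
public:
    RingBumpAllocator(void* memory, size_t size)
        : base(static_cast<uint8_t*>(memory)), capacity(size) {}

    // 'alignment' must be a power of two. Real code must also guarantee
    // that wrapped-over data has already been consumed by finished jobs.
    void* Allocate(size_t size, size_t alignment)
    {
        size_t start = (offset + alignment - 1) & ~(alignment - 1);
        if (start + size > capacity)
            start = 0;                    // wrap around to the beginning
        offset = start + size;            // the whole allocator: offset += size
        return base + start;
    }

private:
    uint8_t* base     = nullptr;
    size_t   capacity = 0;
    size_t   offset   = 0;
};
```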
It's a very simple and efficient algorithm. Split the remaining work to half. That's it basically :)
Clarification: The scheduled jobs are stored immediately in the ring buffer and the function returns to the caller. Execution of the jobs starts once the higher level scheduler has resolved the dependencies (when the work queues are empty) and the jobs have been pushed to the per-thread queues.
Let's have a thought experiment: All future gaming CPUs will be like 5800X3D. CPU vendors will ensure that game working sets fit in LLC. You pay main memory bandwidth only when temporal coherence is not perfect.
Let's start with the most discussed OOP performance flaws: pointer chains and partial cache lines. Pointer targets are in the LLC since they were also accessed last frame. Partial cache line reads are not as bad either, as the remaining data will be read later this frame (still in cache).
You still pay extra cost for pointer access and partial cache line reads. LLC is slower than L1$, but the extra cost is now much more manageable. Processing 1 object at a time also requires more setup and can't be vectorized efficiently. These problems remain.
A modern gaming CPU such as the 5950X has achievable memory bandwidth of 35 GB/s. At 144 fps that's 35 GB/s / 144 fps = 243 MB/frame. That's how much unique memory you can access in 1 frame.
The 5800X3D has 96 MB of LLC. Almost the whole working set fits in the cache. Thread...
That 243 MB/frame figure is highly optimistic. It assumes that everything you run on the CPU is bandwidth bound, and it assumes you never access the same data twice. It's common to produce data first and consume it later in the frame, so you access the same cache lines twice.
So it's likely that we already have games that fit their whole working set in the 96 MB cache of 5800X3D, assuming the game runs at 144 fps of course. Since games are highly temporally coherent, there's only a few percent change in the working set between the frames.
Subpass test: RGBA8 lit buffer + 3x RGBA8 G-buffers (filled with RGB) + "lighting shader" (= shows a different G-buffer for each 16x16 region) + forward transparency on top. All in one render pass (using Vulkan subpasses).
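A hedged sketch (not the test's actual code) of how that render pass structure looks in Vulkan: subpass 0 fills the G-buffer, subpass 1 reads it as input attachments and writes the lit buffer, subpass 2 blends transparency on top. Attachment descriptions, framebuffer setup and the 1→2 dependency are omitted.

```cpp
#include <vulkan/vulkan.h>

void DescribeSubpasses()
{
    // Attachment 0 = RGBA8 lit buffer, attachments 1-3 = RGBA8 G-buffer.
    VkAttachmentReference gbufferWrite[3] = {
        { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
        { 2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
        { 3, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    };
    VkAttachmentReference gbufferRead[3] = {
        { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
        { 2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
        { 3, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
    };
    VkAttachmentReference litWrite = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

    VkSubpassDescription subpasses[3] = {};
    subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 3;              // G-buffer fill
    subpasses[0].pColorAttachments    = gbufferWrite;

    subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].inputAttachmentCount = 3;              // "lighting shader"
    subpasses[1].pInputAttachments    = gbufferRead;    // reads G-buffer on tile
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments    = &litWrite;

    subpasses[2].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[2].colorAttachmentCount = 1;              // forward transparency
    subpasses[2].pColorAttachments    = &litWrite;

    // BY_REGION keeps the G-buffer -> lighting handoff inside the tile,
    // so nothing needs to be resolved out to main memory between subpasses.
    VkSubpassDependency dep = {};
    dep.srcSubpass      = 0;
    dep.dstSubpass      = 1;
    dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    // These arrays would then go into VkRenderPassCreateInfo::pSubpasses /
    // pDependencies when creating the render pass.
    (void)subpasses; (void)dep;
}
```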
60 fps (low end Androids) and all devices are still cool.
The Xiaomi Redmi phone in the middle doesn't seem to have an FPS counter in its developer menu. Also I can't find a way to increase the display sleep timeout on that device.
Will add G-buffer decals next (blended to 3x MRT) and more transparencies to add overdraw. Hopefully it's still 60 fps at that point.
Then I add loops to the triangle draw calls to simulate increased load from increased draw call counts (g-buffer, decals, transparencies).
I would really like to have VRS on mobile, because it allows you to render at multiple resolutions in the same tile buffer, without having to resolve data to main memory.
On Xbox 360 we experimented with rendering particles using 4xMSAA on top of 1 sample buffer...
We just aliased a 4xMSAA target on top of the 1 sample target in the EDRAM and rendered particles this way. Same for the Z buffer. It kind of worked, but the pattern unfortunately was messy: the 4xMSAA pixels didn't form nice 2x2 quads in screen space. Z test also worked :)
4xMSAA particles were faster than half res particles, and had pixel perfect Z test. Didn't need an extra resolve + combine for half res particles. Nowadays with VRS you can finally do the same in a proper way.
Decima Engine (Horizon Forbidden West) is using the same XOR trick I described a few months ago in my V-buffer SV_PrimitiveID Twitter threads. We of course found it independently.
PC implementation uses SM 6.1 GetAttributeAtVertex.
This technique is good because it works on hardware that doesn't guarantee leading vertex order. XOR is order independent.
I measured that the performance of this technique was identical to baseline (no primitive ID), so it's as fast as it gets. Unfortunately it requires SM 6.1.
After that I posted several threads with performance analysis of various V-buffer SV_PrimitiveID approaches. Leading vertex is the best fit for HW that guarantees leading vertex order. Otherwise you want to use the XOR trick (SM 6.1).
Let's talk about storing vertices in optimal memory footprint.
I have seen people using full fat 32 bit floats in their vertex streams, but that's a big waste, especially on mobile. All our 60 fps Xbox 360 games used a 24 byte layout with a mixture of 16 bit and 10 bit data.
The most common optimization is to avoid storing the bitangent and reconstruct it with a cross product. However, it's worth noting that UV mirroring causes the bitangent sign to flip, so you need to store the mirror bit somewhere. RGB10A2 offers you a nice 2 bit alpha for the sign.
10 bits per channel is enough for the normal/tangent/bitangent. Some visually impressive last gen AAA games have shipped with an RGB888 normal in the G-buffer. If that's enough for your mobile game, then RGB10 is enough in your vertex data for all three of these.
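For illustration, a hedged example of a compact layout in this spirit; the exact 24 byte Xbox 360 layout isn't specified in the thread, so the field choices below are my assumption.

```cpp
#include <cstdint>

struct PackedVertex                 // 24 bytes total
{
    uint16_t position[4];           // 8 bytes: 16 bit position (xyz + 1 spare)
    uint16_t uv[2];                 // 4 bytes: 16 bit texcoords
    uint32_t normal;                // 4 bytes: RGB10A2 packed normal
    uint32_t tangent;               // 4 bytes: RGB10A2 packed tangent,
                                    //          2 bit alpha = UV mirror sign
    uint32_t color;                 // 4 bytes: RGBA8 color or other data
};
static_assert(sizeof(PackedVertex) == 24, "keep the layout at 24 bytes");

// The bitangent is not stored. The shader reconstructs it and applies the
// mirror sign from the tangent's 2 bit alpha (conceptually):
//     bitangent = cross(normal, tangent.xyz) * sign;   // sign = +1 or -1
```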