The GE8320 Z-acne issue is likely caused by the 24-bit Z buffer plus some missing Z bias state in my code, which makes the shadow maps produce acne against the surface. Will investigate that. GE8320 doesn't support 32+8 Z+stencil. Have to use 24+8.
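For context, the missing Z bias state would be something along these lines in Vulkan terms (a hedged sketch; the bias values and the dynamic-state setup are placeholders, not my actual configuration):

```cpp
#include <vulkan/vulkan.h>

// Setting a depth bias for the shadow map pass. Requires the pipeline to
// enable VK_DYNAMIC_STATE_DEPTH_BIAS. Values here are only placeholders.
void setShadowDepthBias(VkCommandBuffer cmd)
{
    vkCmdSetDepthBias(cmd,
                      1.25f,  // constant bias, in units of the smallest depth delta (coarser with 24-bit Z)
                      0.0f,   // bias clamp (0 = unclamped)
                      1.75f); // slope-scaled bias for surfaces at glancing angles
}
```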
Will do proper profiling later with actual GPU tools. I just got Android working today, and I'm running initial tests to see that things work correctly. I am glad that there are no perf regressions on low end.
Fixed the GE8320 issue. It was actually an interpolation precision issue with the material ID. Fixed it and it renders correctly. No perf impact.
Currently I invalidate my handles (bump the generation index) when the gfx resource is destroyed. The resource itself is put into a delete queue, waiting until the GPU has finished that frame.
I am considering deferring the handle invalidation too...
Currently I push all passes and draw commands to big queues and these queues are processed later. Will be done in threads. It's convenient to be able to delete resources (render targets, passes, textures, buffers) before the handles are dereferenced.
Currently you can't delete the resources (handles) immediately, as the deferred rendering will try to deref the handles and get null, since the generation bits no longer match. But the resource itself is still there, because of the deferred deletion.
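Roughly, the setup looks something like this (a minimal sketch with hypothetical names, not the actual HypeHype code):

```cpp
#include <cstdint>
#include <vector>

static constexpr uint32_t kMaxFramesInFlight = 3;

struct TextureHandle { uint32_t index; uint32_t generation; };

struct TexturePool
{
    struct Slot { void* resource = nullptr; uint32_t generation = 0; };
    std::vector<Slot>  slots;
    std::vector<void*> deleteQueue[kMaxFramesInFlight]; // freed once the GPU finishes that frame

    void* resolve(TextureHandle h) const
    {
        const Slot& s = slots[h.index];
        return (s.generation == h.generation) ? s.resource : nullptr; // stale handle -> null
    }

    void destroy(TextureHandle h, uint32_t frameIndex)
    {
        Slot& s = slots[h.index];
        deleteQueue[frameIndex].push_back(s.resource); // defer the actual delete
        s.resource = nullptr;
        ++s.generation;                                // invalidate existing handles immediately
    }
};
```

Deferring the handle invalidation too would mean bumping the generation only when the delete queue is flushed, so handles stay dereferenceable until the GPU is done with the frame.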
This is what happens when you try to do two refactorings at the same time. I was lazy and tried to save some time :)
The following Metal object is being destroyed while still required to be alive by the command buffer 0x134808e00: [...] label = CAMetalLayer Drawable
Everything works fine on Metal. It's just that the Vulkan backend (running here on top of MoltenVK) broke. I also merged all my Vulkan changes that made it run on Android phones AND refactored the draw stream bindings. You get what you ask for :)
Now I have a separate API for starting the main display pass. It returns the swap chain render pass handle and a command buffer handle.
Thread...
On Metal, starting the main display pass acquires the drawable. Acquiring the drawable causes a CPU stall if a swap chain buffer is not available, so you need to do it as late as possible. This way the offscreen passes can be pushed to the GPU before the CPU stall.
Presenting the display is also a command in Metal. It needs to be pushed to the command buffer. Now I don't have a separate present API anymore. The main command buffer submit writes a present command at the end of the command buffer automatically.
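The API shape is roughly like this (a sketch with invented names, not the real interface):

```cpp
#include <cstdint>

// Placeholder handle types for illustration.
struct RenderPassHandle    { uint32_t index = 0; uint32_t generation = 0; };
struct CommandBufferHandle { uint32_t index = 0; uint32_t generation = 0; };

struct MainDisplayPass
{
    RenderPassHandle    swapChainPass; // render pass targeting the acquired drawable
    CommandBufferHandle commands;      // command buffer the display pass records into
};

// Called as late as possible in the frame. On the Metal backend this is where
// the drawable gets acquired, which may stall the CPU if no swap chain buffer
// is available yet -- so offscreen passes should already be in flight.
MainDisplayPass beginMainDisplayPass();

// Submitting the main command buffer appends the present command at the end,
// so there is no separate present() entry point anymore.
void submitMainCommandBuffer(CommandBufferHandle commands);
```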
But my implementation is dead simple. I have a fence between all render passes. I allow the next render pass to run its vertex shaders before the previous pass finishes. This is the biggest optimization you want on mobile TBDR GPUs.
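In Vulkan terms that overlap maps roughly to a barrier like this (a sketch of the standard TBDR trick, not necessarily how my backend expresses it):

```cpp
#include <vulkan/vulkan.h>

// The previous pass's color attachment writes only need to be visible to the
// next pass's *fragment* stage, so the next pass's vertex work is free to
// overlap with the previous pass's fragment work -- the key win on TBDR GPUs.
void barrierBetweenPasses(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // src: wait for the previous pass's color writes
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // dst: only fragment reads wait; vertex shaders run early
        0,                                             // no dependency flags
        1, &barrier,
        0, nullptr,
        0, nullptr);
}
```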
But this is not yet shippable. Currently HypeHype doesn't sample render target textures in the vertex shader, but somebody could implement a shader like that and my dead simple fence implementation would fail.
I guess it's time for another uint64 bitfield for used render targets. Store that in bind groups and render passes. I could have an array of 64 fences. Update a fence at the end of a render pass for each RT whose store op != don't care. Wait on it when the RT is used the next time.
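Something along these lines (a rough, untested sketch; the fence types and backend calls are placeholders):

```cpp
#include <bit>
#include <cstdint>

static constexpr uint32_t kMaxTrackedRenderTargets = 64;

// Placeholder types and backend entry points for illustration only.
struct FenceHandle         { uint32_t index = 0; };
struct CommandBufferHandle { uint32_t index = 0; };
void signalFence(CommandBufferHandle cmd, FenceHandle fence);
void waitFence(CommandBufferHandle cmd, FenceHandle fence);

struct RenderPassDesc { uint64_t writtenRenderTargets = 0; }; // bits set for RTs with store op != don't care
struct BindGroupDesc  { uint64_t sampledRenderTargets = 0; }; // bits set for RTs sampled by the bind group

struct RenderTargetFences
{
    FenceHandle fences[kMaxTrackedRenderTargets];

    // End of render pass: signal a fence for every RT that was actually stored.
    void signalWritten(CommandBufferHandle cmd, uint64_t writtenMask)
    {
        while (writtenMask)
        {
            uint32_t bit = std::countr_zero(writtenMask);
            signalFence(cmd, fences[bit]);
            writtenMask &= writtenMask - 1; // clear the lowest set bit
        }
    }

    // Before work that samples RTs: wait on the fence of every RT it reads.
    void waitSampled(CommandBufferHandle cmd, uint64_t sampledMask)
    {
        while (sampledMask)
        {
            uint32_t bit = std::countr_zero(sampledMask);
            waitFence(cmd, fences[bit]);
            sampledMask &= sampledMask - 1;
        }
    }
};
```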
Based on feedback, it seems that nobody is complaining about the allocator algorithmic details or code clarity, but people are complaining about these two delete[] calls :)
Not going to include the std::unique_ptr header just to remove two lines of trivial code.
Yes, I know that I must implement a move constructor. That's an extra 4 lines of code.
But this code isn't even tested yet. Going to do that once I have written a test suite and fixed all the bugs. Those things have to wait.
I have had bad experiences using single header libraries that lean a lot on std headers. Magic enum, phmap and similar add a massive cost to compile time due to their dependency on complex std headers. I made HypeHype's compile time 2x faster last summer by cutting std dependencies.
This is the memory allocator comparison paper I mentioned in my threads. My allocator should be similar to TLSF since I also use a two-level bitfield and a floating point distribution for the bins. But I don't tie the bitfield levels 1:1 to the float mantissa/exponent.
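The floating point bin distribution idea looks roughly like this (my own sketch of the general technique, not the allocator's exact code): the top bits act like an exponent and the next few bits like a mantissa, so small sizes get exact bins and large sizes get logarithmically spaced bins.

```cpp
#include <bit>
#include <cstdint>

static constexpr uint32_t MANTISSA_BITS  = 3;
static constexpr uint32_t MANTISSA_VALUE = 1u << MANTISSA_BITS;

// Map an allocation size to a bin index, rounding up so the allocation
// always fits in the bin's free blocks.
uint32_t sizeToBinRoundUp(uint32_t size)
{
    if (size < MANTISSA_VALUE)
        return size; // small sizes map 1:1 to bins ("denormals")

    uint32_t exp      = std::bit_width(size) - 1;                      // position of the highest set bit
    uint32_t mantissa = (size >> (exp - MANTISSA_BITS)) & (MANTISSA_VALUE - 1);
    uint32_t bin      = ((exp - MANTISSA_BITS + 1) << MANTISSA_BITS) | mantissa;

    // If any bits below the mantissa were dropped, bump to the next bin.
    uint32_t droppedMask = (1u << (exp - MANTISSA_BITS)) - 1;
    if (size & droppedMask)
        ++bin;
    return bin;
}
```

With 3 mantissa bits the worst-case rounding waste per allocation stays around 1/8 of the size, and the bin index splits naturally into a two-level bitfield lookup.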
I haven't read the TLSF paper or implementation, so they might have some additional tricks that I didn't implement. Also, my allocator doesn't embed the metadata in the allocated memory, since my allocator is not a memory allocator, it's an offset allocator. No backing memory.
This means that I must use a separate array to store my nodes and the freelist, which is both a good thing and a bad thing. The good thing is that you can use this to allocate GPU memory or any other type of resource that requires sequential slot allocation.
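The external node storage is roughly this shape (an assumed layout for illustration, not the actual implementation):

```cpp
#include <cstdint>
#include <vector>

// One node per free or used range. No headers inside the managed memory:
// all metadata lives in this flat array, linked by indices instead of pointers.
struct OffsetAllocNode
{
    uint32_t offset       = 0;
    uint32_t size         = 0;
    uint32_t binListNext  = 0xFFFFFFFF; // next node in the same size bin
    uint32_t binListPrev  = 0xFFFFFFFF;
    uint32_t neighborNext = 0xFFFFFFFF; // physically adjacent ranges, for merging on free
    uint32_t neighborPrev = 0xFFFFFFFF;
    bool     used         = false;
};

struct NodeStorage
{
    std::vector<OffsetAllocNode> nodes;           // preallocated, fixed capacity
    std::vector<uint32_t>        freeNodeIndices; // freelist of unused node slots

    explicit NodeStorage(uint32_t capacity)
    {
        nodes.resize(capacity);
        freeNodeIndices.resize(capacity);
        for (uint32_t i = 0; i < capacity; ++i)
            freeNodeIndices[i] = capacity - 1 - i; // lowest index popped first
    }

    uint32_t acquireNode()                        // assumes capacity is not exceeded
    {
        uint32_t index = freeNodeIndices.back();
        freeNodeIndices.pop_back();
        return index;
    }

    void releaseNode(uint32_t index)
    {
        nodes[index] = OffsetAllocNode{};
        freeNodeIndices.push_back(index);
    }
};
```

Because allocations are identified by a node index and an offset rather than a pointer, the same allocator can sub-allocate a GPU buffer, a descriptor heap, or anything else addressed by offset.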