Let's talk about rendering a massive set of cubes efficiently...
Geometry shaders and instanced draws are suboptimal choices. A geometry shader outputs strips (a suboptimal topology) and needs GPU storage and load balancing. Instanced draw is slow on many older GPUs for tiny 8-vertex / 12-triangle instances.
The most efficient way to render a large number of procedural cubes on most GPUs is the following: at startup, fill an index buffer for the maximum number of cubes, 36 indices each (6 faces * 2 triangles * 3 indices; index data = 0..7 + i*8). Never modify this index buffer at runtime.
No vertex buffer. Use SV_VertexID in the shader. Divide it by 8 (bit shift) to get the cube index (to fetch the cube position from an array). The low 3 bits are the XYZ corner bits (see OP image). LocalVertexPos = float3(X*2-1, Y*2-1, Z*2-1). This is just a few ALU ops in the vertex shader.
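A minimal C++ sketch of the same math (the function names and the corner/triangle table are illustrative, not from the original code; the table's winding would need to match your backface convention):

```cpp
#include <cstdint>

struct Float3 { float x, y, z; };

// Decode the flat vertex id, mirroring the SV_VertexID math above.
inline uint32_t cubeIndex(uint32_t vertexId) { return vertexId >> 3; } // divide by 8 via bit shift

inline Float3 localVertexPos(uint32_t vertexId)
{
    // Low 3 bits are the XYZ corner bits; map 0/1 -> -1/+1.
    float x = float(vertexId & 1)        * 2.0f - 1.0f;
    float y = float((vertexId >> 1) & 1) * 2.0f - 1.0f;
    float z = float((vertexId >> 2) & 1) * 2.0f - 1.0f;
    return { x, y, z };
}

// Illustrative 12-triangle corner table over the 8 corners (2 triangles
// per face). Winding here is topological only.
static const uint32_t kCubeIndices[36] = {
    0,2,1, 1,2,3,  4,5,6, 5,7,6,  0,1,4, 1,5,4,
    2,6,3, 3,6,7,  0,4,2, 2,4,6,  1,3,5, 3,7,5
};

// Fill the static index buffer once at startup: 36 indices per cube,
// each in the range [i*8, i*8+7]. Never touched again at runtime.
void fillIndexBuffer(uint32_t* out, uint32_t maxCubes)
{
    for (uint32_t i = 0; i < maxCubes; i++)
        for (uint32_t j = 0; j < 36; j++)
            out[i * 36 + j] = i * 8 + kCubeIndices[j];
}
```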
Use an indirect draw call to control the drawn cube count from the GPU side. Write the (visible) cube data (position, transform matrix, or whatever you need) to an array (UAV). This array is indexed in the vertex shader (SRV).
There's an additional optimization: only 3 faces of a cube can be visible at once. Instead, generate only 18 indices per cube (3 faces * 2 triangles * 3 indices), covering the positive-corner faces. Calculate the vector from the cube center to the camera. Extract the XYZ signs. Flip the XYZ of the output vertices accordingly...
If you flip an odd number of vertex coordinates, the triangle winding gets flipped. Thus you need to fix it (read the triangle lookup using "2-i") when you have 1 or 3 flips. This is just a few extra ALU ops too. Result: you render 3 faces per cube instead of 6, saving 50% of the triangle count.
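Sketched in C++ (names are hypothetical), the flip plus winding fix could look like this:

```cpp
#include <cstdint>

struct Float3 { float x, y, z; };

// Mirror the positive-corner vertices toward the camera: flip each axis
// where the cube-to-camera vector is negative, counting the flips.
inline Float3 flipTowardCamera(Float3 p, Float3 cubeToCamera, uint32_t& flipCount)
{
    flipCount = 0;
    if (cubeToCamera.x < 0.0f) { p.x = -p.x; flipCount++; }
    if (cubeToCamera.y < 0.0f) { p.y = -p.y; flipCount++; }
    if (cubeToCamera.z < 0.0f) { p.z = -p.z; flipCount++; }
    return p;
}

// With 1 or 3 flips the winding is mirrored: read the 3 triangle
// corners in reverse order ("2 - i") to restore front-facing winding.
inline uint32_t triangleCorner(uint32_t i, uint32_t flipCount)
{
    return (flipCount & 1) ? (2 - i) : i;
}
```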
The index buffer is still fully static even with this optimization. It's best to keep the index numbering at 0..7 (8 per cube) so you can use a bit shift instead of an integer divide (which is slow). GPUs deduplicate indices, so the extra unused "slots" don't cost anything.
Discard in a shader seems innocent, but it forces the GPU driver into expensive fallback paths. Even if you branch around the discard, the driver must be prepared for it, because it doesn't know the runtime state.
If you run all shaders that could perform discard last, you guarantee that the rest of the scene doesn't suffer from worse Z-compression / early-Z / Hi-Z performance. And TBDR GPUs don't need to do extra partial tile evaluations.
Our new code base uses my Hyper RHI directly in user-land code. It's pretty clean: struct/span-based APIs with good defaults. No heap allocations (initializer lists live on the stack for the duration of the function call).
This is how the G-buffer pass looks currently:
gpu_temp_allocator (and dynamicBindings.allocate) bump-allocate persistently mapped GPU memory: a CPU pointer directly to GPU VRAM (PCIe ReBAR or UMA). Uniforms are written directly to VRAM. No copies at all. The draw stream contains only bind group handles and PSO handles (32-bit).
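A minimal sketch of such a bump allocator over a persistently mapped buffer (hypothetical; not the actual Hyper RHI API). `base` would be the CPU pointer returned by the mapping, so writes land directly in VRAM with no staging copy:

```cpp
#include <cstdint>
#include <cstddef>

struct GpuTempAllocator
{
    uint8_t* base;      // persistently mapped CPU pointer into VRAM
    size_t   capacity;  // frame budget in bytes
    size_t   offset = 0;

    // Returns a CPU-writable pointer and the matching GPU buffer offset.
    // `alignment` must be a power of two (e.g. the backend's minimum
    // uniform buffer offset alignment).
    void* allocate(size_t size, size_t alignment, size_t* gpuOffset)
    {
        size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > capacity) return nullptr; // out of frame memory
        offset = aligned + size;
        *gpuOffset = aligned;
        return base + aligned;
    }

    void resetFrame() { offset = 0; } // reclaim everything each frame
};
```

A real version would keep one such allocator per in-flight frame so the GPU never reads memory the CPU is rewriting.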
Vulkan 1.3 + Metal 2.0 + WebGPU all use dynamic rendering. No persistent render pass handles. The render backends have a zero-hashmap policy, except for Vulkan 1.1 render passes, which must be persistent, so there's a tiny hash map for those. One lookup per pass = fine.
It's tempting to give the LLM a MASSIVE system prompt with all the information it needs to perform every potential task API call. This way you don't need to think about it, and you ensure there are no extra roundtrips for the LLM to find the information/APIs it needs. The problem is that this bloats the token count significantly.
LLM calls (to the server) are stateless: you need to send the system prompt (and history) again for every tool call so that the LLM knows what it was doing and why. If the system prompt is thousands of lines, those lines are resent for every tool call.
Let's discuss the alternatives for a massive system prompt...
I already discussed flexible/batchable tool interfaces in this post: x.com/SebAaltonen/st…
There's basically no limit to tool flexibility. You can go as far as offering tool APIs to run Python or terminal commands in the system. Search tools are common: instead of the LLM going through your project itself, it can find the info faster. Flexibility and batchability are of course crucial for cutting down the number of roundtrips and the extra data that needs to be transmitted between the system and the LLM. It's a similar idea to SQL queries: do the heavy work locally, minimize external communication.
Tree data structures are common in programming. We all know that deep tree structures (such as a binary tree) result in lots of cache misses, since a search goes through many hops. Trees with wider nodes are significantly flatter. We use a two-level sparse bitmap for our index joins, for example. It's a 64-ary tree: two level accesses cover 64*64 = 4096 elements. A binary tree requires 12 levels for that.
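A sketch of such a two-level 64-ary sparse bitmap over 4096 elements (hypothetical layout, not the actual index-join code): the top-level word marks which 64-bit leaf words are non-empty, so a membership test touches at most two cache lines, with no pointer chasing.

```cpp
#include <cstdint>

struct SparseBitmap4096
{
    uint64_t top = 0;        // bit i set -> leaf[i] is non-zero
    uint64_t leaf[64] = {};

    void set(uint32_t index)   // index in [0, 4096)
    {
        uint32_t hi = index >> 6, lo = index & 63;
        leaf[hi] |= 1ull << lo;
        top      |= 1ull << hi;
    }

    bool test(uint32_t index) const
    {
        uint32_t hi = index >> 6, lo = index & 63;
        return (leaf[hi] >> lo) & 1;   // two loads, fixed depth
    }
};
```

Iteration gets the same benefit: scan the set bits of `top` (e.g. with count-trailing-zeros) and only visit the non-empty leaves.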
Similarly, when an LLM is searching for documentation or information, you don't want a super deep hierarchy. You could even embed the top level in the system prompt if it's super small, to avoid one extra hop (listing the ~10 top-level documentation categories inside the API spec, for example, instead of requiring an API call to list them). Folder structures and AGENTS.MD files are a bit similar. If you have a super deep structure with lots of info, it takes the LLM a lot more effort to dig through it.
But it's important to avoid wasting tokens too. If you print something into the LLM context, you want to use it. It's fine to have a little extra info related to the topic; maybe that even helps the AI understand it better. But lots of extra info (tokens) adds cost, adds latency, and makes reasoning worse. AI needs to be able to focus too. Unrelated noise is bad.
Wouldn't this be a lovely hosted server for a hobby proto MMO project? 48 core Threadripper, 256GB RAM, 4TB SSD. 1Gbit/s unlimited.
It should be able to handle 10,000 players just fine. That's a start. 1Gbit/s = ~100MB/s = 10KB/s send + receive for each player. Great!
I was talking about 100,000 players before, but that's an aspirational goal for a real MMO with paying customers. 10,000 players is a fine starting point for prototyping. It will be difficult to even get that many players, even for a free web game (no download).
10k players' data replicated to 10k players = 100M player states sent. At 100MB/s send bandwidth, that's 1 byte per player pair per second on average. That's more than enough with a great compressor. Netflix's video compressor uses ~0.1 bits per pixel.
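The back-of-the-envelope math above, as a tiny helper (hypothetical, just to check the numbers): n players each receiving state about every other player, under a fixed total send budget.

```cpp
#include <cstdint>

// Average send budget per (sender, receiver) player pair per second,
// given n players and a total server send bandwidth in bytes/s.
constexpr double bytesPerPairPerSecond(uint64_t players, double sendBytesPerSec)
{
    return sendBytesPerSec / (double(players) * double(players));
}
```

At 10k players and 100MB/s this gives 1 byte per pair per second; at the aspirational 100k players it drops to 0.01 bytes (~0.08 bits), which is why a very good predictive compressor is needed.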
It's depressing that software engineering mostly wastes the hardware advances on making programming "easier" and "cheaper" = sloppy code. Every 2 decades we get ~1000x faster hardware (Moore's law).
I'd like to see real improvements, like 1000x more players in MP:
If people still wrote code as optimally as Carmack, I, and others did in the late '90s, we could achieve things that people today think aren't even possible. Those things are not impossible if we really want them. And that's why I think I need to do this hobby project too.
We wrote a real-time MP game for the Nokia N-Gage: an in-order 100MHz CPU, no FPU, no GPU, 16MB RAM, a 2G GPRS modem with 1 second of latency between players. We had rollback netcode (one of the first). We just have to think outside the box to make it happen. Why is nobody doing that anymore?
I've been thinking about a 100,000 player MMO recently (1 server, 1 world) with fully distributed physics (a bit like parallel GPGPU physics). Needs a very good predictive data compressor. Ideas can be borrowed from video compressors. 4K = 8 million pixels. I have only 100k...
100k players sending their state to the server is not a problem; that's O(n). The big problem is updating every other player's state to every player. That's O(n^2), and at 100k players that's 100k * 100k = 10G. The server obviously can't send 10G player state updates at an acceptable rate.
There must be a fixed budget per player; otherwise the server will choke. This is similar to the fixed bitrate in video compressors: if there's too much hard-to-compress new information, the quality automatically drops.
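A sketch of that fixed-budget replication idea (all names hypothetical): each tick, fill a per-player packet up to a hard byte budget, taking the most relevant updates first; whatever doesn't fit is dropped or deferred, exactly like a constant-bitrate video encoder degrading quality under load.

```cpp
#include <cstdint>
#include <cstddef>

struct Update
{
    uint32_t playerId;   // whose state this update describes
    uint32_t sizeBytes;  // compressed size of the update
    float    priority;   // relevance to the receiver (distance, recency, ...)
};

// Assumes `updates` is already sorted by descending priority.
// Returns how many leading updates fit within `budgetBytes`.
size_t fillPacket(const Update* updates, size_t count, uint32_t budgetBytes)
{
    uint32_t used = 0;
    size_t   n = 0;
    for (; n < count; n++) {
        if (used + updates[n].sizeBytes > budgetBytes) break;
        used += updates[n].sizeBytes;
    }
    return n;
}
```

The budget caps the per-player cost, so total server bandwidth stays O(n) regardless of how much is happening in the world.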