Sebastian Aaltonen
May 9, 2023 · 5 tweets · 2 min read
16 bit unorm is the best vertex position format. I have shipped many games using it. HypeHype will soon use it too (2.5x mem savings).

Precalc model xyz bounds and round them to the next pow2 to ensure that you get zero precision issues with pow2 grid snapping (for kitbashed content).
For 2.5x storage savings, I also store the tangent frame in a more dense way. We will use the same 16 bit UV encoding that Horizon Forbidden West uses.
@Dan87626237 Our solution was to cut the support for bottom 5% hardware. This allowed us to support more modern feature set (such as this texture format). Improves the performance and visuals for the remaining 95%. Makes our new min spec devices actually playable.
@guycalledfrank Of course you use 32 bit floating point for interpolating UVs between VS->PS. 16 bit float interpolants are not enough.
@guycalledfrank fp16 gives you only 10+1+1 = 12 bits of equal precision outside the 1/2^3 area in the center, which is where most of the object vertices tend to be. 16 bits is 16x more precision than 12 bits. Thus 16 bit unorm is the better mesh format by far.

More from @SebAaltonen

Mar 1
It's tempting to give the LLM a MASSIVE system prompt with all the information it needs to perform all the potential task API calls. This way you don't need to think about it, and you ensure there are no extra roundtrips for the LLM to find the information/APIs it needs. The problem is that this bloats the token count significantly.

LLM calls (to the server) are stateless: you need to send the system prompt (and history) again for every tool call so that the LLM knows what it was doing and why. If the system prompt is thousands of lines, those lines are resent for every tool call.

Let's discuss the alternatives for a massive system prompt...
I already discussed flexible/batchable tool interfaces in this post: x.com/SebAaltonen/st…

There's basically no limit to tool flexibility. You can go as far as offering tool APIs to run Python or terminal commands in the system. Search tools are common: instead of the LLM going through your project, it can find the info faster. Flexibility and batchability are of course crucial for cutting down the number of roundtrips and the extra data that needs to be transmitted between the system and the LLM. Similar idea to SQL queries: do the heavy work locally, minimize external communication.
Tree data structures are common in programming. We all know that deep tree structures (such as a binary tree) result in lots of cache misses, since a search goes through many hops. Trees with wider nodes are significantly flatter. We are using a two-level sparse bitmap for our index joins, for example. It's a 64-tree: two levels of access = 4096 elements. A binary tree requires 12 levels for that.
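A two-level 64-way sparse bitmap like the one mentioned above can be sketched as follows (a minimal C sketch of the idea, not the actual implementation; the names are made up):

```c
#include <stdint.h>
#include <string.h>

/* Two levels of 64-bit words cover 64 * 64 = 4096 element slots.
   A binary tree would need 12 levels (2^12 = 4096) for the same reach. */
typedef struct {
    uint64_t top;        /* bit i set => leaf[i] contains at least one set bit */
    uint64_t leaf[64];   /* 64 words * 64 bits = 4096 element bits */
} SparseBitmap4096;

static void sb_init(SparseBitmap4096 *b) { memset(b, 0, sizeof *b); }

static void sb_set(SparseBitmap4096 *b, uint32_t idx) {   /* idx in [0, 4096) */
    b->leaf[idx >> 6] |= 1ull << (idx & 63);
    b->top            |= 1ull << (idx >> 6);
}

static int sb_test(const SparseBitmap4096 *b, uint32_t idx) {
    /* At most two dependent loads: the top word, then one leaf word.
       An empty region early-outs after a single load. */
    if (!(b->top & (1ull << (idx >> 6)))) return 0;
    return (int)((b->leaf[idx >> 6] >> (idx & 63)) & 1);
}
```

The wide fan-out is the point: two cache lines of dependent loads instead of twelve.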

Similarly, when an LLM is searching for documentation or information, you don't want a super deep hierarchy. You could even embed the top level in the system prompt if it's super small, to avoid one extra hop (listing ~10 top-level categories of documentation inside the API spec, for example, instead of having to call an API that lists them). Folder structures and AGENTS.MD files are a bit similar. If you have a super deep structure with lots of info, it takes the LLM a lot more effort to dig through it.

But it's important to avoid wasting extra tokens too. If you print something into the LLM context, you want to use it. It's fine to have a slight bit of extra info related to the topic; maybe that even helps the AI understand it better. But lots of extra info (tokens) adds cost, adds latency, and makes reasoning worse. AI needs to be able to focus too. Unrelated noise is bad.
Nov 2, 2025
Wouldn't this be a lovely hosted server for a hobby proto MMO project? 48 core Threadripper, 256GB RAM, 4TB SSD. 1Gbit/s unlimited.

Should be able to handle 10,000 players just fine. That's a start. 1Gbit/s = 100MB/s. 10KB/s send+receive for each player. = great!
I was talking about 100,000 players before, but that's an aspirational goal for a real MMO game with paid customers. 10,000 players is a fine starting point for prototyping. It will be difficult to get even that many players, even if it's a free web game (no download).
10k players' data replicated to 10k players = 100M player states sent. At 100MB/s send bandwidth this means 1 byte per player pair on average per second. That's more than enough with a great compressor. Netflix's video compressor uses ~0.1 bits per pixel.
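The back-of-envelope math above, restated in code (just the thread's own numbers; the function name is made up):

```c
#include <stdint.h>

/* Replicating n player states to n players is O(n^2) pair updates.
   With a fixed uplink, the average per-pair budget is uplink / n^2
   bytes per second. */
static double bytes_per_pair_per_sec(double uplink_bytes_per_sec,
                                     uint64_t players) {
    return uplink_bytes_per_sec / ((double)players * (double)players);
}
```

At 100MB/s and 10k players that budget is exactly 1 byte per pair per second; at 100k players it collapses to ~0.01 bytes, which is why the 100k case needs a fixed per-player budget and aggressive compression.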
Nov 1, 2025
It's depressing that software engineering mostly wastes hardware advances on making programming "easier" and "cheaper" = sloppy code. Every 2 decades we get 1000x faster hardware (Moore).

I'd like to see real improvements, like 1000x more players in multiplayer:
If people still wrote code as optimally as Carmack, I, and others did in the late 90s, we could achieve things that people today think are not even possible. Those things are not impossible to achieve if we really want them. And that's why I think I need to do this hobby project too.
We wrote a real-time MP game for Nokia N-Gage: in-order 100MHz CPU, no FPU, no GPU, 16MB RAM, 2G GPRS modem with 1 second latency between players. We had rollback netcode (one of the first). We just have to think outside the box to make it happen. Why is nobody doing it anymore?
Nov 1, 2025
I've been thinking about a 100,000 player MMO recently (1 server, 1 world) with fully distributed physics (a bit like parallel GPGPU physics). Needs a very good predictive data compressor. Ideas can be borrowed from video compressors. 4K = 8 million pixels. I have only 100k...
100k players sending their state to the server is not a problem. That's O(n). The big problem is updating every other player's state to every player. That's O(n^2), and at 100k players that's 100k*100k = 10G. The server obviously can't send 10G player-state updates at an acceptable rate.
There must be a fixed budget per player. Otherwise the server will choke. This is similar to fixed bandwidth rate in the video compressors. If there's too much hard to compress new information, then the quality automatically drops.
Oct 23, 2025
AI-generated C is the real deal. C coders wrote fast & simple code: no high-frequency heap allocs, no abstractions slowing the compiler down. Lots of good C example code around. AI workflows need a language with fast iteration time. Why waste compile time and perf on modern languages?
If you generate C++ with AI, it will use smart pointers and short-lived temp std::vectors and std::strings like all slow C++ code bases do. Lots of tiny heap allocs. Idiomatic Rust is slightly better, but idiomatic Rust still means a lot more heap allocs than C. It's so easy.
And why would you even think about generating Python with AI? Why would you choose a 100x slower language if AI is writing it instead of you? Same applies to Javascript and other common internet languages. Just generate C and compile to WASM. Nothing runs faster.
Oct 18, 2025
Let's discuss why I think 4x4x4 tree is better than 2x2x2 (oct) tree for voxel storage.

It all boils down to link overhead and memory access patterns. L1$ hit rate is the most important thing for GPU performance nowadays.

Thread...
2x2x2 = uint8. That's one byte. Link = uint32 = 4 bytes. Total = 5 bytes.

4x4x4 = uint64. That's 8 bytes. Total (with link) = 12 bytes.

4x4x4 tree is half as deep as 2x2x2 tree. You get up/down the tree twice as fast.
The voxel mask (in non-leaf nodes) tells us which children are present. You can do a popcnt (a full-rate instruction on GPUs) to count the children. Children are allocated next to each other. A child's sub-address can be calculated with a binary prefix sum (= AND prev bits + popcnt).
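The child addressing described above looks roughly like this (a minimal C sketch; on the GPU the popcount would be the hardware popcnt instruction, here the GCC/Clang builtin stands in):

```c
#include <stdint.h>

/* 'mask' has one bit per potential child of a 4x4x4 node (64 bits).
   Present children are stored contiguously, so a child's storage slot
   is the number of set bits below its bit: a binary prefix sum. */
static int child_present(uint64_t mask, uint32_t child) {  /* child in [0, 64) */
    return (int)((mask >> child) & 1);
}

static uint32_t child_slot(uint64_t mask, uint32_t child) {
    uint64_t below = mask & ((1ull << child) - 1ull); /* bits before 'child' */
    return (uint32_t)__builtin_popcountll(below);     /* full-rate popcnt on GPUs */
}
```

Because the slot is computed from the mask alone, the node needs only one child pointer (to the first child) instead of 64, which is where the 12-bytes-per-node figure comes from.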