Sebastian Aaltonen
Building a new renderer at HypeHype. Former principal engineer at Unity and Ubisoft. Opinions are my own.
Oct 23 5 tweets 1 min read
AI-generated C is the real deal. C coders wrote fast & simple code: no high-frequency heap allocs, no abstractions slowing the compiler down. There's lots of good C example code around. AI workflows need a language with fast iteration time, so why waste compile time and performance on modern languages?

If you generate C++ with AI, it will use smart pointers and short-lived temp std::vectors and std::strings, like all slow C++ code bases do. Lots of tiny heap allocs. Idiomatic Rust is slightly better, but idiomatic Rust still means a lot more heap allocs than C. It's so easy.
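To make that concrete, here's a hedged sketch (hypothetical helper names; both styles compile as C++): the "idiomatic" version heap-allocates temporaries on every call, while the C-style version works in a caller-provided stack buffer.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// "Idiomatic" C++: temporaries everywhere, each call touches the heap.
std::string joinTags(const std::vector<std::string>& tags) {
    std::string result;                  // heap alloc (beyond SSO)
    for (const std::string& t : tags) {  // every element was its own alloc
        result += t;                     // may reallocate repeatedly
        result += ',';
    }
    return result;
}

// C-style: caller provides the storage, zero heap traffic in the hot path.
size_t joinTagsC(const char* const* tags, size_t count, char* out, size_t outSize) {
    size_t written = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t len = strlen(tags[i]);
        if (written + len + 1 >= outSize) break;  // stay inside the buffer
        memcpy(out + written, tags[i], len);
        written += len;
        out[written++] = ',';
    }
    out[written] = '\0';
    return written;
}

int main() {
    std::vector<std::string> tags = {"voxel", "gpu", "renderer"};
    printf("%s\n", joinTags(tags).c_str());       // several heap allocs

    const char* ctags[] = {"voxel", "gpu", "renderer"};
    char buffer[64];
    joinTagsC(ctags, 3, buffer, sizeof(buffer));  // none
    printf("%s\n", buffer);
}
```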
Oct 18 17 tweets 4 min read
Let's discuss why I think a 4x4x4 tree is better than a 2x2x2 (oct) tree for voxel storage.

It all boils down to link overhead and memory access patterns. L1$ hit rate is the most important thing for GPU performance nowadays.

Thread... 2x2x2 = uint8. That's one byte. Link = uint32 = 4 bytes. Total = 5 bytes.

4x4x4 = uint64. That's 8 bytes. Total (with link) = 12 bytes.

A 4x4x4 tree is half as deep as a 2x2x2 tree. You traverse up/down twice as fast.
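A minimal sketch of the two node layouts described above (field names are mine):

```cpp
#include <cstdint>

// 2x2x2 node: 8 occupancy bits + one 32-bit child link = 5 bytes of payload
// (a plain struct pads this to 8; packed storage keeps it at 5).
struct OctNode {
    uint8_t  childMask;   // 1 bit per child, 2*2*2 = 8
    uint32_t firstChild;  // index of the first allocated child
};

// 4x4x4 node: 64 occupancy bits + one 32-bit child link = 12 bytes of payload
// (a plain struct pads this to 16; packed storage keeps it at 12).
struct Node4 {
    uint64_t childMask;   // 1 bit per child, 4*4*4 = 64
    uint32_t firstChild;  // index of the first allocated child
};
```

One 4x4x4 node covers the same volume as two 2x2x2 levels (8 * 8 = 64 cells), so you pay the 4-byte link half as often and chase half as many pointers per descent.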
Oct 18 4 tweets 1 min read
People often think voxels take a lot of storage. Let's compare a smooth terrain. Height map vs voxels.

Height map: 16bit 8192x8192 = 128MB

Voxels: 4x4x4 brick = uint64 = 8 bytes. We need 2048x2048 bricks to cover the 8k^2 terrain surface = 32MB. SVO/DAG upper levels add <10%.

The above estimate is optimistic. If we have rough terrain, we end up with two bricks on top of each other in most places, so we have 64MB worth of leaf bricks. SVO/DAG upper levels don't increase much (as we use shared child pointers). Total is <70MB. Still a win.
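A quick sanity check of that arithmetic (a sketch; sizes in binary MB):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Height map: 8192 x 8192 samples, 16 bits (2 bytes) each.
    const uint64_t heightMapBytes = 8192ull * 8192ull * 2ull;

    // Voxels: one 4x4x4 brick = uint64 occupancy mask = 8 bytes.
    // 8192 / 4 = 2048, so 2048 x 2048 bricks cover the 8k^2 surface.
    const uint64_t brickBytes = 2048ull * 2048ull * 8ull;

    printf("height map : %llu MB\n", (unsigned long long)(heightMapBytes >> 20)); // 128
    printf("leaf bricks: %llu MB\n", (unsigned long long)(brickBytes >> 20));     // 32
}
```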
Oct 11 10 tweets 3 min read
"It's much faster, performance and to load things"

Zuck's reasoning started with performance. Performance matters. Google Maps won because of performance; Nokia lost because Symbian OS wasn't designed for real-time systems (touchscreens need that). Unity has to wake up.

Performance was also the main reason Unity's Weta Digital acquisition failed. You can't just buy a movie company and believe that real-time game engine + movies = real-time movies. A massive amount of optimization and architecture refactoring work is required to make it happen.
Sep 30 13 tweets 3 min read
When you design data structures, always think in cache lines (64B or 128B). You don't want tiny nodes scattered around memory. Often it's better to have wider nodes (preferably one cache line each) and shallower structures: fewer pointer/offset indirections.

When implementing spatial data structures, you also need to think about spatial locality. If you have early outs, consider embedding the early-out conditions directly into the spatial data structure as a bitfield, instead of fetching the objects before the early out.
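A minimal sketch of that idea (layout and field names are mine, not any particular engine's): a node sized to one 64B cache line, carrying the early-out occupancy bits next to the child links, so traversal can reject an empty region using the line it already fetched.

```cpp
#include <cstdint>

// One node per 64B cache line: the early-out mask and the child links
// travel together, so a lookup costs exactly one line fill.
struct alignas(64) GridNode {
    uint64_t occupancy;     // 1 bit per 4x4x4 cell; 0 = empty, early out
    uint32_t children[14];  // offsets/indices of child nodes or leaf payloads
};
static_assert(sizeof(GridNode) == 64, "keep the node on one cache line");

// The early out touches only the bitfield, never the child data.
inline bool cellOccupied(const GridNode& node, uint32_t cellIndex) {
    return (node.occupancy >> cellIndex) & 1ull;
}
```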
Sep 28 17 tweets 4 min read
What do you guys do when you get a sudden urge to build a 100,000 player MMO with a fully destructible world? You do the math showing it could run on a single 128-core Threadripper server (everybody in the same world) with your crazy netcode ideas and super well optimized C code...

Before Claybook we had a multiplayer prototype with a 1GB SDF world state modified with commands (deterministic static world state, non-deterministic dynamic object state). Tiny network traffic. Could easily scale this idea to a 1TB world (2TB RAM on that Threadripper)...
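A hedged sketch of the "world state modified with commands" idea (all types and fields here are hypothetical, not the actual Claybook prototype format): clients send tiny edit commands, and every machine applies them in the same tick order, so the static world state stays identical everywhere without ever being transferred.

```cpp
#include <cstdint>
#include <vector>

// One destructive edit: carve a sphere out of the SDF world.
// 16 bytes per command, so even thousands of edits per second are tiny traffic.
struct CarveSphereCmd {
    uint32_t tick;     // simulation tick the edit applies on
    int16_t  x, y, z;  // quantized world position
    uint16_t radius;   // quantized radius
    uint32_t actorId;  // who did it (gameplay only, not needed for determinism)
};

struct SdfWorld {
    // ~1GB of distance-field bricks would live here.
    void carveSphere(int16_t x, int16_t y, int16_t z, uint16_t radius) {
        (void)x; (void)y; (void)z; (void)radius;  // deterministic edit (stub)
    }
};

// Every machine applies the same commands in the same order, so the static
// world state stays bit-identical without ever sending the world itself.
void applyCommands(SdfWorld& world, const std::vector<CarveSphereCmd>& cmds) {
    for (const CarveSphereCmd& c : cmds)
        world.carveSphere(c.x, c.y, c.z, c.radius);
}
```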
Sep 26 6 tweets 1 min read
Finland was a key tech player 20 years ago: We invented the SSH and IRC protocols. Nokia was the EU's most valuable company, selling more phones yearly than Apple and Samsung sell today combined. We invented the OS that runs most internet servers today. Nokia failed and Linux is free...

Finland has some new successes: Wolt is the biggest EU food delivery service, Oura was the first health ring, and Silo AI is one of the EU's biggest AI companies. Wolt got sold to DoorDash ($3.5B), Silo AI got sold to AMD ($665M). Oura is still an $11B Finnish company.
Sep 19 20 tweets 4 min read
I have realized that there aren't that many people out there who understand the big picture of modern GPU hardware and all the APIs: Vulkan 1.4 with the latest extensions, DX12 SM 6.6, Metal 4, OpenCL and CUDA. What is the hardware capable of? What should a modern API look like?

My "No Graphics API" blog post will discuss all of this. My conclusion is that Metal 4.0 is actually closest to the goal. It has flaws too. DX12 SM 6.6 doesn't have those particular flaws, but has a lot of other flaws. Vulkan has all the flaws combined, with useful extensions :)
May 30 4 tweets 1 min read
The past decades have been a wonderful time for gamers + devs. The biggest chips, using the latest nodes and trillions worth of R&D, were all targeted at gaming. Now those chips are needed by professionals (AI). We'll never see a big-die GPU at a reasonable price point anymore :(

The fun lasted for a very long time, but it's over on both the CPU and GPU side. The biggest CPU and GPU dies are no longer designed for gamers. A top-end Threadripper costs over $10k today. A top-end Nvidia B200 costs over $30k. A few generations ago, top-tier HW was targeting gamers :(
May 18 9 tweets 2 min read
Unit tests have lots of advantages, but cons are ignored:
- Code must be split into testable parts, often requiring more interfaces, which add code bloat and complexity.
- Each call site is a dependency. Test case = +1 dependency. Added inertia to refactor and throw away code.
- Bloated unit test suites take several hours to execute. This slows down devs and causes merge conflicts as pushes are delayed.
- Unstable tests randomly failing pushes.
- Unit test maintenance and optimization are needed to keep the test suite manageable. Otherwise developer velocity suffers.
May 7 5 tweets 2 min read
When you split a function into N small functions, the reader also suffers multiple "instruction cache" misses (similar to a CPU executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.

Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice there. But often functions like that are a code smell: why do you need those branches? Why is the function doing so many unrelated things? Maybe it's too generic? Refactor?
Mar 2 27 tweets 5 min read
Lately there's been a lot of discussion about modern visuals and their trade-offs. @digitalfoundry's video is a good overview of this debate. Thread about my thoughts...

I recently finished FF7 Remake and have 50 hours of playtime in FF7 Rebirth too. It's interesting to compare these two games, as they use the same character models, but the first is a PS4 game remastered for PS5 with a much more limited environment, while the sequel is a big open world.
Jan 23 10 tweets 2 min read
WebGPU CPU->GPU update paths are designed to be super hard to use. Map is async and you should not wait for it. Thus you can't map->write->render in the same frame.

wgpuQueueWriteBuffer runs on the CPU timeline. You need to wait for a callback to know a buffer is not in use.

Thread... Waiting for a callback is not recommended on the web, and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data still in use by the GPU.
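A minimal sketch of that staging-ring pattern against the C webgpu.h API (assumptions: the older two-argument map callback signature, which newer headers replace with WGPUBufferMapCallbackInfo, and upload sizes that are multiples of 4 as WebGPU buffer copies require):

```cpp
#include <webgpu/webgpu.h>
#include <cstring>
#include <deque>

// Staging ring: a buffer is recycled when its async map callback fires,
// and a new one is created when nothing is ready yet. No blocking waits.
struct StagingBuffer {
    WGPUBuffer buffer   = nullptr;
    uint64_t   size     = 0;
    bool       ready    = false;   // mapped and writable
    bool       inFlight = false;   // copied from this frame, not yet re-mapped
};

static std::deque<StagingBuffer> gRing;  // deque keeps element addresses stable

static void onMapped(WGPUBufferMapAsyncStatus status, void* userdata) {
    StagingBuffer* sb = static_cast<StagingBuffer*>(userdata);
    sb->ready = (status == WGPUBufferMapAsyncStatus_Success);
}

// Find a mapped buffer of sufficient size, or grow the ring.
static StagingBuffer* acquireStaging(WGPUDevice device, uint64_t size) {
    for (StagingBuffer& sb : gRing)
        if (sb.ready && sb.size >= size) { sb.ready = false; return &sb; }

    WGPUBufferDescriptor desc = {};
    desc.size             = size;
    desc.usage            = WGPUBufferUsage_MapWrite | WGPUBufferUsage_CopySrc;
    desc.mappedAtCreation = true;  // writable immediately on first use
    gRing.push_back({wgpuDeviceCreateBuffer(device, &desc), size, false, false});
    return &gRing.back();
}

// Before submit: write into the mapped range and record the copy.
void uploadDynamicData(WGPUDevice device, WGPUCommandEncoder encoder,
                       WGPUBuffer gpuBuffer, const void* data, uint64_t size) {
    StagingBuffer* sb = acquireStaging(device, size);
    void* dst = wgpuBufferGetMappedRange(sb->buffer, 0, size);
    std::memcpy(dst, data, size);
    wgpuBufferUnmap(sb->buffer);
    wgpuCommandEncoderCopyBufferToBuffer(encoder, sb->buffer, 0, gpuBuffer, 0, size);
    sb->inFlight = true;
}

// After wgpuQueueSubmit: request async re-maps. Each callback marks its
// buffer ready again once the GPU has finished reading it.
void recycleStagingBuffers() {
    for (StagingBuffer& sb : gRing) {
        if (sb.inFlight) {
            sb.inFlight = false;
            wgpuBufferMapAsync(sb.buffer, WGPUMapMode_Write, 0, sb.size, onMapped, &sb);
        }
    }
}
```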
Jan 22 10 tweets 3 min read
Refactored our CommandBuffer interface to support compute. Final result:

A compute pass contains N dispatches, just like a render pass contains N draws (split into areas = viewports).

Renderpass object is static (due to Vulkan 1.0). Compute has a dynamic write-resource list.

This is how you would use the API to dispatch a compute pass with a single compute shader writing to two SSBOs.
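The original tweet showed the actual usage as an image. Below is a purely hypothetical sketch of what dispatching a compute pass that writes two SSBOs could look like with an interface shaped like that; every type and method name is invented, not HypeHype's real API.

```cpp
#include <cstdint>
#include <initializer_list>

// --- Hypothetical handles and interface (all names invented) ---
struct BufferHandle { uint32_t id; };
struct ShaderHandle { uint32_t id; };

struct ComputePass {
    void setShader(ShaderHandle) {}
    void setBuffer(uint32_t /*slot*/, BufferHandle) {}
    void dispatch(uint32_t, uint32_t, uint32_t) {}
};

struct CommandBuffer {
    // Compute pass declares the resources it writes up front (dynamic list),
    // unlike the static render pass object.
    ComputePass beginComputePass(std::initializer_list<BufferHandle> /*writes*/) { return {}; }
    void endComputePass(ComputePass&) {}
};

// Dispatch one compute shader that writes two SSBOs.
void recordParticleUpdate(CommandBuffer& cb, BufferHandle positions,
                          BufferHandle velocities, ShaderHandle updateShader,
                          uint32_t particleCount) {
    ComputePass pass = cb.beginComputePass({ positions, velocities });
    pass.setShader(updateShader);
    pass.setBuffer(0, positions);
    pass.setBuffer(1, velocities);
    pass.dispatch((particleCount + 63) / 64, 1, 1);  // 64 threads per group
    cb.endComputePass(pass);
}
```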
Dec 17, 2024 6 tweets 2 min read
I would really love to write a blog post about this whole debate. This is too complex a topic to discuss on Twitter.

Ubisoft was among the first to develop GPU-driven rendering and temporal upscaling: AC Unity, Rainbow Six Siege, For Honor, etc. Our SIGGRAPH 2015 talk, etc...

Originally, TAA was seen as an improvement over screen-space post-process AA techniques, as it provided subpixel information. It wasn't just a fancy blur algo. Today, people render so much noise. Noise increases the neighborhood bbox/variance, which increases ghosting.
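A minimal sketch of why that happens in a typical TAA resolve (generic neighborhood clamping, single channel, not any specific engine's code): the history color is clamped to the current frame's 3x3 neighborhood range, so noisy input widens that range and stale history passes through unclamped.

```cpp
#include <algorithm>

// One channel of a generic TAA resolve (a sketch).
// history: reprojected previous frame; current: 3x3 neighborhood of this frame.
float taaResolve(float history, const float current[9], float blendFactor = 0.1f) {
    // Neighborhood bounds of the current frame.
    float nmin = current[0], nmax = current[0];
    for (int i = 1; i < 9; ++i) {
        nmin = std::min(nmin, current[i]);
        nmax = std::max(nmax, current[i]);
    }

    // Clamp history into the bounds: with clean input the range is tight and
    // stale history (ghosts) gets rejected. Noisy input widens [nmin, nmax],
    // so the clamp stops rejecting and ghosting leaks through.
    float clampedHistory = std::clamp(history, nmin, nmax);

    // Exponential blend of the center sample and the (clamped) history.
    return clampedHistory + (current[4] - clampedHistory) * blendFactor;
}
```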
Nov 22, 2024 8 tweets 2 min read
AMD UDNA will be interesting.

CDNA3 architecture is still based on GCN's 4-cycle wave64 scheduling. RDNA schedules every cycle and exposes instruction latency. The scheduler runs/blocks instructions concurrently/dynamically. RDNA is much closer to Nvidia GPUs.

Thread... CDNA has wide matrix cores and other wide-compute workload improvements, which AMD wants to bring to UDNA. It also has multi-chip scaling.

Rumors say that RDNA4 will finally have matrix cores in the consumer space. It seems AMD is integrating matrix cores into the RDNA lineup early.
Nov 20, 2024 5 tweets 1 min read
I am impressed by our new WebGPU WASM page load time. The whole engine loads in just a few hundred milliseconds, and games load pretty much instantly too. This is how fast things should load. We have spent a massive effort optimizing the data structures. I also rewrote the renderer.

You can't get these kinds of load times with lots of tiny memory allocs, shared pointers, hash map lookups, frequent mutexes and all the other slow stuff.
Oct 5, 2024 30 tweets 6 min read
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core model; 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while being 17% faster in Linux server workloads. Why? What can we do to make games scale?

Thread... History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters, each with its own LLC. Communication between these clusters went through main memory. You wanted to minimize data sharing between the clusters to minimize the memory overhead.
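One practical mitigation (my sketch, not from the thread): pin a job system's worker threads to a single CCD/cluster so the data they share stays in that cluster's LLC. On Linux it could look like this; the core-to-CCD mapping is an assumption and really has to be queried per CPU.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling worker thread to cores [firstCore, firstCore + coreCount),
// e.g. the 8 cores of one CCD, so the data it shares stays in that CCD's L3.
static int pinThreadToCluster(int firstCore, int coreCount) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core = firstCore; core < firstCore + coreCount; ++core)
        CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Example: a job-system worker keeps itself on CCD0. Cores 0-7 is just a
// common Zen layout; the real mapping must be queried from the OS.
void* workerMain(void*) {
    pinThreadToCluster(/*firstCore=*/0, /*coreCount=*/8);
    // ... pull and run jobs ...
    return nullptr;
}
```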
Jun 27, 2024 8 tweets 2 min read
I've been talking about mobile vs desktop GPUs lately. I wanted to clarify that I focus on mainstream GPUs, which are popular among younger audiences: cheap <$200 Androids and 5+ year-old Apple GPUs.

Apple's newest GPU is actually very advanced:
The linked thread is speculation mixed with Apple's marketing material. Some details might be wrong.

This is the holy grail. Running branchy shaders on GPU efficiently. No need to compile millions of shader variants to statically branch ahead of time. Apple is doing it first.
Jun 25, 2024 14 tweets 3 min read
This is why I have concerns about GPU-driven rendering on mobile GPU architectures. Even on the latest Android phones.

Rainbow Six Siege uses a GPU-driven renderer (Ubisoft had 3 teams doing GPU-driven rendered games back then). The Intel iGPU runs it 2.65x faster.

TimeSpy does volume ray-marching and per-pixel OIT. Consoles got proper 3D local volume texture support recently. My benchmark on an M1 Max back in the day suggests that even Apple doesn't have 3D local volume textures in their phones. Maybe in M3. I didn't run my SDF test on it.
May 19, 2024 19 tweets 4 min read
Recently a book was released about Nokia's N-Gage handheld game console. Back then, Nokia was one of the most valuable companies in the EU. They spent lots of money to conquer the gaming market.

I was the lead programmer in their flagship game.

Thread...

hs.fi/taide/art-2000… The book realistically covers how crazy the spending was. Nokia spent a massive amount of money on marketing. They wanted to look like a gaming company: big stands at GDC and other places, massive launch parties, lots of press invited, etc.