Sebastian Aaltonen
Building a new renderer at HypeHype. Former principal engineer at Unity and Ubisoft. Opinions are my own.
Sep 30 13 tweets 3 min read
When you design data structures, always think in cache lines (64B or 128B). You don't want to have tiny nodes scattered around the memory. Often it's better to have wider nodes (preferably 1 cache line each) and shallower structures. Fewer pointer/offset indirections.

When implementing spatial data structures, you also need to think about spatial locality. If you have early outs, then consider embedding the early-out conditions as a bitfield in the spatial data structure directly, instead of fetching the objects before the early out.
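A minimal C++ sketch of the idea (my illustration, not HypeHype code): a wide node padded to exactly one 64B cache line, with per-child early-out bits embedded in the node itself so the early-out test touches no extra memory.

```cpp
#include <cstdint>

// Illustrative sketch only: a wide spatial tree node sized to exactly
// one 64-byte cache line. Eight children per node keep the tree shallow
// (fewer indirections), and the per-child early-out bits live inside
// the node, so the early-out test fetches no child objects.
struct alignas(64) WideNode
{
    uint32_t childOffset[8]; // offsets into a node/leaf pool, not pointers
    uint16_t childFlags[8];  // embedded early-out condition bits per child
    uint8_t  childCount;
    uint8_t  pad[15];        // pad up to 64 bytes
};
static_assert(sizeof(WideNode) == 64, "one node per cache line");
```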
Sep 28 17 tweets 4 min read
What do you guys do when you get a sudden urge to build a 100,000 player MMO with a fully destructible world? You do the math that it could run on a single 128 core Threadripper server (everybody in the same world) with your crazy netcode ideas and super well optimized C code...

Before Claybook we had a multiplayer prototype with a 1GB SDF world state modified with commands (deterministic static world state, non-deterministic dynamic object state). Tiny network traffic. Could easily scale this idea to a 1TB world (2TB RAM on that Threadripper)...
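A hedged sketch of the command-based replication idea (all names and fields are my own illustration, not the prototype's actual code): the large SDF world is never sent over the network, only small edit commands are, and every machine applies the same commands in the same tick order, keeping the static world state deterministic and identical everywhere.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: one SDF edit command. A frame's worth of these is
// a handful of ~32-byte structs -- tiny network traffic compared to the
// 1GB (or 1TB) world they modify.
struct WorldEditCommand
{
    uint32_t tick;        // simulation tick the edit is applied on
    uint32_t brushId;     // which SDF brush/shape to apply
    float    position[3]; // where to apply it in the world
    float    radius;      // brush size
    int32_t  sign;        // +1 = add material, -1 = carve
};

void applyWorldEdits(const std::vector<WorldEditCommand>& commands /*, SdfWorld& world */)
{
    for (const WorldEditCommand& cmd : commands)
    {
        // world.applyBrush(cmd.brushId, cmd.position, cmd.radius, cmd.sign);
        (void)cmd; // the SDF update itself is out of scope for this sketch
    }
}
```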
Sep 26 6 tweets 1 min read
Finland was a key tech player 20 years ago: We invented the SSH and IRC protocols. Nokia was the EU's most valuable company, selling more phones yearly than Apple and Samsung sell today combined. We invented the OS that runs most internet servers today. Nokia failed and Linux is free...

Finland has some new successes: Wolt is the biggest EU food delivery service, Oura was the first health ring, and Silo AI is one of the EU's biggest AI companies. Wolt got sold to Doordash ($3.5B), Silo AI got sold to AMD ($665M). Oura is still an $11B Finnish company.
Sep 19 20 tweets 4 min read
I have realized that there aren't that many people out there who understand the big picture of modern GPU hardware and all the APIs: Vulkan 1.4 with the latest extensions, DX12 SM 6.6, Metal 4, OpenCL and CUDA. What is the hardware capable of? What should a modern API look like?

My "No Graphics API" blog post will discuss all of this. My conclusion is that Metal 4.0 is actually closest to the goal. It has flaws too. DX12 SM 6.6 doesn't have those particular flaws, but has a lot of other flaws. Vulkan has all the flaws combined, with useful extensions :)
May 30 4 tweets 1 min read
The past decades have been a wonderful time for gamers+devs. The biggest chips, using the latest nodes and trillions worth of R&D, were all targeted at gaming. Now, those chips are needed by professionals (AI). We'll never see a big-die GPU at a reasonable price point anymore :(

The fun lasted for a very long time, but it's over on both the CPU and GPU side. The biggest CPU and GPU dies are no longer designed for gamers. The top end Threadripper costs over $10k today. The top end Nvidia B200 costs over $30k. A few generations ago, top tier HW was targeting gamers :(
May 18 9 tweets 2 min read
Unit tests have lots of advantages, but the cons are often ignored:
- Code must be split into testable parts. This often requires more interfaces, which add code bloat and complexity.
- Each call site is a dependency. Test case = +1 dependency. Added inertia to refactor and throw away code.
- Bloated unit test suites taking several hours to execute. This slows down devs and causes merge conflicts as pushes are delayed.
- Unstable tests randomly failing pushes.
- Unit test maintenance and optimization is needed to keep tests manageable. Otherwise developer velocity suffers.
May 7 5 tweets 2 min read
When you split a function into N small functions, the reader also suffers multiple "instruction cache" misses (similar to the CPU when executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.

Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice here. But often functions like this are a code smell. Why do you need those branches? Why is the function doing too many unrelated things? Maybe it's too generic? Refactor?
Jan 23 10 tweets 2 min read
WebGPU CPU->GPU update paths are designed to be super hard to use. Map is async and you should not wait for it. Thus you can't map->write->render in the same frame.

wgpuQueueWriteBuffer runs on the CPU timeline. You need to wait for a callback to know a buffer is not in use.

Thread... Waiting for a callback is not recommended on the web, and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data still in use by the GPU.
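A rough sketch of such a staging ring, assuming the C `webgpu.h` API and that buffers are released from a completion callback (e.g. one registered via wgpuQueueOnSubmittedWorkDone); callback signatures vary between webgpu.h revisions, so only the buffer creation below is taken directly from the header, the rest is my own structure.

```cpp
#include <cstdint>
#include <vector>
#include <webgpu/webgpu.h>

// Dynamically growing staging buffer ring. Buffers are handed out for
// uploads and only recycled once a completion callback reports that the
// frame which used them has finished on the GPU. Never blocks the CPU:
// if nothing is free, the ring simply grows.
struct StagingRing
{
    struct Entry { WGPUBuffer buffer; uint64_t size; bool inFlight; };
    std::vector<Entry> entries;
    WGPUDevice device = nullptr;

    WGPUBuffer acquire(uint64_t size)
    {
        for (Entry& e : entries)
            if (!e.inFlight && e.size >= size) { e.inFlight = true; return e.buffer; }

        // No free buffer of sufficient size: grow instead of waiting.
        WGPUBufferDescriptor desc = {};
        desc.size = size;
        desc.usage = WGPUBufferUsage_MapWrite | WGPUBufferUsage_CopySrc;
        entries.push_back({ wgpuDeviceCreateBuffer(device, &desc), size, true });
        return entries.back().buffer;
    }

    // Call from the completion callback of the frame that used the
    // buffer; after that the GPU can no longer be reading it.
    void release(WGPUBuffer buffer)
    {
        for (Entry& e : entries)
            if (e.buffer == buffer) e.inFlight = false;
    }
};
```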
Jan 22 10 tweets 3 min read
Refactored our CommandBuffer interface to support compute. Final result:

A compute pass contains N dispatches, just like a render pass contains N draws (split into areas = viewports).

Renderpass object is static (due to Vulkan 1.0). Compute has a dynamic write resource list.

The follow-up tweet showed (in an image) how you would use the API to dispatch a compute pass with a single compute shader writing to two SSBOs.
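Since the images don't reproduce here, this is a hypothetical sketch of what dispatching a compute pass that writes two SSBOs could look like with such an interface. All names (ComputePassDesc, beginComputePass, setBuffer, etc.) are illustrative assumptions, not the real HypeHype API.

```cpp
#include <cstdint>
#include <vector>

// Placeholder handle types and a stubbed CommandBuffer, purely so the
// sketch is self-contained. These are NOT the real interface.
using BufferHandle = uint32_t;
using ComputePipelineHandle = uint32_t;

struct ComputePassDesc
{
    // Dynamic list of resources the pass writes, so the backend can put
    // barriers around the whole pass instead of around every dispatch.
    std::vector<BufferHandle> writeBuffers;
};

struct CommandBuffer
{
    void beginComputePass(const ComputePassDesc&) {}
    void setComputePipeline(ComputePipelineHandle) {}
    void setBuffer(uint32_t /*slot*/, BufferHandle) {}
    void dispatch(uint32_t /*x*/, uint32_t /*y*/, uint32_t /*z*/) {}
    void endComputePass() {}
};

// One compute pass, one pipeline, two SSBOs written.
void recordParticleUpdate(CommandBuffer& cb, ComputePipelineHandle pipeline,
                          BufferHandle positions, BufferHandle velocities,
                          uint32_t particleCount)
{
    ComputePassDesc pass;
    pass.writeBuffers = { positions, velocities }; // the two SSBOs the pass writes

    cb.beginComputePass(pass);
    cb.setComputePipeline(pipeline);
    cb.setBuffer(0, positions);
    cb.setBuffer(1, velocities);
    cb.dispatch((particleCount + 63) / 64, 1, 1); // 64-thread groups assumed
    cb.endComputePass();
}
```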
Dec 17, 2024 6 tweets 2 min read
I would really love to write a blog post about this whole debate. This is too complex a topic to discuss on Twitter.

Ubisoft was among the first to develop GPU-driven rendering and temporal upscaling. AC: Unity, Rainbow Six Siege, For Honor, etc. Our SIGGRAPH 2015 talk, etc...

Originally, TAA was seen as an improvement over screen-space post-process AA techniques as it provided subpixel information. It wasn't just a fancy blur algo. Today, people render so much noise. Noise increases the neighborhood bbox/variance, which increases ghosting.
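To make that last point concrete, here is a minimal single-channel sketch of generic TAA neighborhood clamping (my formulation, not any specific shipped resolve): the history sample is clamped to the min/max of the 3x3 neighborhood, so the noisier the current frame, the wider that range and the more stale history survives the clamp as ghosting.

```cpp
#include <algorithm>

// One channel shown for brevity; real TAA operates on color vectors.
float resolveTaa(const float neighborhood3x3[9], float history, float blendFactor = 0.1f)
{
    // Min/max of the 3x3 spatial neighborhood of the current frame.
    float lo = neighborhood3x3[0];
    float hi = neighborhood3x3[0];
    for (int i = 1; i < 9; ++i)
    {
        lo = std::min(lo, neighborhood3x3[i]);
        hi = std::max(hi, neighborhood3x3[i]);
    }
    // Noise widens [lo, hi], so more wrong history passes this clamp.
    float clampedHistory = std::clamp(history, lo, hi);
    float current = neighborhood3x3[4]; // center sample of the 3x3 window
    return current * blendFactor + clampedHistory * (1.0f - blendFactor);
}
```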
Nov 22, 2024 8 tweets 2 min read
AMD UDNA will be interesting.

CDNA3 architecture is still based on GCN's 4-cycle wave64 scheduling. RDNA schedules every cycle and exposes instruction latency. The scheduler runs/blocks instructions concurrently/dynamically. RDNA is much closer to Nvidia GPUs.

Thread... CDNA has wide matrix cores and other wide compute workload improvements, which AMD wants to bring to UDNA. It also has multi-chip scaling.

Rumors say that RDNA4 will finally have matrix cores in the consumer space. It seems that AMD is integrating matrix cores early into the RDNA lineup.
Nov 20, 2024 5 tweets 1 min read
I am impressed by our new WebGPU WASM page load time. The whole engine loads in just a few hundred milliseconds. And games load pretty much instantly too. This is how fast things should load. We have spent a massive effort optimizing the data structures. I also rewrote the renderer.

You can't get these kinds of load times with lots of tiny memory allocs, shared pointers, hash map lookups, frequent mutexes and all the other slow stuff.
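A rough illustration of that point (my sketch, not HypeHype code): compare a pointer-and-hash-map heavy asset registry with a flat, handle-based one. The latter does a couple of large allocations and linear writes at load time instead of one small allocation plus hashing per object.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Mesh { uint32_t vertexOffset, vertexCount, indexOffset, indexCount; };

// "Slow stuff": one heap allocation plus refcount per mesh, string
// hashing and pointer chasing on every lookup. Load time scales with
// allocator and hash map overhead.
using SlowMeshRegistry = std::unordered_map<std::string, std::shared_ptr<Mesh>>;

// Flat alternative: a dense array filled linearly at load time, with a
// plain integer index as the handle. A few big allocations, cache
// friendly, and trivially serializable as one blob.
struct FlatMeshRegistry
{
    std::vector<Mesh> meshes; // the index into this array is the handle
};
```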
Oct 5, 2024 30 tweets 6 min read
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core models. 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while 17% faster in Linux server workloads. Why? What can we do to make games scale?

Thread... History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters with their own LLCs. Communication between these clusters went through main memory. You wanted to minimize data sharing between the clusters to minimize the memory overhead.
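A hedged sketch of one way to apply the same lesson on a dual-CCD desktop CPU: keep job-system workers (and the data they share) within one CCD/LLC. The "logical cores 0..7 are CCD 0" assumption is illustrative; real code should query the CPU topology instead of hardcoding it.

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Pin worker threads to the first CCD so jobs that share data stay in
// one LLC and never pay the cross-CCD (or, on Jaguar, cross-cluster via
// main memory) round trip. Assumes logical cores 0..7 map to CCD 0.
void pinWorkersToFirstCcd(std::vector<std::thread>& workers)
{
    for (size_t i = 0; i < workers.size(); ++i)
    {
        DWORD_PTR mask = 1ull << (i % 8); // one core on CCD 0 per worker
        SetThreadAffinityMask(workers[i].native_handle(), mask);
    }
}
```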
Jun 27, 2024 8 tweets 2 min read
I've been talking about mobile vs desktop GPUs lately. I wanted to clarify that I focus on mainstream GPUs, which are popular among younger audiences. Cheap <$200 Androids and 5+ year old Apple GPUs.

Apple's newest GPU is actually very advanced:
The linked thread is speculation mixed with Apple's marketing material. Some details might be wrong.

This is the holy grail. Running branchy shaders on GPU efficiently. No need to compile millions of shader variants to statically branch ahead of time. Apple is doing it first.
Jun 25, 2024 14 tweets 3 min read
This is why I have concerns about GPU-driven rendering on mobile GPU architectures. Even on the latest Android phones.

Rainbow Six: Siege is using a GPU-driven renderer (Ubisoft had 3 teams doing GPU-driven rendered games back then). The Intel iGPU runs it 2.65x faster.

TimeSpy does volume ray-marching and per-pixel OIT. Consoles got proper 3D local volume texture support recently. My benchmark using an M1 Max back in the day suggests that even Apple doesn't have 3D local volume textures in their phones. Maybe in M3. Didn't run my SDF test on it.
May 19, 2024 19 tweets 4 min read
Recently a book was released about Nokia's N-Gage handheld game console. Back then, Nokia was one of the most valuable companies in the EU. They spent lots of money to conquer the gaming market.

I was the lead programmer in their flagship game.

Thread...

hs.fi/taide/art-2000… The book realistically covers how crazy the spending was. Nokia spent a massive amount of money on marketing. They wanted to look like a gaming company. Big stands at GDC and other places. Massive launch parties, lots of press invited, etc, etc.
Feb 14, 2024 8 tweets 2 min read
I wrote my first engine (called Storm3D) during my University studies. Licensed it to FrozenByte and some other Finnish game companies. They shipped a few games using it.

It was a classic object oriented C++ code base with diamond inheritance. That was cool back in the day :)

In the year 2000, object oriented programming was based on real-world examples like "a hawk is a bird", and people tried to solve diamond pattern issues with birds that could not fly. Later people realized that real-world abstractions are bad. You should inherit/compose "flying" instead.
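A tiny sketch of that shift (illustrative, not Storm3D code): the 2000-era "a hawk is a bird (and birds fly)" hierarchy breaks as soon as a non-flying bird shows up, while composing a flight capability does not.

```cpp
// Year-2000 style: behavior baked into the class hierarchy. Breaks for
// Penguin, and multiple such behavior axes lead to diamond inheritance.
struct Bird            { virtual ~Bird() = default; virtual void fly() {} };
struct Hawk    : Bird  { void fly() override { /* soar */ } };
struct Penguin : Bird  { void fly() override { /* ...now what? */ } };

// Composition instead: "flying" is a capability an entity has (or not),
// independent of what it "is".
struct FlyingAbility { float maxSpeed = 0.0f; };

struct Entity
{
    FlyingAbility* flying = nullptr; // null = can't fly
};
```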
Dec 27, 2023 24 tweets 4 min read
Have been thinking about possible GPU-driven rendering approaches for older mobile GPUs. Traditional vertex buffers, no bindless textures, and no multidraw. Thread...

If you put all your meshes in one big mesh cache (1 buffer), you can simply change the mesh by modifying the indirect draw call's start index and index count. With an indirect instanced draw call, you can also modify the number of instances of that type.
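A sketch of that mesh cache idea, assuming a Vulkan-style indexed indirect draw (the command struct below matches the VkDrawIndexedIndirectCommand layout): swapping the mesh is just rewriting the first index and index count in the indirect args, and the instance count selects how many instances of that type are drawn.

```cpp
#include <cstdint>

// Layout-compatible with VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand
{
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// One entry per mesh in the big shared mesh cache (one index/vertex buffer).
struct MeshCacheEntry
{
    uint32_t firstIndex;   // start of this mesh in the shared index buffer
    uint32_t indexCount;   // number of indices
    int32_t  vertexOffset; // base vertex in the shared vertex buffer
};

// "Changing the mesh" of an indirect draw is just patching its args
// (on the CPU, or from a compute shader writing the indirect buffer).
// No bindless, no multidraw needed: one indirect draw per object type,
// instanceCount picks how many instances of that type to render.
void patchIndirectDraw(DrawIndexedIndirectCommand& cmd,
                       const MeshCacheEntry& mesh,
                       uint32_t visibleInstances)
{
    cmd.indexCount    = mesh.indexCount;
    cmd.firstIndex    = mesh.firstIndex;
    cmd.vertexOffset  = mesh.vertexOffset;
    cmd.instanceCount = visibleInstances;
    cmd.firstInstance = 0;
}
```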
Dec 22, 2023 9 tweets 2 min read
Intel dropped Larrabee for a reason. The workloads just didn't exist back then. Dedicated AI processors are still niche.

In 2009 games were designed for Xbox360 & PS3. No compute shaders. Simple pixel/vertex workloads with minimal control flow...

tomshardware.com/pc-components/… The simple SIMD design of Nvidia and AMD GPUs was just a better fit for both power and performance when running the graphics workloads of that era. Larrabee used modified P54C cores. It had 4-wide hyperthreading and wider SIMD units, but it was still more like a CPU than a GPU.
Oct 26, 2023 6 tweets 2 min read
New cascaded shadows GPU time = 68%

More precise culling helps a lot.

It's still a single render pass. 2048x2048 texture atlas. Three 1024x1024 cascades in it (last 1024x1024 region saved for local lights).

Let's talk about the correct way to cull shadows...

The old shadow rendering fit one shadow map to the whole frustum. Then it culled all objects using the shadow frustum, which is an ortho box. Objects in the areas marked with red can't cast a shadow onto the visible frustum geometry. ~3x the shadow draw count. Big waste.
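A hedged sketch of the kind of tighter test this is heading toward (my formulation, not necessarily the exact one used): for a directional light, an object can only shadow visible geometry if its bounds, swept along the direction the light travels, intersect the camera frustum. Objects inside the shadow ortho box that fail this test are exactly the wasted "red area" draws.

```cpp
#include <algorithm>

// Plane convention: normals point inside the frustum,
// dot(n, p) + d >= 0 means "inside the plane".
struct Aabb { float min[3]; float max[3]; };

bool canCastShadowIntoView(const Aabb& box, const float lightDir[3],
                           float maxShadowDistance, const float frustumPlanes[6][4])
{
    // Sweep: bounds of the object's AABB translated along the light
    // direction by the maximum shadow casting distance.
    Aabb swept = box;
    for (int i = 0; i < 3; ++i)
    {
        float offset = lightDir[i] * maxShadowDistance;
        swept.min[i] = std::min(box.min[i], box.min[i] + offset);
        swept.max[i] = std::max(box.max[i], box.max[i] + offset);
    }

    // Conservative swept-AABB vs camera frustum test.
    for (int p = 0; p < 6; ++p)
    {
        const float* plane = frustumPlanes[p];
        // Pick the AABB corner furthest along the plane normal.
        float corner[3];
        for (int i = 0; i < 3; ++i)
            corner[i] = plane[i] >= 0.0f ? swept.max[i] : swept.min[i];
        float dist = plane[0] * corner[0] + plane[1] * corner[1] +
                     plane[2] * corner[2] + plane[3];
        if (dist < 0.0f)
            return false; // fully outside: cannot shadow anything visible
    }
    return true;
}
```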
Oct 12, 2023 8 tweets 2 min read
Figured out a fast way to do prefix sum of a byte mask (for draw stream compaction).

Let's assume we have a byte mask: 8 bytes inside a 64 bit integer. Each byte has its first bit set if that slot has data. The byte values are 1, 256, 65536, etc...

Thread... If we want to propagate a bit to all higher bytes, we first multiply that bit by 1+256+65536+16M+... This is a fixed constant. Then we add this number to the result. Now all higher bytes in the 64 bit uint have their counters increased if the first bit was one...
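A C++ sketch of the trick: because multiplication distributes over the set bits, multiplying the whole byte mask by the fixed constant 0x0101010101010101 (= 1 + 256 + 65536 + ...) performs all of those per-bit propagate-and-add steps at once, leaving each byte lane with the inclusive count of occupied slots at or below it.

```cpp
#include <cstdint>
#include <cstdio>

// Each byte of the product holds the inclusive count of occupied slots
// up to that lane. Counts are at most 8, so byte lanes never overflow
// into each other.
uint64_t inclusivePrefixCounts(uint64_t mask)
{
    return mask * 0x0101010101010101ull;
}

// Exclusive prefix sum (the compacted output index of each occupied
// slot): simply don't count the slot's own bit.
uint64_t exclusivePrefixCounts(uint64_t mask)
{
    return mask * 0x0101010101010101ull - mask;
}

int main()
{
    // Slots 0, 2 and 3 occupied -> bytes 0, 2 and 3 hold the value 1.
    uint64_t mask = 0x0000000001010001ull;
    uint64_t excl = exclusivePrefixCounts(mask);
    uint64_t incl = inclusivePrefixCounts(mask);
    for (int i = 0; i < 8; ++i)
        printf("slot %d: occupied=%d exclusive=%d inclusive=%d\n", i,
               int((mask >> (i * 8)) & 0xff),
               int((excl >> (i * 8)) & 0xff),
               int((incl >> (i * 8)) & 0xff));
    return 0;
}
```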