Sebastian Aaltonen
Building a new renderer at HypeHype. Former principal engineer at Unity and Ubisoft. Opinions are my own.
Oct 5 30 tweets 6 min read
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core models; 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while 17% faster in Linux server workloads. Why? What can we do to make games scale?

Thread... History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters, each with its own LLC. Communication between the clusters went through main memory, so you wanted to minimize data sharing between them to minimize the memory overhead.
Jun 27 8 tweets 2 min read
I've been talking about mobile vs desktop GPUs lately. I wanted to clarify that I focus on mainstream GPUs, which are popular among younger audiences: cheap <$200 Androids and 5+ year old Apple GPUs.

Apple's newest GPU is actually very advanced:
The linked thread is speculation mixed with Apple's marketing material. Some details might be wrong.

This is the holy grail: running branchy shaders efficiently on the GPU, with no need to compile millions of shader variants to statically branch ahead of time. Apple is doing it first.
Jun 25 14 tweets 3 min read
This is why I have concerns about GPU-driven rendering on mobile GPU architectures. Even on the latest Android phones.

Rainbow Six: Siege uses a GPU-driven renderer (Ubisoft had three teams making GPU-driven games back then). An Intel iGPU runs it 2.65x faster.

TimeSpy does volume ray-marching and per-pixel OIT. Consoles got proper 3D local volume texture support recently. My benchmark on an M1 Max back in the day suggests that even Apple doesn't have 3D local volume textures in their phones. Maybe in M3; I didn't run my SDF test on it.
May 19 19 tweets 4 min read
Recently a book was released about Nokia's N-Gage handheld game console. Back then, Nokia was one of the most valuable companies in the EU. They spent lots of money trying to conquer the gaming market.

I was the lead programmer on their flagship game.

Thread...

hs.fi/taide/art-2000… The book realistically covers how crazy the spending was. Nokia poured massive amounts of money into marketing. They wanted to look like a gaming company: big stands at GDC and elsewhere, massive launch parties, lots of press invited, etc.
Feb 14 8 tweets 2 min read
I wrote my first engine (called Storm3D) during my university studies. I licensed it to FrozenByte and some other Finnish game companies, and they shipped a few games using it.

It was a classic object-oriented C++ code base with diamond inheritance. That was cool back in the day :) In the year 2000, object-oriented design was based on real-world examples like "a hawk is a bird", and people tried to solve diamond-pattern issues with birds that couldn't fly. Later people realized that real-world abstractions are bad: you should inherit/compose a "flying" capability instead, as in the sketch below.
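A minimal sketch of that shift (my example, not from the original engine): model the capability, not the taxonomy.

```cpp
// Year-2000 style: taxonomy inheritance forces awkward hierarchies
// (where does a penguin go if Bird::fly() exists?).
// Capability style: compose the behavior into the types that need it.
struct Flying
{
    float maxAltitude = 100.0f;
    void fly() { /* ... */ }
};

struct Hawk
{
    Flying flight; // hawks can fly
};

struct Penguin
{
    // no Flying member: penguins simply lack the capability
};
```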
Dec 27, 2023 24 tweets 4 min read
Have been thinking about possible GPU-driven rendering approaches for older mobile GPUs: traditional vertex buffers, no bindless textures, and no multidraw. Thread...

If you put all your meshes in one big mesh cache (a single buffer), you can change the mesh simply by modifying the indirect draw call's start index and index count. With an indirect instanced draw call, you can also modify the number of instances of that type.
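A minimal sketch of the idea using Vulkan's standard indirect draw argument struct (the mesh-cache bookkeeping names are my own):

```cpp
#include <vulkan/vulkan.h>

// One entry per mesh in the big shared mesh cache (illustrative bookkeeping).
struct MeshRange
{
    uint32_t firstIndex; // offset into the shared index buffer
    uint32_t indexCount;
};

// Selecting a different mesh, or a different instance count, is just a CPU
// (or compute shader) write into the indirect argument buffer; the recorded
// draw call itself never changes.
void patchDraw(VkDrawIndexedIndirectCommand& cmd, const MeshRange& mesh, uint32_t instances)
{
    cmd.indexCount    = mesh.indexCount;
    cmd.instanceCount = instances;
    cmd.firstIndex    = mesh.firstIndex;
    cmd.vertexOffset  = 0; // shared vertex buffer layout assumed
    cmd.firstInstance = 0;
}
```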
Dec 22, 2023 9 tweets 2 min read
Intel dropped Larrabee for a reason. The workloads just didn't exist back then. Dedicated AI processors are still niche.

In 2009 games were designed for Xbox360 & PS3. No compute shaders. Simple pixel/vertex workloads with minimal control flow...

tomshardware.com/pc-components/… The simple SIMD design of Nvidia and AMD GPUs was just a better fit, in both power and performance, for the graphics workloads of that era. Larrabee used modified P54C cores with 4-way hyperthreading and wider SIMD units, but it was still more like a CPU than a GPU.
Oct 26, 2023 6 tweets 2 min read
New cascaded shadows GPU time = 68%

More precise culling helps a lot.

It's still a single render pass. 2048x2048 texture atlas. Three 1024x1024 cascades in it (last 1024x1024 region saved for local lights).

Let's talk about the correct way to cull shadows...

The old shadow rendering fit one shadow map to the whole frustum, then culled all objects using the shadow frustum, which is an ortho box. Objects in the areas marked red can't cast a shadow onto the visible frustum geometry. ~3x the shadow draw count. Big waste.
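A hedged sketch of a tighter test (my formulation, not necessarily the thread's exact method): instead of culling casters against the full shadow ortho box, shrink the box to the light-space bounds of the visible receivers first.

```cpp
struct AABB { float min[3], max[3]; };

// Tighter shadow culling sketch: a caster only matters if its light-space
// bounds overlap the visible receivers' bounds on the two axes perpendicular
// to the light, and it isn't past the receivers along the light direction.
// Assumes +z is the light direction in light space.
bool castsVisibleShadow(const AABB& casterLS, const AABB& visibleReceiversLS)
{
    bool overlapX = casterLS.max[0] >= visibleReceiversLS.min[0] &&
                    casterLS.min[0] <= visibleReceiversLS.max[0];
    bool overlapY = casterLS.max[1] >= visibleReceiversLS.min[1] &&
                    casterLS.min[1] <= visibleReceiversLS.max[1];
    bool inFront  = casterLS.min[2] <= visibleReceiversLS.max[2];
    return overlapX && overlapY && inFront;
}
```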
Oct 12, 2023 8 tweets 2 min read
Figured out a fast way to do prefix sum of a byte mask (for draw stream compaction).

Let's assume we have a byte mask: 8 bytes inside a 64-bit integer. Each byte has its first bit set if that slot has data. Their values are 1, 256, 65536, etc...

Thread... If we want to propagate a bit to all higher byte counters, we first multiply that bit by 1+256+65536+16M+... This is a fixed constant. Then we add this number to the result. Now all higher bytes in the 64-bit uint have their counters increased if the first bit was one...
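All eight propagate-and-add steps collapse into one multiply, since multiplying by that constant is exactly the sum of the shifted copies. A minimal sketch (the example mask and the compaction note are my own framing):

```cpp
#include <cstdint>
#include <cstdio>

// mask holds 8 slots, one byte each: 1 if the slot has data, 0 if not.
// Multiplying by 0x0101010101010101 adds each byte into every higher byte,
// so byte i of the result is the inclusive prefix sum of slots 0..i.
// Sums stay <= 8, so bytes never carry into their neighbors.
uint64_t byteMaskPrefixSum(uint64_t mask)
{
    return mask * 0x0101010101010101ull;
}

int main()
{
    uint64_t mask = 0x0101000101000001ull; // slots 0, 3, 4, 6, 7 occupied
    uint64_t sums = byteMaskPrefixSum(mask);
    for (int i = 0; i < 8; i++)
        std::printf("slot %d -> prefix sum %d\n", i, int((sums >> (i * 8)) & 0xff));
    // For compaction you typically want the exclusive sum (each slot's output
    // offset): multiply (mask << 8) instead, which shifts everything up a slot.
}
```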
Sep 18, 2023 8 tweets 2 min read
Discussing the importance of ambient lighting techniques with our team. This screenshot is a perfect example of how bad big shadowed areas look if you don't handle this problem well. The sun light (point light) is not the only light source outdoors. The sky light (big area light) is super important too. You need at least some approximation of sky light visibility to make outdoor rendering look good.
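One hedged way to write the idea down (my notation, not from the thread):

$$L_{\text{out}} \approx V_{\text{sun}}\, L_{\text{sun}}\, f_{\text{sun}} \;+\; V_{\text{sky}}\, L_{\text{sky}}\, f_{\text{ambient}}$$

Here $V_{\text{sun}}$ comes from the shadow map and $V_{\text{sky}}$ is an ambient-occlusion-style sky visibility term. Dropping the second term, or using a constant $V_{\text{sky}} = 1$, is exactly what makes large shadowed areas look flat.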
Sep 10, 2023 13 tweets 3 min read
I think I finally "understood" the theory of relativity. The graphics programmer version: it's really just some normal map decode math and tonemapping :)

Thread... Space is 4D instead of 3D. The first axis is time. We always move at the speed of light: if an object is stationary, it moves at the speed of light in the time direction, while massless particles moving at the speed of light don't move at all in the time direction.
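The normal-map analogy, written out (my formulation; the real Minkowski metric flips a sign, but the reconstruction has the same shape):

$$\|v\| = c \;\Rightarrow\; v_t = \sqrt{c^2 - \left(v_x^2 + v_y^2 + v_z^2\right)}, \qquad\text{just like}\qquad n_z = \sqrt{1 - n_x^2 - n_y^2}.$$

Given the spatial velocity, the time component is reconstructed from a fixed-length constraint, exactly how a two-channel normal map reconstructs $z$ from the unit-length constraint.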
Sep 3, 2023 14 tweets 3 min read
Somebody asked why you can't return arrays (dynamically sized data) in C. There's an interesting technical limitation in stack-based programming languages that prevents this: it's not possible to return a dynamic amount of data using the stack.

Thread... We all know that alloca exists. It lets you allocate a dynamic amount of data from the top of the stack. This works like any statically sized stack variable: it dies at return. No issues there. Why, then, is the return a problem?
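A minimal sketch of the problem (POSIX <alloca.h>; MSVC spells it _alloca in <malloc.h>):

```cpp
#include <alloca.h> // POSIX; on MSVC it's _alloca in <malloc.h>
#include <cstddef>
#include <cstring>

// Fine: the alloca block belongs to this function's stack frame and is
// released automatically when the frame is popped at return.
void useScratch(std::size_t n)
{
    char* scratch = (char*)alloca(n);
    std::memset(scratch, 0, n);
    // ... use scratch here ...
}

// Broken: returning the block would require it to outlive the frame, but
// the stack pointer is restored on return, so the pointer dangles. This is
// the structural reason a stack can't hand a dynamically sized result
// upward: the caller's frame sits below it, already laid out.
char* returnScratch(std::size_t n)
{
    char* scratch = (char*)alloca(n);
    return scratch; // dangling pointer the moment we return
}
```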
Aug 26, 2023 5 tweets 2 min read
RG11B10F is supported on our min-spec phones (incl. Android). It's 32 bpp, and all mobile GPUs support framebuffer compression for it (RGBA16F framebuffer compression was introduced later).

This makes writing a proper PBR pipeline easy, also when considering HDR output.

developer.apple.com/videos/play/ww…
On Xbox 360, RGB10FA2 (float) was supported as a render target, but not as a texture: you had to manually unpack it from RGB10A2, so no filtering, etc. RGBA16F was half-rate render / half-rate filter. It was painful back then. People used RGBM8 and similar tricks.
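A minimal sketch of the RGBM8 trick (the range constant and rounding policy are my assumptions; engines varied):

```cpp
#include <algorithm>
#include <cmath>

// RGBM8: store an HDR color as (rgb / (m * range), m). Decode is rgb * m * range.
// kRange ~6 was a common choice; the exact value is engine-specific.
constexpr float kRange = 6.0f;

struct Rgbm { float r, g, b, m; }; // all four channels land in a unorm8 target

Rgbm rgbmEncode(float r, float g, float b)
{
    float m = std::max({ r, g, b, 1e-6f }) / kRange;
    m = std::ceil(std::min(m, 1.0f) * 255.0f) / 255.0f; // round up so decoded rgb never clips
    float s = 1.0f / (m * kRange);
    return { r * s, g * s, b * s, m };
}

void rgbmDecode(const Rgbm& c, float& r, float& g, float& b)
{
    float s = c.m * kRange;
    r = c.r * s;
    g = c.g * s;
    b = c.b * s;
}
```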
Aug 5, 2023 51 tweets 9 min read
A thread about rollback network code.

Last weekend was the Tekken 8 beta, and many gamers were talking about rollback netcode (Tekken finally has it). I used to write network code, including rollback netcode, so I will explain how it works in fighting games...

I will limit the discussion to deterministic games running at a fixed step rate (60 simulation steps per second is the most popular choice). Deterministic simulation means that all clients agree 100% on the game state every frame, except in the case of a rollback, of course...
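A hedged C++20 sketch of the core rollback loop (all types and the toy simulation rule are mine; a real implementation also handles input delay, confirmation frames, etc.):

```cpp
#include <cstdint>
#include <vector>

struct Inputs    { uint32_t buttons = 0; bool operator==(const Inputs&) const = default; };
struct GameState { int32_t posX = 0; }; // stand-in for the full deterministic state

// Pure, deterministic step: same state + same inputs => same result on every client.
GameState simulate(const GameState& s, const Inputs& in)
{
    GameState next = s;
    if (in.buttons & 1) next.posX += 1; // toy "move right" rule
    return next;
}

std::vector<GameState> history{ GameState{} }; // history[t] = state at the start of tick t
std::vector<Inputs>    inputs;                 // confirmed local + predicted remote inputs
// (the main loop is assumed to push a snapshot + input entry every tick)

// A remote input for tick t arrived late. If it differs from our prediction,
// roll back to the snapshot at t and silently re-simulate up to the present.
void onRemoteInput(int t, const Inputs& confirmed, int currentTick)
{
    if (inputs[t] == confirmed)
        return; // prediction held: no rollback needed
    inputs[t] = confirmed;
    GameState s = history[t];
    for (int i = t; i < currentTick; i++)
    {
        s = simulate(s, inputs[i]);
        history[i + 1] = s; // refresh snapshots along the corrected timeline
    }
}
```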
Jul 19, 2023 18 tweets 4 min read
Thread comparing V-buffer and hardware TBDR approaches.

Some of my TBDR/POSH thoughts here:
When I was designing a V-buffer style renderer in 2015, I was a bit concerned about having to run the vertex shader three times per pixel. People might say that this is fine if you are targeting 1:1 pixel:triangle density like Nanite does, but that's cutting corners...
Jul 16, 2023 31 tweets 5 min read
Let's talk about fast draw calls on the bottom 50% of mobile devices. And let's ignore the bottom 10%, so that we can assume Vulkan 1.0 support = compute shaders.

Why are we not doing GPU-driven rendering? Why isn't instancing always a win? The first thing I want to talk about is memory loads on old GPUs. Before Nvidia Turing and AMD RDNA1, all buffer loads on these vendors went through the texture samplers. Texture samplers have high latency (~100 cycles). The only exceptions were uniform buffer & scalar loads.
Jul 12, 2023 6 tweets 1 min read
@NinjaJez @dannthr @AeornFlippout In rendering, at the beginning of the shader you subtract the camera (int xyz) position from the object (int xyz) position, convert the result to floating point, and do the rest of the math in floating point. This way you get no precision issues regardless of where you are in the world. The same of course applies to light sources. The camera/object/light matrix has the xyz translation as integers and the 3x3 rotation/scale part as floats.
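A minimal sketch of that scheme (the 64-bit integer width is my assumption; the tweets just say int xyz):

```cpp
#include <cstdint>

struct Int3   { int64_t x, y, z; }; // world-space positions and translations
struct Float3 { float x, y, z; };   // camera-relative positions for shading math

// Subtract in integers first, then convert: near the camera the difference is
// small, so it fits comfortably in float precision no matter how far the
// world coordinates are from the origin.
Float3 cameraRelative(const Int3& objectPos, const Int3& cameraPos)
{
    return { float(objectPos.x - cameraPos.x),
             float(objectPos.y - cameraPos.y),
             float(objectPos.z - cameraPos.z) };
}
```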
May 9, 2023 5 tweets 2 min read
16-bit unorm is the best vertex position format. I have shipped many games using it. HypeHype will soon use it too (2.5x memory savings).

Precalc the model's xyz bounds and round them up to the next pow2 to ensure you get zero precision issues with pow2 grid snapping (for kitbashed content). For the 2.5x storage savings, I also store the tangent frame in a denser way. We will use the same 16-bit UV encoding that Horizon Forbidden West uses.
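A minimal sketch of the quantization (the exact rounding and bounds policy here are my assumptions):

```cpp
#include <algorithm>
#include <cstdint>

// Quantize one model-space coordinate into 16-bit unorm against pow2 bounds.
// Rounding the extent up to the next power of two keeps the quantization grid
// consistent across kitbashed pieces, so touching meshes don't crack.
uint16_t encodeUnorm16(float v, float boundsMin, float pow2Extent)
{
    float normalized = (v - boundsMin) / pow2Extent; // map into [0, 1]
    normalized = std::clamp(normalized, 0.0f, 1.0f);
    return (uint16_t)(normalized * 65535.0f + 0.5f);
}

float decodeUnorm16(uint16_t q, float boundsMin, float pow2Extent)
{
    return boundsMin + (q / 65535.0f) * pow2Extent;
}
```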
May 3, 2023 6 tweets 3 min read
Splatting Gaussians instead of ray-marching. Reminds me of particle-based renderer experiments. Interesting to see whether gather or scatter algorithms win this round.

@NOTimothyLottes thoughts? Translation: Gaussians = stretched particles. Research papers need fancy terminology :)
May 2, 2023 4 tweets 1 min read
C++20 designated initializers, C++11 struct default values, and a custom span type (with support for initializer lists) are a good combination for graphics resource creation. Declaring default values with the C++11 default member initializer syntax is super clean. All the API code you need is the struct itself. No need to implement builders or other code bloat that you'd have to maintain.
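A small sketch in that spirit (the struct and names are illustrative, not HypeHype's actual API):

```cpp
#include <cstdint>

enum class Format { RGBA8, RG11B10F, D32F };

// Defaults live in the struct (C++11 default member initializers);
// no builder class or setter chain needed.
struct TextureDesc
{
    uint32_t width      = 1;
    uint32_t height     = 1;
    uint32_t mipLevels  = 1;
    Format   format     = Format::RGBA8;
    bool     renderable = false;
};

// C++20 designated initializers at the call site: only the fields that
// differ from the defaults appear, in declaration order.
TextureDesc hdrTarget{ .width = 1920, .height = 1080,
                       .format = Format::RG11B10F, .renderable = true };
```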
May 2, 2023 7 tweets 2 min read
Managed to generate binding ids for the generated GLSL shader for GLES3/WebGL2 using the SPIRV-Cross API.

GLES doesn't have descriptor sets, so I must generate a flat contiguous range of binding ids per set and store each set's start index. The runtime binds N slots at a time (bind groups). I also have to dump a mapping table for the combined samplers in the shader, since our renderer has separate samplers and images.
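A hedged sketch of the flattening step with the SPIRV-Cross C++ API (the per-set slot count and walking only uniform buffers are my simplifications; a full version would also handle sampled images, storage buffers, etc.):

```cpp
#include <spirv_cross/spirv_glsl.hpp>

// Flatten (set, binding) pairs into one contiguous GLES binding range per set.
void flattenBindings(spirv_cross::CompilerGLSL& compiler, uint32_t slotsPerSet)
{
    spirv_cross::ShaderResources res = compiler.get_shader_resources();
    for (auto& ubo : res.uniform_buffers)
    {
        uint32_t set     = compiler.get_decoration(ubo.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(ubo.id, spv::DecorationBinding);
        compiler.set_decoration(ubo.id, spv::DecorationBinding, set * slotsPerSet + binding);
        compiler.unset_decoration(ubo.id, spv::DecorationDescriptorSet);
    }
    // GLES has no separate samplers: let SPIRV-Cross pair images with samplers,
    // then dump the (image, sampler) -> combined id table for the runtime.
    compiler.build_combined_image_samplers();
    for (const auto& remap : compiler.get_combined_image_samplers())
    {
        // remap.combined_id, remap.image_id, remap.sampler_id go into the
        // runtime mapping table here.
        (void)remap;
    }
}
```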