Sebastian Aaltonen
Building a new renderer at HypeHype. Former principal engineer at Unity and Ubisoft. Opinions are my own.
Feb 14 8 tweets 2 min read
I wrote my first engine (called Storm3D) during my university studies. Licensed it to FrozenByte and some other Finnish game companies. They shipped a few games using it.

It was a classic object oriented C++ code base with diamond inheritance. That was cool back in the day :) Around the year 2000, object oriented programming was based on real world examples ("a hawk is a bird"), and people tried to solve diamond pattern issues with birds that don't fly. Later people realized that real world abstractions are bad. You should inherit/compose "flying" instead.
Dec 27, 2023 24 tweets 4 min read
Have been thinking about possible GPU-driven rendering approaches for older mobile GPUs. Traditional vertex buffers, no bindless textures, and no multidraw. Thread... If you put all your meshes in one big mesh cache (1 buffer), you can simply change the mesh by modifying the indirect draw call's start index and index count. With an indirect instanced draw call, you can also modify the number of instances of that type (see the sketch below).
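A minimal sketch of that idea, assuming Vulkan-style indirect draws (the struct matches VkDrawIndexedIndirectCommand; the helper and the mesh-allocator outputs are illustrative):

```cpp
#include <cstdint>

// Indirect draw record (same layout as VkDrawIndexedIndirectCommand).
struct DrawIndexedIndirect
{
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// With every mesh sub-allocated from one shared index/vertex buffer,
// "changing the mesh" is just rewriting the start index and index count,
// and instancing is just the instanceCount field.
void setMesh(DrawIndexedIndirect& draw, uint32_t meshFirstIndex,
             uint32_t meshIndexCount, uint32_t instances)
{
    draw.firstIndex    = meshFirstIndex;
    draw.indexCount    = meshIndexCount;
    draw.instanceCount = instances;
}
```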
Dec 22, 2023 9 tweets 2 min read
Intel dropped Larrabee for a reason. The workloads just didn't exist back then. Dedicated AI processors are still niche.

In 2009 games were designed for Xbox 360 & PS3. No compute shaders. Simple pixel/vertex workloads with minimal control flow...

tomshardware.com/pc-components/… The simple SIMD design of Nvidia and AMD GPUs was just a better fit for both power and performance for running the graphics workloads of that era. Larrabee used modified P54C cores. It had 4-wide hyperthreading and wider SIMD units, but it was still more like a CPU than a GPU.
Oct 26, 2023 6 tweets 2 min read
New cascaded shadows GPU time = 68%

More precise culling helps a lot.

It's still a single render pass. 2048x2048 texture atlas. Three 1024x1024 cascades in it (last 1024x1024 region saved for local lights).

Let's talk about the correct way to cull shadows... The old shadow rendering fit one shadow map to the whole frustum. Then it culled all objects using the shadow frustum, which is an ortho box. Objects in the areas marked with red can't cast a shadow onto the visible frustum geometry. ~3x shadow draw count. Big waste.
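A hedged sketch of the tighter culling idea, simplified here to AABB vs AABB (the thread's actual test is against the view frustum): a caster only matters if its bounds, extruded along the light direction, can reach the visible geometry.

```cpp
#include <algorithm>

struct Float3 { float x, y, z; };
struct Aabb   { Float3 min, max; };

// Extrude the caster bounds along the (normalized) light direction by the max
// shadow distance, then test overlap against the visible geometry's bounds.
bool castsVisibleShadow(Aabb caster, Float3 lightDir, float maxDist, const Aabb& visible)
{
    Float3 off{ lightDir.x * maxDist, lightDir.y * maxDist, lightDir.z * maxDist };
    caster.min = { std::min(caster.min.x, caster.min.x + off.x),
                   std::min(caster.min.y, caster.min.y + off.y),
                   std::min(caster.min.z, caster.min.z + off.z) };
    caster.max = { std::max(caster.max.x, caster.max.x + off.x),
                   std::max(caster.max.y, caster.max.y + off.y),
                   std::max(caster.max.z, caster.max.z + off.z) };
    return caster.min.x <= visible.max.x && caster.max.x >= visible.min.x &&
           caster.min.y <= visible.max.y && caster.max.y >= visible.min.y &&
           caster.min.z <= visible.max.z && caster.max.z >= visible.min.z;
}
```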
Oct 12, 2023 8 tweets 2 min read
Figured out a fast way to do prefix sum of a byte mask (for draw stream compaction).

Let's assume we have a byte mask. 8 bytes inside a 64 bit integer. Each byte has its first bit set if that slot has data. The byte values are 1, 256, 65536, etc...

Thread... If we want to propagate a bit to all higher bytes, we first multiply that bit by 1+256+65536+16M+... This is a fixed constant. Then we add this number to the result. Now all higher bytes in the 64 bit uint have their counters increased if the first bit was one...
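Doing that for every bit and summing is equivalent to one multiply by the repeated constant; a minimal sketch of that closed form (not necessarily the exact code from the thread):

```cpp
#include <cstdint>
#include <cstdio>

// Multiplying by 0x0101010101010101 adds the mask shifted by 0,8,...,56 bits,
// so byte i of the product is the inclusive count of set slots 0..i.
// With at most 8 ones per lane there is no carry between bytes.
uint64_t bytePrefixSum(uint64_t mask)
{
    return mask * 0x0101010101010101ull;
}

int main()
{
    uint64_t mask = 0x0001010001000101ull; // slots 0, 1, 3, 5, 6 occupied
    uint64_t prefix = bytePrefixSum(mask);
    for (int i = 0; i < 8; ++i)
        printf("slot %d: count %d\n", i, int((prefix >> (i * 8)) & 0xff));
}
```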
Sep 18, 2023 8 tweets 2 min read
Discussing the importance of ambient lighting techniques with our team. This screenshot is a perfect example of how bad big shadowed areas look if you don't handle this problem in a good way. The sun light (point light) is not the only light source outdoors. The sky light (big area light) is super important too. You need at least some level of approximation of sky light visibility to make outdoor rendering look good.
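A minimal sketch of one cheap approximation in that spirit, a hemisphere ambient term scaled by a sky-visibility/AO factor (not necessarily the technique the team chose):

```cpp
struct Float3 { float x, y, z; };

static Float3 lerp(Float3 a, Float3 b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

// Hemisphere ambient: blend ground and sky color by how much the normal
// faces up, then scale by an estimate of how much sky the point can see.
Float3 skyAmbient(Float3 normal, Float3 skyColor, Float3 groundColor, float skyVisibility)
{
    float up = normal.y * 0.5f + 0.5f; // 1 = facing sky, 0 = facing ground
    Float3 h = lerp(groundColor, skyColor, up);
    return { h.x * skyVisibility, h.y * skyVisibility, h.z * skyVisibility };
}
```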
Sep 10, 2023 13 tweets 3 min read
I think I finally "understood" the theory of relativity. The graphics programmer version: it's really just some normal map decode math and tonemapping :)

Thread... Space is 4D instead of 3D. The first axis is time. We always move at the speed of light. If an object is stationary, it moves at the speed of light in the time direction. Massless particles moving at the speed of light don't move at all in the time direction.
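The analogy as formulas: with the total 4-speed fixed at $c$, the time component is reconstructed from the spatial part the same way a unit normal's $z$ is reconstructed from its $xy$ (the real Minkowski norm flips the sign of the spatial terms, but the reconstruction idea is the same):

$$u_t = \sqrt{c^2 - \|u_{xyz}\|^2} \qquad\longleftrightarrow\qquad n_z = \sqrt{1 - n_x^2 - n_y^2}$$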
Sep 3, 2023 14 tweets 3 min read
Somebody asked why you can't return arrays (dynamic sized data) in C. There's an interesting technical limitation in stack based programming languages that prevents this. It's not possible to return a dynamic amount of data using the stack.

Thread... We all know that alloca exists. It allows you to allocate a dynamic amount of data from the top of the stack. This works like any static sized stack variable: it dies at return. No issues there. Why is the return then a problem?
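A small sketch of why the return is the problem (alloca is non-standard; the header varies by platform):

```cpp
#include <alloca.h>  // non-standard: <malloc.h> + _alloca on MSVC
#include <cstring>

// The alloca'd block lives in THIS function's stack frame. Returning pops
// the frame, so the pointer handed back to the caller immediately dangles.
char* brokenCopyString(const char* src)
{
    size_t len = strlen(src) + 1;
    char* buffer = (char*)alloca(len); // freed when this frame is popped
    memcpy(buffer, src, len);
    return buffer; // BUG: dangling pointer, undefined behavior
}
```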
Aug 26, 2023 5 tweets 2 min read
RG11B10F is supported on our min spec phones (incl. Android). 32 bpp + all mobile GPUs support framebuffer compression for it (RGBA16F framebuffer compression was introduced later).

This makes writing a proper PBR pipeline easy. We are also considering HDR output.

developer.apple.com/videos/play/ww…
On Xbox 360, RGB10FA2 (float) was supported as a render target, but not as a texture. You had to manually unpack it from RGB10A2. Thus no filtering, etc. RGBA16F was half rate render / half rate filter. It was painful back then. People used RGBM8 and similar tricks.
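For reference, a minimal sketch of the classic RGBM8 encode (the range constant 6.0 is a common choice, not a value from the thread):

```cpp
#include <algorithm>
#include <cmath>

struct Float4 { float x, y, z, w; };

// RGBM: store HDR rgb divided by a shared multiplier M, with M in alpha.
// Decode is simply rgb * M * range.
Float4 encodeRGBM(float r, float g, float b, float range = 6.0f)
{
    float m = std::max({ r, g, b }) / range;
    m = std::clamp(std::ceil(m * 255.0f) / 255.0f, 1.0f / 255.0f, 1.0f);
    return { r / (m * range), g / (m * range), b / (m * range), m };
}
```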
Aug 5, 2023 51 tweets 9 min read
A thread about rollback network code.

Last weekend was the Tekken 8 beta and many gamers were talking about rollback netcode (Tekken finally has it). I used to write network code in the past, including rollback netcode, so I will explain how it works in fighting games now... I will limit the discussion to deterministic games running at a fixed step rate (60 simulation steps per second is the most popular choice). Deterministic simulation means that all clients agree 100% on the game state every frame, except in the case of a rollback of course...
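A hedged skeleton of the core rollback loop under those assumptions (all type and function names are illustrative; real implementations use ring buffers for the histories):

```cpp
#include <cstdint>
#include <vector>

struct GameState { /* complete deterministic game state */ };
struct Input     { uint32_t buttons = 0; };

// Deterministic fixed-step simulation: same inputs in, same state out.
void simulateStep(GameState&, Input, Input) { /* advance one tick */ }

struct RollbackSession
{
    std::vector<GameState> savedStates;  // snapshot per recent frame
    std::vector<Input>     localInputs;  // confirmed local inputs
    std::vector<Input>     remoteInputs; // predicted, later confirmed

    // A remote input arrives for an old frame. If our prediction was wrong,
    // restore that frame's snapshot and re-simulate up to the present.
    void onRemoteInput(uint32_t frame, Input confirmed, uint32_t currentFrame,
                       GameState& state)
    {
        if (remoteInputs[frame].buttons == confirmed.buttons)
            return;                      // prediction held, no rollback needed
        remoteInputs[frame] = confirmed;
        state = savedStates[frame];      // rollback
        for (uint32_t f = frame; f < currentFrame; ++f)
            simulateStep(state, localInputs[f], remoteInputs[f]);
    }
};
```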
Jul 19, 2023 18 tweets 4 min read
Thread comparing V-buffer and hardware TBDR approaches.

Some of my TBDR/POSH thoughts here:
When I was designing a V-buffer style renderer in 2015, I was a bit concerned about having to run the vertex shader 3 times per pixel. People might say that this is fine if you are targeting 1:1 pixel:triangle density like Nanite does, but that's cutting corners...
Jul 16, 2023 31 tweets 5 min read
Let's talk about fast draw calls on the bottom 50% of mobile devices. And let's ignore the bottom 10% so that we can assume Vulkan 1.0 support = compute shaders.

Why are we not doing GPU-driven rendering? Why is instancing not always a win? The first thing I want to talk about is memory loads on old GPUs. Before Nvidia Turing and AMD RDNA1, all buffer loads on these vendors were going through the texture samplers. Texture samplers have high latency (~100 cycles). The only exception was uniform buffer & scalar loads.
Jul 12, 2023 6 tweets 1 min read
@NinjaJez @dannthr @AeornFlippout In rendering, in the beginning of the shader you subtract the camera (int xyz) position from the object (int xyz) position. Then you convert the result to floating point and do the rest of the math in floating point. This way you get no precision issues regardless of where you are in the world. The same of course applies to light sources. The camera/object/light matrix has the xyz translate as integer and the 3x3 rotation/scale part as float.
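A minimal CPU-side sketch of that camera-relative setup, assuming 32-bit integer world coordinates (names are illustrative):

```cpp
#include <cstdint>

struct Int3   { int32_t x, y, z; };
struct Float3 { float x, y, z; };

// The subtraction happens in integer math, so it is exact no matter how far
// from the origin both positions are. Only the small camera-relative delta
// is converted to float, where precision is plentiful.
Float3 cameraRelative(Int3 object, Int3 camera)
{
    return { float(object.x - camera.x),
             float(object.y - camera.y),
             float(object.z - camera.z) };
}
```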
May 9, 2023 5 tweets 2 min read
16 bit unorm is the best vertex position format. I have shipped many games using it. HypeHype will soon use it too (2.5x mem savings).

Precalculate the model xyz bounds and round them to the next pow2 to ensure that you get zero precision issues with pow2 grid snapping (for kitbashed content). For 2.5x storage savings, I also store the tangent frame in a denser way. We will use the same 16 bit UV encoding that Horizon Forbidden West uses.
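A small sketch of the position quantization under those assumptions (the pow2 rounding keeps snapped kitbashed meshes on the same grid; helper names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Round a bounds extent up to the next power of two.
float nextPow2(float extent)
{
    return std::exp2(std::ceil(std::log2(extent)));
}

// Quantize one model-space coordinate to 16 bit unorm inside [min, min + extent].
uint16_t quantizeUnorm16(float v, float boundsMin, float pow2Extent)
{
    float n = std::clamp((v - boundsMin) / pow2Extent, 0.0f, 1.0f);
    return uint16_t(n * 65535.0f + 0.5f);
}
```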
May 3, 2023 6 tweets 3 min read
Splatting gaussians instead of ray-marching. Reminds me of particle based renderer experiments. Interesting to see whether gather or scatter algos win this round.

@NOTimothyLottes thoughts? Translation: Gaussians = stretched particles. Research papers need fancy terminology :)
May 2, 2023 4 tweets 1 min read
C++20 designated initializers, C++11 struct default values and a custom span type (with support for initializer lists) are a good combination for graphics resource creation. Declaring default values with C++11 aggregate initialization syntax is super clean. All the API code you need is this struct. No need to implement builders or other code bloat that you need to maintain.
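A minimal sketch of the pattern (the struct and its fields are illustrative, not HypeHype's actual API):

```cpp
#include <cstdint>

// C++11 default member initializers give every field a sane default.
struct TextureDesc
{
    uint32_t width = 1;
    uint32_t height = 1;
    uint32_t mipCount = 1;
    bool renderTarget = false;
};

// C++20 designated initializers: the caller overrides only what differs.
TextureDesc desc{ .width = 1024, .height = 1024, .renderTarget = true };
```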
May 2, 2023 7 tweets 2 min read
Managed to generate binding ids for the generated GLSL shader for GLES3/WebGL2 using the SPIRV-Cross API.

GLES doesn't have sets, so I must generate a flat contiguous range of binding ids per set and store the set start index. The runtime binds N slots at a time (bind groups). I also must dump a mapping table for combined samplers in the shader. Our renderer has separate samplers and images.
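A hedged sketch of the flattening step with the SPIRV-Cross reflection API (setBase[], a precomputed start index per descriptor set, is an assumption of this example):

```cpp
#include <spirv_cross/spirv_glsl.hpp>

// Rewrite each (set, binding) pair to a flat GLES binding id.
void flattenBindingIds(spirv_cross::CompilerGLSL& compiler, const uint32_t* setBase)
{
    spirv_cross::ShaderResources res = compiler.get_shader_resources();
    auto flatten = [&](const spirv_cross::SmallVector<spirv_cross::Resource>& resources)
    {
        for (const spirv_cross::Resource& r : resources)
        {
            uint32_t set = compiler.get_decoration(r.id, spv::DecorationDescriptorSet);
            uint32_t binding = compiler.get_decoration(r.id, spv::DecorationBinding);
            compiler.set_decoration(r.id, spv::DecorationBinding, setBase[set] + binding);
            compiler.unset_decoration(r.id, spv::DecorationDescriptorSet);
        }
    };
    flatten(res.uniform_buffers);
    flatten(res.sampled_images);
}
```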
May 2, 2023 6 tweets 2 min read
This applies to pretty much every GPU out there.

I haven't yet talked much about the GPU shader optimizations I plan for HypeHype. Heavy pass merging in the post process stack is one of them.

Thread... I was talking about the new DOTS hybrid renderer GPU persistent data model 2 years ago at SIGGRAPH. We calculated object inverse matrices in the data upload shader, because that was practically free. ALU is free in shaders that practically just copy data around.
Apr 22, 2023 15 tweets 3 min read
Let's design a fast screen tile based local light solution for mobile and WebGL 2.0 (no compute). A per-object light list sounded good until I realized that we have a terrain. Even the infinite ground plane is awkward to light with a per-object light list.

Thread... No SSBOs. Uniform buffers are limited to 16KB (a low end Android limitation). Up to 256 lights visible at once. Use the same float4 position + half4 color + half4 direction + cos angle setup that handles both point lights and directional lights. 32B * 256 lights = 8KB light array.
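A sketch of that 32-byte light record (halves stored as raw 16-bit patterns; the exact field packing is my reading of the thread, not confirmed code):

```cpp
#include <cstdint>

struct PackedLight
{
    float    position[4];       // float4: xyz position + w           (16 B)
    uint16_t color[4];          // half4 color (raw half-float bits)   (8 B)
    uint16_t directionAngle[4]; // half3 direction + half cos(angle)   (8 B)
};
static_assert(sizeof(PackedLight) == 32, "32B * 256 lights = 8KB UBO array");
```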
Apr 22, 2023 27 tweets 6 min read
I have implemented practically all the possible local light rendering algorithms during my career, yet I am considering a trivial per-object light list for HypeHype.

Kitbashed content = lots of small objects. Granularity seems fine.

Thread... Set up all the visible local light source data into a UBO array at the beginning of the render pass. For each object, a uint32 contains four packed 8 bit light indices. In the beginning of the light loop, do a binary AND to take the lowest 8 bits, then shift down 8 bits (next light).
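The unpacking loop as a small sketch (shader logic shown in C++; a real version also needs a way to mark unused slots, which the thread doesn't specify):

```cpp
#include <cstdint>

// Four 8-bit light indices packed in one uint32, consumed low byte first.
void shadeObjectLights(uint32_t packedLights)
{
    for (int i = 0; i < 4; ++i)
    {
        uint32_t lightIndex = packedLights & 0xffu; // binary AND: lowest 8 bits
        // ... accumulate lighting from lights[lightIndex] here ...
        packedLights >>= 8;                         // shift down: next light
    }
}
```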
Apr 21, 2023 7 tweets 2 min read
The high level agenda of my presentation currently looks like this. Each of the main topics has a lot of sub-topics of course.

If you find anything missing that you would want to hear about, please reply in the thread. Correction: The backend doesn't process or set up data.

The platform specific backend code just passes handles and offsets around, so that the data provided directly by the user land code is visible in the shaders. Zero copies and no backend refactoring when data layout changes.