Sebastian Aaltonen
Building a new renderer at HypeHype. Former principal engineer at Unity and Ubisoft. Opinions are my own.
Sep 30 13 tweets 3 min read
When you design data structures, always think in cache lines (64B or 128B). You don't want to have tiny nodes scattered around the memory. Often it's better to have wider nodes (preferably 1 cache line each) and shallower structures. Fewer pointer/offset indirections.

When implementing spatial data structures, you also need to think about spatial locality. If you have early outs, then consider embedding the early-out conditions as a bitfield in the spatial data structure directly, instead of fetching the objects before the early out.
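A minimal C++ sketch of the idea (my illustration, not HypeHype code): a wide node padded to exactly one 64B cache line, with per-child early-out bits embedded in the node itself so the early-out test touches no extra memory.

```cpp
#include <cstdint>

// Illustrative sketch only: a wide spatial tree node sized to exactly
// one 64-byte cache line. Eight children per node keep the tree shallow
// (fewer indirections), and the per-child early-out bits live inside
// the node, so the early-out test fetches no child objects.
struct alignas(64) WideNode
{
    uint32_t childOffset[8]; // offsets into a node/leaf pool, not pointers
    uint16_t childFlags[8];  // embedded early-out condition bits per child
    uint8_t  childCount;
    uint8_t  pad[15];        // pad up to 64 bytes
};
static_assert(sizeof(WideNode) == 64, "one node per cache line");
```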
Sep 28 17 tweets 4 min read
What do you guys do when you get a sudden urge to build a 100,000 player MMO with a fully destructible world? You do the math that it could run on a single 128 core Threadripper server (everybody in the same world) with your crazy netcode ideas and super well optimized C code...

Before Claybook we had a multiplayer prototype with a 1GB SDF world state modified with commands (deterministic static world state, non-deterministic dynamic object state). Tiny network traffic. Could easily scale this idea to a 1TB world (2TB RAM on that Threadripper)...
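A hedged sketch of the command-based replication idea (all names and fields are my own illustration, not the prototype's actual code): the large SDF world is never sent over the network, only small edit commands are, and every machine applies the same commands in the same tick order, keeping the static world state deterministic and identical everywhere.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: one SDF edit command. A frame's worth of these is
// a handful of ~32-byte structs -- tiny network traffic compared to the
// 1GB (or 1TB) world they modify.
struct WorldEditCommand
{
    uint32_t tick;        // simulation tick the edit is applied on
    uint32_t brushId;     // which SDF brush/shape to apply
    float    position[3]; // where to apply it in the world
    float    radius;      // brush size
    int32_t  sign;        // +1 = add material, -1 = carve
};

void applyWorldEdits(const std::vector<WorldEditCommand>& commands /*, SdfWorld& world */)
{
    for (const WorldEditCommand& cmd : commands)
    {
        // world.applyBrush(cmd.brushId, cmd.position, cmd.radius, cmd.sign);
        (void)cmd; // the SDF update itself is out of scope for this sketch
    }
}
```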
Sep 26 6 tweets 1 min read
Finland was a key tech player 20 years ago: We invented the SSH and IRC protocols. Nokia was the EU's most valuable company, selling more phones yearly than Apple and Samsung sell today combined. We invented the OS that runs most internet servers today. Nokia failed and Linux is free...

Finland has some new successes: Wolt is the biggest EU food delivery service, Oura was the first health ring, and Silo AI is one of the EU's biggest AI companies. Wolt got sold to Doordash ($3.5B), Silo AI got sold to AMD ($665M). Oura is still an $11B Finnish company.
Sep 19 20 tweets 4 min read
I have realized that there aren't that many people out there who understand the big picture of modern GPU hardware and all the APIs: Vulkan 1.4 with the latest extensions, DX12 SM 6.6, Metal 4, OpenCL and CUDA. What is the hardware capable of? What should a modern API look like?

My "No Graphics API" blog post will discuss all of this. My conclusion is that Metal 4.0 is actually closest to the goal. It has flaws too. DX12 SM 6.6 doesn't have those particular flaws, but has a lot of other flaws. Vulkan has all the flaws combined, with useful extensions :)
May 30 4 tweets 1 min read
The past decades have been a wonderful time for gamers+devs. The biggest chips, using the latest nodes and trillions worth of R&D, were all targeted at gaming. Now, those chips are needed by professionals (AI). We'll never see a big-die GPU at a reasonable price point anymore :(

The fun lasted for a very long time, but it's over on both the CPU and GPU side. The biggest CPU and GPU dies are no longer designed for gamers. The top end Threadripper costs over $10k today. The top end Nvidia B200 costs over $30k. A few generations ago, top tier HW was targeting gamers :(
May 18 9 tweets 2 min read
Unit tests have lots of advantages, but the cons are often ignored:
- Code must be split into testable parts. This often requires more interfaces, which add code bloat and complexity.
- Each call site is a dependency. Test case = +1 dependency. Added inertia to refactor and throw away code.
- Bloated unit test suites taking several hours to execute. This slows down devs and causes merge conflicts as pushes are delayed.
- Unstable tests randomly failing pushes.
- Unit test maintenance and optimization is needed to keep tests manageable. Otherwise developer velocity suffers.
May 7 5 tweets 2 min read
When you split a function into N small functions, the reader also suffers multiple "instruction cache" misses (similar to the CPU when executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.

Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice here. But often functions like this are a code smell. Why do you need those branches? Why is the function doing too many unrelated things? Maybe it's too generic? Refactor?
Jan 23 10 tweets 2 min read
WebGPU CPU->GPU update paths are designed to be super hard to use. Map is async and you should not wait for it. Thus you can't map->write->render in the same frame.

wgpuQueueWriteBuffer runs on the CPU timeline. You need to wait for a callback to know a buffer is not in use.

Thread... Waiting for a callback is not recommended on the web, and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data still in use by the GPU.
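A rough sketch of such a staging ring, assuming the C `webgpu.h` API and that buffers are released from a completion callback (e.g. one registered via wgpuQueueOnSubmittedWorkDone); callback signatures vary between webgpu.h revisions, so only the buffer creation below is taken directly from the header, the rest is my own structure.

```cpp
#include <cstdint>
#include <vector>
#include <webgpu/webgpu.h>

// Dynamically growing staging buffer ring. Buffers are handed out for
// uploads and only recycled once a completion callback reports that the
// frame which used them has finished on the GPU. Never blocks the CPU:
// if nothing is free, the ring simply grows.
struct StagingRing
{
    struct Entry { WGPUBuffer buffer; uint64_t size; bool inFlight; };
    std::vector<Entry> entries;
    WGPUDevice device = nullptr;

    WGPUBuffer acquire(uint64_t size)
    {
        for (Entry& e : entries)
            if (!e.inFlight && e.size >= size) { e.inFlight = true; return e.buffer; }

        // No free buffer of sufficient size: grow instead of waiting.
        WGPUBufferDescriptor desc = {};
        desc.size = size;
        desc.usage = WGPUBufferUsage_MapWrite | WGPUBufferUsage_CopySrc;
        entries.push_back({ wgpuDeviceCreateBuffer(device, &desc), size, true });
        return entries.back().buffer;
    }

    // Call from the completion callback of the frame that used the
    // buffer; after that the GPU can no longer be reading it.
    void release(WGPUBuffer buffer)
    {
        for (Entry& e : entries)
            if (e.buffer == buffer) e.inFlight = false;
    }
};
```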
Jan 22 10 tweets 3 min read
Refactored our CommandBuffer interface to support compute. Final result:

A compute pass contains N dispatches, just like a render pass contains N draws (split into areas = viewports).

Renderpass object is static (due to Vulkan 1.0). Compute has a dynamic write resource list.

The follow-up tweet showed (in an image) how you would use the API to dispatch a compute pass with a single compute shader writing to two SSBOs.
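Since the images don't reproduce here, this is a hypothetical sketch of what dispatching a compute pass that writes two SSBOs could look like with such an interface. All names (ComputePassDesc, beginComputePass, setBuffer, etc.) are illustrative assumptions, not the real HypeHype API.

```cpp
#include <cstdint>
#include <vector>

// Placeholder handle types and a stubbed CommandBuffer, purely so the
// sketch is self-contained. These are NOT the real interface.
using BufferHandle = uint32_t;
using ComputePipelineHandle = uint32_t;

struct ComputePassDesc
{
    // Dynamic list of resources the pass writes, so the backend can put
    // barriers around the whole pass instead of around every dispatch.
    std::vector<BufferHandle> writeBuffers;
};

struct CommandBuffer
{
    void beginComputePass(const ComputePassDesc&) {}
    void setComputePipeline(ComputePipelineHandle) {}
    void setBuffer(uint32_t /*slot*/, BufferHandle) {}
    void dispatch(uint32_t /*x*/, uint32_t /*y*/, uint32_t /*z*/) {}
    void endComputePass() {}
};

// One compute pass, one pipeline, two SSBOs written.
void recordParticleUpdate(CommandBuffer& cb, ComputePipelineHandle pipeline,
                          BufferHandle positions, BufferHandle velocities,
                          uint32_t particleCount)
{
    ComputePassDesc pass;
    pass.writeBuffers = { positions, velocities }; // the two SSBOs the pass writes

    cb.beginComputePass(pass);
    cb.setComputePipeline(pipeline);
    cb.setBuffer(0, positions);
    cb.setBuffer(1, velocities);
    cb.dispatch((particleCount + 63) / 64, 1, 1); // 64-thread groups assumed
    cb.endComputePass();
}
```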
Dec 17, 2024 6 tweets 2 min read
I would really love to write a blog post about this whole debate. This is too complex a topic to discuss on Twitter.

Ubisoft was among the first to develop GPU-driven rendering and temporal upscaling. AC: Unity, Rainbow Six Siege, For Honor, etc. Our SIGGRAPH 2015 talk, etc...

Originally, TAA was seen as an improvement over screen-space post-process AA techniques as it provided subpixel information. It wasn't just a fancy blur algo. Today, people render so much noise. Noise increases the neighborhood bbox/variance, which increases ghosting.
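To make that last point concrete, here is a minimal single-channel sketch of generic TAA neighborhood clamping (my formulation, not any specific shipped resolve): the history sample is clamped to the min/max of the 3x3 neighborhood, so the noisier the current frame, the wider that range and the more stale history survives the clamp as ghosting.

```cpp
#include <algorithm>

// One channel shown for brevity; real TAA operates on color vectors.
float resolveTaa(const float neighborhood3x3[9], float history, float blendFactor = 0.1f)
{
    // Min/max of the 3x3 spatial neighborhood of the current frame.
    float lo = neighborhood3x3[0];
    float hi = neighborhood3x3[0];
    for (int i = 1; i < 9; ++i)
    {
        lo = std::min(lo, neighborhood3x3[i]);
        hi = std::max(hi, neighborhood3x3[i]);
    }
    // Noise widens [lo, hi], so more wrong history passes this clamp.
    float clampedHistory = std::clamp(history, lo, hi);
    float current = neighborhood3x3[4]; // center sample of the 3x3 window
    return current * blendFactor + clampedHistory * (1.0f - blendFactor);
}
```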
Nov 22, 2024 8 tweets 2 min read
AMD UDNA will be interesting.

CDNA3 architecture is still based on GCN's 4-cycle wave64 scheduling. RDNA schedules every cycle and exposes instruction latency. The scheduler runs/blocks instructions concurrently/dynamically. RDNA is much closer to Nvidia GPUs.

Thread... CDNA has wide matrix cores and other wide compute workload improvements, which AMD wants to bring to UDNA. It also has multi-chip scaling.

Rumors say that RDNA4 will finally have matrix cores in the consumer space. It seems that AMD is integrating matrix cores early into the RDNA lineup.
Nov 20, 2024 5 tweets 1 min read
I am impressed by our new WebGPU WASM page load time. The whole engine loads in just a few hundred milliseconds. And games load pretty much instantly too. This is how fast things should load. We have spent a massive effort optimizing the data structures. I also rewrote the renderer.

You can't get these kinds of load times with lots of tiny memory allocs, shared pointers, hash map lookups, frequent mutexes and all the other slow stuff.
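A rough illustration of that point (my sketch, not HypeHype code): compare a pointer-and-hash-map heavy asset registry with a flat, handle-based one. The latter does a couple of large allocations and linear writes at load time instead of one small allocation plus hashing per object.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Mesh { uint32_t vertexOffset, vertexCount, indexOffset, indexCount; };

// "Slow stuff": one heap allocation plus refcount per mesh, string
// hashing and pointer chasing on every lookup. Load time scales with
// allocator and hash map overhead.
using SlowMeshRegistry = std::unordered_map<std::string, std::shared_ptr<Mesh>>;

// Flat alternative: a dense array filled linearly at load time, with a
// plain integer index as the handle. A few big allocations, cache
// friendly, and trivially serializable as one blob.
struct FlatMeshRegistry
{
    std::vector<Mesh> meshes; // the index into this array is the handle
};
```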
Oct 5, 2024 30 tweets 6 min read
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core models. 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while 17% faster in Linux server workloads. Why? What can we do to make games scale?

Thread... History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters with their own LLCs. Communication between these clusters went through main memory. You wanted to minimize data sharing between the clusters to minimize the memory overhead.
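A hedged sketch of one way to apply the same lesson on a dual-CCD desktop CPU: keep job-system workers (and the data they share) within one CCD/LLC. The "logical cores 0..7 are CCD 0" assumption is illustrative; real code should query the CPU topology instead of hardcoding it.

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Pin worker threads to the first CCD so jobs that share data stay in
// one LLC and never pay the cross-CCD (or, on Jaguar, cross-cluster via
// main memory) round trip. Assumes logical cores 0..7 map to CCD 0.
void pinWorkersToFirstCcd(std::vector<std::thread>& workers)
{
    for (size_t i = 0; i < workers.size(); ++i)
    {
        DWORD_PTR mask = 1ull << (i % 8); // one core on CCD 0 per worker
        SetThreadAffinityMask(workers[i].native_handle(), mask);
    }
}
```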
Jun 27, 2024 8 tweets 2 min read
I've been talking about mobile vs desktop GPUs lately. I wanted to clarify that I focus on mainstream GPUs, which are popular among younger audiences. Cheap <$200 Androids and 5+ year old Apple GPUs.

Apple's newest GPU is actually very advanced:
The linked thread is speculation mixed with Apple's marketing material. Some details might be wrong.

This is the holy grail. Running branchy shaders on GPU efficiently. No need to compile millions of shader variants to statically branch ahead of time. Apple is doing it first.
Jun 25, 2024 14 tweets 3 min read
This is why I have concerns about GPU-driven rendering on mobile GPU architectures. Even on the latest Android phones.

Rainbow Six: Siege is using a GPU-driven renderer (Ubisoft had 3 teams doing GPU-driven rendered games back then). The Intel iGPU runs it 2.65x faster.

TimeSpy does volume ray-marching and per-pixel OIT. Consoles got proper 3D local volume texture support recently. My benchmark using an M1 Max back in the day suggests that even Apple doesn't have 3D local volume textures in their phones. Maybe in M3. Didn't run my SDF test on it.
May 19, 2024 19 tweets 4 min read
Recently a book was released about Nokia's N-Gage handheld game console. Back then, Nokia was one of the most valuable companies in the EU. They spent lots of money to conquer the gaming market.

I was the lead programmer in their flagship game.

Thread...

hs.fi/taide/art-2000… The book realistically covers how crazy the spending was. Nokia spent a massive amount of money on marketing. They wanted to look like a gaming company. Big stands at GDC and other places. Massive launch parties, lots of press invited, etc, etc.
Feb 14, 2024 8 tweets 2 min read
I wrote my first engine (called Storm3D) during my University studies. Licensed it to FrozenByte and some other Finnish game companies. They shipped a few games using it.

It was a classic object oriented C++ code base with diamond inheritance. That was cool back in the day :)

In the year 2000, object oriented programming was based on real-world examples like "a hawk is a bird", and people tried to solve diamond pattern issues with birds that could not fly. Later people realized that real-world abstractions are bad. You should inherit/compose "flying" instead.
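A tiny sketch of that shift (illustrative, not Storm3D code): the 2000-era "a hawk is a bird (and birds fly)" hierarchy breaks as soon as a non-flying bird shows up, while composing a flight capability does not.

```cpp
// Year-2000 style: behavior baked into the class hierarchy. Breaks for
// Penguin, and multiple such behavior axes lead to diamond inheritance.
struct Bird            { virtual ~Bird() = default; virtual void fly() {} };
struct Hawk    : Bird  { void fly() override { /* soar */ } };
struct Penguin : Bird  { void fly() override { /* ...now what? */ } };

// Composition instead: "flying" is a capability an entity has (or not),
// independent of what it "is".
struct FlyingAbility { float maxSpeed = 0.0f; };

struct Entity
{
    FlyingAbility* flying = nullptr; // null = can't fly
};
```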
Dec 27, 2023 24 tweets 4 min read
Have been thinking about possible GPU-driven rendering approaches for older mobile GPUs. Traditional vertex buffers, no bindless textures, and no multidraw. Thread...

If you put all your meshes in one big mesh cache (1 buffer), you can simply change the mesh by modifying the indirect draw call's start index and index count. With an indirect instanced draw call, you can also modify the number of instances of that type.
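A sketch of that mesh cache idea, assuming a Vulkan-style indexed indirect draw (the command struct below matches the VkDrawIndexedIndirectCommand layout): swapping the mesh is just rewriting the first index and index count in the indirect args, and the instance count selects how many instances of that type are drawn.

```cpp
#include <cstdint>

// Layout-compatible with VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand
{
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// One entry per mesh in the big shared mesh cache (one index/vertex buffer).
struct MeshCacheEntry
{
    uint32_t firstIndex;   // start of this mesh in the shared index buffer
    uint32_t indexCount;   // number of indices
    int32_t  vertexOffset; // base vertex in the shared vertex buffer
};

// "Changing the mesh" of an indirect draw is just patching its args
// (on the CPU, or from a compute shader writing the indirect buffer).
// No bindless, no multidraw needed: one indirect draw per object type,
// instanceCount picks how many instances of that type to render.
void patchIndirectDraw(DrawIndexedIndirectCommand& cmd,
                       const MeshCacheEntry& mesh,
                       uint32_t visibleInstances)
{
    cmd.indexCount    = mesh.indexCount;
    cmd.firstIndex    = mesh.firstIndex;
    cmd.vertexOffset  = mesh.vertexOffset;
    cmd.instanceCount = visibleInstances;
    cmd.firstInstance = 0;
}
```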
Dec 22, 2023 9 tweets 2 min read
Intel dropped Larrabee for a reason. The workloads just didn't exist back then. Dedicated AI processors are still niche.

In 2009 games were designed for Xbox360 & PS3. No compute shaders. Simple pixel/vertex workloads with minimal control flow...

tomshardware.com/pc-components/… The simple SIMD design of Nvidia and AMD GPUs was just a better fit for both power and performance when running the graphics workloads of that era. Larrabee used modified P54C cores. It had 4-wide hyperthreading and wider SIMD units, but it was still more like a CPU than a GPU.
Oct 26, 2023 6 tweets 2 min read
New cascaded shadows GPU time = 68%

More precise culling helps a lot.

It's still a single render pass. 2048x2048 texture atlas. Three 1024x1024 cascades in it (last 1024x1024 region saved for local lights).

Let's talk about the correct way to cull shadows...

The old shadow rendering fit one shadow map to the whole frustum. Then it culled all objects using the shadow frustum, which is an ortho box. Objects in the areas marked with red can't cast a shadow onto the visible frustum geometry. ~3x the shadow draw count. Big waste.
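A hedged sketch of the kind of tighter test this is heading toward (my formulation, not necessarily the exact one used): for a directional light, an object can only shadow visible geometry if its bounds, swept along the direction the light travels, intersect the camera frustum. Objects inside the shadow ortho box that fail this test are exactly the wasted "red area" draws.

```cpp
#include <algorithm>

// Plane convention: normals point inside the frustum,
// dot(n, p) + d >= 0 means "inside the plane".
struct Aabb { float min[3]; float max[3]; };

bool canCastShadowIntoView(const Aabb& box, const float lightDir[3],
                           float maxShadowDistance, const float frustumPlanes[6][4])
{
    // Sweep: bounds of the object's AABB translated along the light
    // direction by the maximum shadow casting distance.
    Aabb swept = box;
    for (int i = 0; i < 3; ++i)
    {
        float offset = lightDir[i] * maxShadowDistance;
        swept.min[i] = std::min(box.min[i], box.min[i] + offset);
        swept.max[i] = std::max(box.max[i], box.max[i] + offset);
    }

    // Conservative swept-AABB vs camera frustum test.
    for (int p = 0; p < 6; ++p)
    {
        const float* plane = frustumPlanes[p];
        // Pick the AABB corner furthest along the plane normal.
        float corner[3];
        for (int i = 0; i < 3; ++i)
            corner[i] = plane[i] >= 0.0f ? swept.max[i] : swept.min[i];
        float dist = plane[0] * corner[0] + plane[1] * corner[1] +
                     plane[2] * corner[2] + plane[3];
        if (dist < 0.0f)
            return false; // fully outside: cannot shadow anything visible
    }
    return true;
}
```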
Oct 12, 2023 8 tweets 2 min read
Figured out a fast way to do prefix sum of a byte mask (for draw stream compaction).

Let's assume we have a byte mask: 8 bytes inside a 64 bit integer. Each byte has its first bit set if that slot has data. The byte values are 1, 256, 65536, etc...

Thread... If we want to propagate a bit to all higher bytes, we first multiply that bit by 1+256+65536+16M+... This is a fixed constant. Then we add this number to the result. Now all higher bytes in the 64 bit uint have their counters increased if the first bit was one...
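A C++ sketch of the trick: because multiplication distributes over the set bits, multiplying the whole byte mask by the fixed constant 0x0101010101010101 (= 1 + 256 + 65536 + ...) performs all of those per-bit propagate-and-add steps at once, leaving each byte lane with the inclusive count of occupied slots at or below it.

```cpp
#include <cstdint>
#include <cstdio>

// Each byte of the product holds the inclusive count of occupied slots
// up to that lane. Counts are at most 8, so byte lanes never overflow
// into each other.
uint64_t inclusivePrefixCounts(uint64_t mask)
{
    return mask * 0x0101010101010101ull;
}

// Exclusive prefix sum (the compacted output index of each occupied
// slot): simply don't count the slot's own bit.
uint64_t exclusivePrefixCounts(uint64_t mask)
{
    return mask * 0x0101010101010101ull - mask;
}

int main()
{
    // Slots 0, 2 and 3 occupied -> bytes 0, 2 and 3 hold the value 1.
    uint64_t mask = 0x0000000001010001ull;
    uint64_t excl = exclusivePrefixCounts(mask);
    uint64_t incl = inclusivePrefixCounts(mask);
    for (int i = 0; i < 8; ++i)
        printf("slot %d: occupied=%d exclusive=%d inclusive=%d\n", i,
               int((mask >> (i * 8)) & 0xff),
               int((excl >> (i * 8)) & 0xff),
               int((incl >> (i * 8)) & 0xff));
    return 0;
}
```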