Finland was a key tech player 20 years ago: we invented the SSH and IRC protocols. Nokia was the EU's most valuable company, selling more phones yearly than Apple and Samsung sell today combined. We invented the OS that runs most internet servers today. Nokia failed and Linux is free...
Finland has some new successes: Wolt is the biggest EU food delivery service, Oura was the first health ring, and Silo AI is one of the EU's biggest AI companies. Wolt got sold to DoorDash ($3.5B), Silo AI got sold to AMD ($665M). Oura is still an $11B Finnish company.
We also had a Finnish Facebook before Facebook. IRC-Galleria was used by all young adults when I was studying at university. Most girls I knew used it, which should have told investors a lot. But they never took it international.
It's sad: Finland could have been an internet superpower, but Finnish companies made the wrong moves and the next wave of Finnish companies got sold. The only remaining big things, SSH and Linux, are free. And Oura is heavily challenged by Apple and Samsung smart rings.
IRC lost to MSN Messenger. The silver lining is that Netscape wasn't Finnish. That would have made perfect sense :D
Finland also had a GPU company: Bitboys. It got sold to ATI, which was then bought by AMD. It was an EDRAM-based design. Xbox 360 devs might still remember it. It was later sold to Qualcomm (Adreno). As we all know, Adreno is an anagram of Radeon.
I have realized that there aren't that many people out there who understand the big picture of modern GPU hardware and all the APIs: Vulkan 1.4 with the latest extensions, DX12 SM 6.6, Metal 4, OpenCL and CUDA. What is the hardware capable of? What should a modern API look like?
My "No Graphics API" blog post will discuss all of this. My conclusion is that Metal 4.0 is actually closest to the goal. It has flaws too. DX12 SM 6.6 doesn't have those particular flaws, but has a lot of other flaws. Vulkan has all the flaws combined, with useful extensions :)
Of course WebGPU doubled down on Vulkan's design mistakes. Bind groups are immutable and there are no escape hatches for dynamic bindings. No persistently mapped GPU memory. And a brand new shader language without 64-bit pointer support.
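To make the bind group complaint concrete, here's a minimal sketch assuming the webgpu.h C API (makeBindGroup is my own helper name, not from the thread). Because bind groups are baked at creation, pointing a binding at a different buffer means building an entirely new bind group object; there is no way to patch a single slot in place:

```cpp
#include <webgpu/webgpu.h>

WGPUBindGroup makeBindGroup(WGPUDevice device, WGPUBindGroupLayout layout,
                            WGPUBuffer buffer, uint64_t size) {
    WGPUBindGroupEntry entry = {};
    entry.binding = 0;        // slot 0 of the layout
    entry.buffer  = buffer;
    entry.offset  = 0;
    entry.size    = size;

    WGPUBindGroupDescriptor desc = {};
    desc.layout     = layout;
    desc.entryCount = 1;
    desc.entries    = &entry;
    // The returned bind group is immutable: swapping the buffer next
    // frame means calling this again and creating a fresh object.
    return wgpuDeviceCreateBindGroup(device, &desc);
}
```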
The past decades have been a wonderful time for gamers+devs. The biggest chips, using the latest nodes and billions worth of R&D, were all targeted at gaming. Now, those chips are needed by professionals (AI). We'll never see a big-die GPU at a reasonable price point anymore :(
The fun lasted for a very long time, but it's over on both the CPU and GPU side. The biggest CPU and GPU dies are no longer designed for gamers. A top-end Threadripper costs over $10k today. A top-end Nvidia B200 costs over $30k. A few generations ago, top-tier HW was targeting gamers :(
AMD no longer produces big-die GPUs for gamers. Nvidia has a low-volume $2500+ halo product. But it's much smaller than Nvidia's B200 GPU, which has two glued-together dies, each slightly bigger than an RTX 5090. Chiplet GPUs, Threadripper style, are coming. Will gaming GPUs be limited to a few chiplets?
Unit tests have lots of advantages, but the cons are often ignored:
- Code must be split into testable parts, often requiring more interfaces, which add code bloat and complexity.
- Each call site is a dependency. Test case = +1 dependency. Added inertia to refactor and throw away code.
- Bloated unit test suites taking several hours to execute. This slows down devs and causes merge conflicts as pushes are delayed.
- Unstable tests randomly failing pushes.
- Unit test maintenance and optimization are needed to keep tests manageable. Otherwise developer velocity suffers.
It's crucial to make your unit tests fast. Don't load files from disk, and definitely don't do network requests. Embed the data (with a bin->hdr tool, for example). If your whole test suite runs in <10 seconds, you are golden. But writing good optimized tests like this takes effort.
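For illustration, here's a minimal sketch of what such a bin->hdr tool could look like (bin2hdr and kTestData are invented names): a build step converts a binary test asset into a C++ header, so tests parse from embedded memory instead of touching the disk.

```cpp
// bin2hdr: emit a C++ header embedding the bytes of a binary file.
#include <cstdio>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: bin2hdr <in.bin> <out.h>\n");
        return 1;
    }
    std::FILE* in  = std::fopen(argv[1], "rb");
    std::FILE* out = std::fopen(argv[2], "w");
    if (!in || !out) return 1;

    std::fprintf(out, "static const unsigned char kTestData[] = {");
    int c, n = 0;
    while ((c = std::fgetc(in)) != EOF)
        std::fprintf(out, "%s0x%02x,", (n++ % 16 == 0) ? "\n    " : " ", c);
    std::fprintf(out, "\n};\n"
                      "static const unsigned long kTestDataSize = sizeof(kTestData);\n");
    std::fclose(in);
    std::fclose(out);
    return 0;
}
```

A test then includes the generated header and reads kTestData directly: no disk I/O, no network, no fixtures to deploy.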
When you split a function into N small functions, the reader also suffers multiple "instruction cache" misses (just like the CPU executing it). They need to jump around the code base to continue reading. Big linear functions are fine. Code should read like a book.
Messy big functions with lots of indentation (loops, branches) should be avoided. Extracting is a good practice here. But often functions like this are a code smell. Why do you need all those branches? Why is the function doing so many unrelated things? Maybe it's too generic? Refactor?
There's a rule of thumb that you write separate code for each call site until you have repeated yourself three times; then you merge them together. But people often forget the opposite: you have to split a function when the call sites' requirements diverge. Don't add more branches!
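A tiny hypothetical example of that last point (the pricing functions are invented for illustration): when a second call site needs different behavior, split instead of growing the shared function with flags and branches.

```cpp
// Before: one shared function that grows a branch per call site.
float price(float base, bool isMember, bool isHoliday) {
    float p = base;
    if (isMember)  p *= 0.9f;  // added for call site A
    if (isHoliday) p *= 0.8f;  // added for call site B
    return p;
}

// After: each call site gets its own straight-line function.
float memberPrice(float base)  { return base * 0.9f; }
float holidayPrice(float base) { return base * 0.8f; }
```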
WebGPU CPU->GPU update paths are designed to be super hard to use. Map is async and you should not wait for it, so you can't map->write->render in the same frame.
wgpuQueueWriteBuffer runs on the CPU timeline. You need to wait for a callback to know the buffer is no longer in use.
Waiting for a callback is not recommended on the web, and there's no API for asking how many frames you have in flight. So you have to dynamically create new staging buffers (in a ring) based on callbacks to use wgpuQueueWriteBuffer safely. Otherwise it will trash data the GPU is still using.
You are not allowed to map or wgpuQueueWriteBuffer even a different region of a buffer used by any GPU frame in flight. You need an entirely different buffer.
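Here's a minimal sketch of that ring, assuming an older webgpu.h revision (newer headers moved to callback-info structs, so the wgpuQueueOnSubmittedWorkDone signature differs); StagingRing and its members are invented names. The ring grows on demand because you can't query how many frames are in flight, and a slot is recycled only when its work-done callback fires:

```cpp
#include <webgpu/webgpu.h>
#include <deque>

struct StagingRing {
    struct Slot { WGPUBuffer buffer; bool busy; };
    WGPUDevice device;
    uint64_t size;            // size of each staging buffer
    std::deque<Slot> slots;   // deque keeps Slot* stable while growing

    // Reuse a free slot, or grow the ring if every buffer is still in
    // flight. The ring size thus adapts to however many frames the
    // implementation keeps in flight.
    Slot* acquire() {
        for (Slot& s : slots)
            if (!s.busy) { s.busy = true; return &s; }
        WGPUBufferDescriptor desc = {};
        desc.usage = WGPUBufferUsage_CopySrc | WGPUBufferUsage_CopyDst;
        desc.size  = size;
        slots.push_back({ wgpuDeviceCreateBuffer(device, &desc), true });
        return &slots.back();
    }
};

// Write into a fresh staging buffer (never one the GPU may still read)
// and record a copy into the destination buffer.
StagingRing::Slot* upload(StagingRing& ring, WGPUQueue queue,
                          WGPUCommandEncoder enc, WGPUBuffer dst,
                          const void* data, uint64_t bytes) {
    StagingRing::Slot* slot = ring.acquire();
    wgpuQueueWriteBuffer(queue, slot->buffer, 0, data, (size_t)bytes);
    wgpuCommandEncoderCopyBufferToBuffer(enc, slot->buffer, 0, dst, 0, bytes);
    return slot;
}

// Call after wgpuQueueSubmit() of the commands recorded above: the
// slot becomes reusable once the GPU work-done callback fires.
void recycleOnWorkDone(WGPUQueue queue, StagingRing::Slot* slot) {
    wgpuQueueOnSubmittedWorkDone(queue,
        [](WGPUQueueWorkDoneStatus, void* userdata) {
            static_cast<StagingRing::Slot*>(userdata)->busy = false;
        }, slot);
}
```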
Refactored our CommandBuffer interface to support compute. Final result:
A compute pass contains N dispatches, just like a render pass contains N draws (split into areas = viewports).
The render pass object is static (due to Vulkan 1.0). Compute has a dynamic write resource list.
This is how you would use the API to dispatch a compute pass with a single compute shader writing to two SSBOs.
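The original tweet showed the usage as a code screenshot, which isn't preserved here. Below is a hypothetical reconstruction from the description alone; every name (CommandBuffer::computePass, Dispatch, Buffer, etc.) is invented, but it matches the stated shape: one call per pass, a dynamic write resource list, and N dispatches.

```cpp
#include <cstdint>
#include <span>

struct Buffer;           // opaque GPU buffer handle
struct ComputePipeline;  // opaque compute pipeline handle

struct Dispatch {
    ComputePipeline* pipeline;
    uint32_t groupsX, groupsY, groupsZ;
};

struct CommandBuffer {
    // One virtual call per compute pass: the pass declares which
    // resources it writes and the dispatches it contains.
    virtual void computePass(std::span<Buffer* const> writeResources,
                             std::span<const Dispatch> dispatches) = 0;
};

void recordPass(CommandBuffer& cb, ComputePipeline& shader,
                Buffer& ssbo0, Buffer& ssbo1) {
    // Arrays live in the caller's stack; spans are just ptr + size.
    Buffer*  writes[]     = { &ssbo0, &ssbo1 };
    Dispatch dispatches[] = { { &shader, 256, 1, 1 } };
    cb.computePass(writes, dispatches);
}
```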
This new API requires one virtual function call per pass, which is not a problem. The passed data commonly lives on the stack or is temp allocated (frame bump allocator). No copies (a span is just ptr + size). And initializer lists (if used) live in the caller's stack.