This is how I managed to port Claybook from consoles to a ~4x slower handheld. Start state: frame rate = 60 fps locked, resolution = temporally upscaled 1080p on Xbox One base model (4K on pro consoles)...
Handheld has 720p display resolution. That cuts down pixel count by ~2x. Drop frame rate to 30 fps for another 2x cut. Now pixel bound passes are around 4x faster.
Drop shadow map res from 512x512 to 384x384. This is -25% on each dimension. Result = only a small quality drop. But x^2 scaling drops pixel processing cost down to 56%, which is very nice. Rest of the smap difference comes from HW differences. GCN2 is slow in geom bound cases.
Now if we were a normal well optimized 60 fps @ 1080p game, such as Doom, we would simply do some additional micro-optimizations and be done. Unfortunately Claybook uses 25% of GPU time on Xbox One base model for physics simulation. Only 75% for rendering...
GPGPU physics needs to be run at 60 fps. If rendering is 30 fps, you need to tick physics twice per frame. Physics cost doesn’t scale down with render resolution either. The double ticking cancels out the doubled frame budget, so the ~4x slower GPU means 25% of frame * 4 = 100% of the frame. Physics would take all GPU time on mobile. Nothing left for rendering...
So we need to scale down physics cost also by 4x to hit our target. I decided to achieve this goal by reducing particle counts by 2x and also optimizing all simulation code to be 2x faster. Thus it would be 4x faster in total. Let’s see what this means...
Let’s increase fluid particle radius by 26%. The volume taken by each particle grows by 1.26^3 ≈ 2.00x. Thus you need only half as many particles to fill the same volume. This is excellent! Only a slight increase in particle radius and a 2x reduction in processing cost.
For clay sim, we only have particles on the clay surface, so scaling is x^2 instead of x^3. For example, if one side of a clay cube was 142x142 particles, we scale it down to 100x100 to halve the cube’s particle count. That’s only a ~30% reduction in linear dimension quality.
Now we have the hardest task left. How to make every step in the physics simulator 2x faster, while retaining the quality. This is pure optimization work. This is where my last 6 weeks of the porting process were spent. I will now go through some of these optimizations...
The first task was to rewrite the fluid simulator data structure. We had a grid that stored particle ids. Particle data was in one linear array. As the simulation progressed, particles in a local neighborhood were no longer local in memory (same cache line). L1$ hit ratio was 60%. That’s bad...
I double buffered the whole grid and put all particle data in the grid. During grid building, I allocate (atomic inc) a region of memory for each 4x4x4 grid region. Thus each 4x4x4 region of grid cells is adjacent in memory. L1$ hit ratio increased to 90%. Fluid sim runs 2x faster.
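A minimal HLSL sketch of this kind of per-region allocation (buffer names, layout and group size are my assumptions, not the actual Claybook code):

```hlsl
// Sketch: reserve one contiguous slice of particle storage per 4x4x4 cell region,
// so neighboring cells end up adjacent in memory (hypothetical names and layout).
StructuredBuffer<uint>   RegionParticleCount; // particles per 4x4x4 region, from a counting pass
RWStructuredBuffer<uint> AllocationCounter;   // single global counter at index 0
RWStructuredBuffer<uint> RegionBaseOffset;    // where each region's particle data starts

[numthreads(64, 1, 1)]
void AllocateRegionsCS(uint3 dtid : SV_DispatchThreadID)
{
    uint region = dtid.x;
    uint count  = RegionParticleCount[region];

    uint base = 0;
    if (count > 0)
    {
        // One atomic add reserves a contiguous block for the whole 4x4x4 region.
        InterlockedAdd(AllocationCounter[0], count, base);
    }
    RegionBaseOffset[region] = base;

    // A later pass scatters each particle's data to base + localIndex.
    // The grid is double buffered, so last frame's layout stays readable while building this one.
}
```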
Fluid SDF volume generation took 5ms. It’s an indirect dispatch (4x4x4 tiles) and processes all fluid tiles that had particles in them. Empty tiles were not processed, but fully filled (inside) tiles were processed. Filled tile = most expensive...
I added an early out to the fluid SDF generation shader, with a heuristic that detects a full tile without accessing the particles. It simply loaded the grid particle counts (which were needed by the shader later anyway) and made a conservative estimate based on max pressure. SDF gen = 30% faster.
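Roughly what such an early out could look like in HLSL (the tile list, resource names and threshold are illustrative; the real heuristic is the conservative max-pressure estimate described above):

```hlsl
// Sketch: skip the per-particle distance loop for tiles the heuristic deems fully inside the fluid.
StructuredBuffer<uint3> TileCoords;        // tile list consumed by the indirect dispatch
StructuredBuffer<uint>  TileParticleCount; // per-tile particle counts, needed later by the shader anyway
RWTexture3D<float>      FluidSDF;

static const uint  FULL_TILE_THRESHOLD = 256;  // hypothetical conservative "tile is full" estimate
static const float INSIDE_DISTANCE     = -1.0; // clamped interior distance

[numthreads(4, 4, 4)]
void FluidSDFTileCS(uint3 groupThread : SV_GroupThreadID, uint3 groupId : SV_GroupID)
{
    uint  tile  = groupId.x;
    uint3 voxel = TileCoords[tile] * 4 + groupThread;

    if (TileParticleCount[tile] >= FULL_TILE_THRESHOLD)
    {
        // Conservatively inside the fluid: write a constant distance and bail out
        // before touching any particle data.
        FluidSDF[voxel] = INSIDE_DISTANCE;
        return;
    }

    // ... expensive per-particle distance evaluation for partially filled tiles ...
}
```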
SDF gen needed further ALU opts. It had an ALU heavy particle loop. The exponential min formula used a base-e exp (two instructions); I replaced it with a base-2 exp and moved the multiply out of the loop. Then I did some preprocessing on the particles stored to groupshared mem before the loop. 10% faster.
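In isolation, the exp trick looks roughly like this (function and variable names are mine, as a sketch):

```hlsl
// Before: base-e exp inside the hot per-particle loop.
// GCN lowers exp(x) to a multiply by log2(e) plus a hardware exp2, i.e. two instructions per particle.
float SmoothMinWeight(float k, float distSq)
{
    return exp(-k * distSq);
}

// After: scale k by log2(e) once, outside the loop, so the loop body is a single exp2.
float SmoothMinWeightFast(float kLog2e, float distSq)
{
    return exp2(-kLog2e * distSq); // caller passes k * 1.4426950 (log2 e), computed once per dispatch
}
```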
Fluid SDF resolution was reduced from the closest-to-pow2 size of 256x256x128 to 192x192x96. This is a 25% reduction in linear dimensions, but results in a 2.37x reduction in processing cost. Combined with the optimizations above, it was close enough to being 4x faster.
Grid generation shaders were never a bottleneck on consoles: a chain of 5 dependent passes, but running in async compute. I interleaved fluid grid and SDF modification grid generation to halve the barrier count. Then I merged two grid passes together to shave off one more barrier.
There was only one ”expensive” grid pass. It dilated the grid, generated two mips of grid masks and generated all the indirect tile coordinate arrays. This pass was cheap on consoles, so it was never optimized. I added an early out for fully empty 4x4x4 tiles (2x perf)...
Then I also noticed that there was a super heavy ALU sequence in this shader and got rid of that too. Now the grid generation was 4x faster. Another goal met.
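A sketch of the empty-tile early out, assuming one 4x4x4 thread group per grid tile (resource names and grid size are illustrative):

```hlsl
// Sketch: vote whether any cell in the 4x4x4 tile is occupied; if not, skip the whole tile.
StructuredBuffer<uint> CellParticleCount; // per-cell occupancy from the grid build

static const uint3 GRID_DIM = uint3(64, 64, 32); // hypothetical grid dimensions

uint CellIndex(uint3 c)
{
    return (c.z * GRID_DIM.y + c.y) * GRID_DIM.x + c.x;
}

groupshared uint gs_TileOccupied;

[numthreads(4, 4, 4)]
void DilateAndMipCS(uint3 cell : SV_DispatchThreadID, uint threadIndex : SV_GroupIndex)
{
    if (threadIndex == 0)
        gs_TileOccupied = 0;
    GroupMemoryBarrierWithGroupSync();

    if (CellParticleCount[CellIndex(cell)] != 0)
        InterlockedOr(gs_TileOccupied, 1u);
    GroupMemoryBarrierWithGroupSync();

    // Entire 4x4x4 tile is empty: skip dilation, mask mips and indirect coordinate writes.
    if (gs_TileOccupied == 0)
        return;

    // ... expensive dilation + mask mip + indirect tile coordinate generation ...
}
```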
Before going to the clay simulation specifics, I want to point out that I also optimized ray-traced soft shadows. Biases (epsilons) were slightly tweaked (no visual impact, 2% perf difference). But the biggest optimization was this:
The first clay simulation optimization was to use a stored per-mesh alive particle count instead of a per-particle alive flag. We support up to 16k particles per mesh, but the common count was around 10k to 14k. Particles were compacted and area local, but...
We used an if(alive) branch in many places. Instead, I now calculate the alive mask for each lane from the stored mesh particle counter, and I added early outs at the beginning of each particle group if none of its particles are alive (not all passes were indirect). 20% perf gain.
The above optimization was especially nice for the shape match reduction and gameplay reduction passes. There was no longer any need to reduce the alive flags (groupshared log2 reduction) to get the counts (for sum dividers).
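A sketch of deriving the alive mask from the stored count, assuming each mesh’s particles are compacted and stored contiguously (names and sizes are illustrative):

```hlsl
// Sketch: "alive" becomes a simple index compare against the per-mesh count,
// plus a uniform early out for thread groups past the last alive particle.
StructuredBuffer<uint> MeshAliveParticleCount; // one counter per clay mesh

static const uint PARTICLES_PER_MESH = 16384;  // 16k max particles per mesh
static const uint GROUP_SIZE         = 64;     // must match numthreads; 16384 is a multiple of 64,
                                               // so a group never spans two meshes

[numthreads(64, 1, 1)]
void ClaySimStepCS(uint3 dtid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    uint mesh          = dtid.x / PARTICLES_PER_MESH;
    uint particleIndex = dtid.x % PARTICLES_PER_MESH;
    uint aliveCount    = MeshAliveParticleCount[mesh];

    // Whole group is past the last alive particle: exit before touching any particle data.
    // This matters for the passes that are not indirect dispatches.
    uint groupFirstParticle = (gid.x * GROUP_SIZE) % PARTICLES_PER_MESH;
    if (groupFirstParticle >= aliveCount)
        return;

    // Particles are compacted, so the alive mask is just an index compare, no flag load needed.
    if (particleIndex >= aliveCount)
        return;

    // ... simulation work for this particle ...
}
```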
I also implemented lots of smaller ALU optimizations for clay physics simulation steps.
We modify world SDF every frame. Thousands of small cuts. SDF generation shaders were quite slow. Groupshared memory storage was optimized. This resulted in higher occupancy of these shaders, increasing performance by 50%. This opt also affected fluid SDF gen (above).
Every frame we also run grid generation shaders to generate the collision grid and the gameplay trigger grid. These shaders were dirt cheap on consoles, but on the handheld the total cost was almost 1 millisecond. A groupshared mem optimization made these shaders 5x faster.
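The general shape of such a groupshared optimization, as a sketch (the packing scheme and sizes are my own illustration, not the actual Claybook layout):

```hlsl
// Sketch of the kind of groupshared (LDS) footprint reduction that raises occupancy.
// The layout below is illustrative, not Claybook's actual format.

#define GROUP_PARTICLES 256

// Before: full precision particle positions = 4 KB of LDS per thread group.
// groupshared float4 gs_ParticlePos[GROUP_PARTICLES];

// After: tile-relative 16-bit fixed point = 2 KB of LDS per thread group,
// so more thread groups fit per compute unit and memory latency is hidden better.
groupshared uint2 gs_PackedPos[GROUP_PARTICLES];

// Quantize a tile-local position in [0, 1) to 16 bits per axis.
uint2 PackPosition(float3 p)
{
    uint3 q = (uint3)(saturate(p) * 65535.0);
    return uint2(q.x | (q.y << 16), q.z);
}

float3 UnpackPosition(uint2 v)
{
    uint3 q = uint3(v.x & 0xffff, v.x >> 16, v.y & 0xffff);
    return (float3)q * (1.0 / 65535.0);
}
```

Fewer LDS bytes per thread group means more groups can be resident per compute unit, which is what hides the memory latency in these scatter-heavy passes.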
I would have done more hardware specific scalar->cbuffer optimizations but UE4 doesn’t expose any way to modify constant buffers using compute shaders. Thus I analyzed common case optimizations that helped AMD+Nvidia+Intel. Result = low end Intel iGPUs now also run 35% faster.
I used two Nvidia GPUs (low end Maxwell + Titan X Pascal), AMD Vega 64, four GCN based consoles and Intel NUC as my profiling devices. Analyzed bottleneck on all. Mostly non-platform specific optimizations. Result = iPhone runs Claybook at 30+ fps now too :)
In addition to the handheld console, I also ported Claybook to Metal and Vulkan to allow further porting. Perf looks very good on the first phone we tested (iPhone X).
End result = We didn’t need to cut any visual effects or remove any gameplay features or cut levels. Handheld visuals are very close to consoles. Of course resolution is lower, but screen is much smaller too. Temporal upscaling works very well with Claybook content.
Claybook handheld port is a polar opposite of this:
Hitting 60 fps on base Xbox was a huge effort too (proto ran at 20 fps). I did more optimization work during the original PC+console project. But once we hit 60 fps, we stopped optimizing. These handheld optimizations mostly affected passes that were fast on consoles...
It would be an easy mistake to conclude that our code base was badly optimized, since I managed to find so many 2x+ optimizations. But on consoles most of these passes took 0.1 ms or less. Time spent optimizing those shaders wouldn’t have been worth it.
So instead, to hit 4K, I optimized lots of UE4 shaders, such as making PostProcessHistogram 9x faster.