Kostas Anagnostou
Lead Rendering Engineer @WeArePlayground working on @Fable. DMs open for graphics questions or mentoring people who want to get in the industry. Views my own.

Jul 8, 2019, 11 tweets

I got an interesting question via Twitter today: why would a single, high polycount mesh (>5m tris) render slowly? Without knowing more about the platform and actual use, off the top of my head (thread - please add any potential reasons I have missed):

The vertex buffer may be too fat in terms of data types and/or store more data than we need. "Smaller" formats (bytes, half, or even fixed point) may help.
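To put rough numbers on this, here is a small sketch that compares the stride of an all-float vertex with a more compact one. Both layouts and the 2.5m vertex count are assumptions made up for illustration (the question only mentions the triangle count), and the packed formats are just one plausible choice:

```python
import struct

# Hypothetical "fat" vertex: every attribute stored as 32-bit floats.
FAT_LAYOUT = {
    "position": "<3f",
    "normal": "<3f",
    "tangent": "<4f",
    "uv": "<2f",
    "colour": "<4f",
}

# Hypothetical compact vertex using smaller formats.
COMPACT_LAYOUT = {
    "position": "<4e",  # 4 x half float (the fourth component is padding)
    "normal": "<I",     # one 32-bit word, e.g. a 10:10:10:2 packing
    "tangent": "<I",    # one 32-bit word, same kind of packing
    "uv": "<2e",        # 2 x half float
    "colour": "<4B",    # 4 x unsigned byte
}


def vertex_stride(layout):
    """Bytes per vertex when the attributes are tightly packed."""
    return sum(struct.calcsize(fmt) for fmt in layout.values())


def buffer_megabytes(layout, vertex_count):
    """Total vertex buffer size in MiB for the given layout."""
    return vertex_stride(layout) * vertex_count / (1024 * 1024)


ASSUMED_VERTEX_COUNT = 2_500_000  # assumption, not from the original question

fat_stride = vertex_stride(FAT_LAYOUT)          # 64 bytes
compact_stride = vertex_stride(COMPACT_LAYOUT)  # 24 bytes
fat_mb = buffer_megabytes(FAT_LAYOUT, ASSUMED_VERTEX_COUNT)
compact_mb = buffer_megabytes(COMPACT_LAYOUT, ASSUMED_VERTEX_COUNT)
```

Whether the precision loss is acceptable depends on the content, so treat the compact layout as a starting point rather than a recommendation.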

Also using a structure of arrays layout (a different vertex stream per attribute -- position, normal etc) can make it easier to only bind vertex data that we need and increase cache hits.
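A back-of-the-envelope sketch of why separate streams help when a pass only needs some attributes (a position-only pass is used as the example here). The attribute sizes are assumptions, and the interleaved case is simplified to "the whole vertex gets pulled in", which is only roughly what cache-line granularity does in practice:

```python
# Hypothetical attribute sizes in bytes (all 32-bit floats).
ATTRIBUTE_BYTES = {"position": 12, "normal": 12, "tangent": 16, "uv": 8}


def bytes_fetched_interleaved(vertex_count, needed):
    """Array-of-structures: attributes share cache lines, so we assume
    the full stride is fetched even if only some attributes are read."""
    stride = sum(ATTRIBUTE_BYTES.values())
    return vertex_count * stride


def bytes_fetched_separate_streams(vertex_count, needed):
    """Structure-of-arrays: only the bound streams are fetched."""
    return vertex_count * sum(ATTRIBUTE_BYTES[name] for name in needed)


position_only = ["position"]
interleaved = bytes_fetched_interleaved(1000, position_only)
separate = bytes_fetched_separate_streams(1000, position_only)
```

With these made-up sizes the position-only pass reads a quarter of the data when the streams are split.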

I also like the variable frequency data storing idea suggested by @SebAaltonen: if possible, store per-triangle (or lower frequency) data in a separate stream and index into it in the vertex shader. Some of the index bits could be shared for that purpose.
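One possible reading of "sharing index bits", sketched on the CPU side: keep the low bits of each 32-bit index for the vertex and use the high bits to point into a small table of low-frequency data. The 24/8 split and the function names are assumptions for illustration, not the scheme from the original suggestion:

```python
VERTEX_BITS = 24                       # assumed split: 24 bits of vertex index
TABLE_BITS = 32 - VERTEX_BITS          # 8 bits left for the data table index
VERTEX_MASK = (1 << VERTEX_BITS) - 1


def pack_index(vertex_index, table_index):
    """Combine a vertex index and a low-frequency data index in 32 bits."""
    if not 0 <= vertex_index <= VERTEX_MASK:
        raise ValueError("vertex index does not fit in the low bits")
    if not 0 <= table_index < (1 << TABLE_BITS):
        raise ValueError("table index does not fit in the high bits")
    return (table_index << VERTEX_BITS) | vertex_index


def unpack_index(packed):
    """What a vertex shader doing its own fetch would do with the index."""
    return packed & VERTEX_MASK, packed >> VERTEX_BITS


packed = pack_index(vertex_index=1_234_567, table_index=42)
vertex_index, table_index = unpack_index(packed)
```

This only works if the vertex shader fetches vertex data manually from the unpacked index, and the split limits both the vertex count and the table size.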

Is an index buffer actually used? Without an index buffer you can't use the post transform vertex cache, and you may have to transform the same vertices multiple times.
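A minimal sketch of turning a non-indexed triangle list into an indexed one by merging identical vertices. The function name and the tiny quad are made up for illustration; real tools also have to decide when two vertices with slightly different attributes count as "the same":

```python
def build_index_buffer(triangle_soup):
    """triangle_soup: flat list of hashable vertices, three per triangle.
    Returns (unique_vertices, indices)."""
    unique_vertices = []
    remap = {}      # vertex -> position in unique_vertices
    indices = []
    for vertex in triangle_soup:
        if vertex not in remap:
            remap[vertex] = len(unique_vertices)
            unique_vertices.append(vertex)
        indices.append(remap[vertex])
    return unique_vertices, indices


# A quad made of two triangles that share an edge.
quad_soup = [(0, 0), (1, 0), (1, 1),
             (0, 0), (1, 1), (0, 1)]
unique_vertices, indices = build_index_buffer(quad_soup)
```

Without indices the quad costs 6 vertex transforms; with indices the best case is 4, provided the shared vertices are still in the cache when they are referenced again.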

With a >5m tri mesh, a 32-bit index buffer might increase bandwidth requirements; it may be worth investigating a triangle strip topology.
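The index buffer arithmetic, as a quick sketch. The strip figure is an idealised lower bound that assumes the whole mesh fits in a single strip; real meshes need restarts or degenerate triangles, so the saving will be smaller:

```python
def triangle_list_index_bytes(triangle_count, bytes_per_index=4):
    """A triangle list stores three indices per triangle."""
    return triangle_count * 3 * bytes_per_index


def single_strip_index_bytes(triangle_count, bytes_per_index=4):
    """Idealised single strip: n triangles need n + 2 indices."""
    return (triangle_count + 2) * bytes_per_index


TRIANGLES = 5_000_000
list_bytes = triangle_list_index_bytes(TRIANGLES)    # 60,000,000 bytes
strip_bytes = single_strip_index_bytes(TRIANGLES)    # 20,000,008 bytes
```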

Is a lot of data passed from vertex shader to pixel shader? If so, the parameter cache between vertex and pixel shader could become a bottleneck.

It is harder to distance sort triangles in a single large mesh, and this could lead to overdraw in the pixel shader. Additionally, frustum culling works on the whole mesh, so it can't cull off-screen triangles and we unnecessarily pay the cost of vertex shading them.

High density meshes can lead to a lot of sub-pixel triangles, leading to wasted vertex shading of triangles that'll get rejected by the rasteriser.
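A rough way to see how quickly triangles become sub-pixel: divide the triangle count by the pixels the mesh covers. The 1920x1080 resolution and the coverage values are assumptions, and the estimate ignores back-facing and occluded triangles:

```python
def triangles_per_pixel(triangle_count, width, height, screen_coverage=1.0):
    """Average number of triangles landing on each covered pixel.
    screen_coverage is the fraction of the screen the mesh fills."""
    covered_pixels = width * height * screen_coverage
    return triangle_count / covered_pixels


full_screen = triangles_per_pixel(5_000_000, 1920, 1080)
quarter_screen = triangles_per_pixel(5_000_000, 1920, 1080, screen_coverage=0.25)
```

Even if the mesh filled the whole screen, this comes out at more than two triangles per pixel on average, and it gets worse as the mesh moves away from the camera.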

The mesh vertices may not be cache optimised: the triangle order may force the GPU to revisit an already transformed vertex much later (by which time it may have already been evicted).
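To illustrate the effect of triangle order, here is a toy simulation of a post transform cache that counts vertex transforms per triangle (often called ACMR). The 16-entry FIFO cache is an assumption; real hardware differs in both size and policy, so only the relative difference matters:

```python
import random
from collections import deque


def average_cache_miss_ratio(indices, cache_size=16):
    """Vertex transforms per triangle for a simple FIFO cache model."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)  # the oldest entry drops out when full
    return misses / (len(indices) // 3)


def grid_triangles(quads_per_side):
    """Two triangles per quad, emitted row by row."""
    row = quads_per_side + 1  # vertices per row
    triangles = []
    for y in range(quads_per_side):
        for x in range(quads_per_side):
            a = y * row + x
            b = a + 1
            d = a + row
            c = d + 1
            triangles.append((a, b, c))
            triangles.append((a, c, d))
    return triangles


def flatten(triangles):
    return [index for triangle in triangles for index in triangle]


ordered = grid_triangles(32)
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # same triangles, incoherent order

ordered_acmr = average_cache_miss_ratio(flatten(ordered))
shuffled_acmr = average_cache_miss_ratio(flatten(shuffled))
```

The row-by-row order ends up close to one transform per triangle, while the shuffled order approaches three, i.e. almost no reuse at all. Dedicated optimisers can do better than plain row order.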

Of course, those reasons can affect smaller meshes as well. With a large drawcall, set up costs are usually minimal, and the big hit may come from bandwidth and inefficient processing.
