Fabian Giesen @rygorous
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen no, quads get enqueued to the ROP _before_ the PS starts.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen Blend depends on a (long-latency) DRAM read (of the previous pixel contents). That latency needs to be hidden, just like other memory latencies do. But the blend happens in the ROP stage.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen As a quad is output from the rasterizer, two things happen:
1. it gets packed into the next available slot in a warp (or wavefront) for the PS, which launches when full (or on flush, e.g. shader change)
2. it gets sent to the ROP that owns it, which gives it a queue entry and
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen starts the DRAM read if required. (Similar thing also happens for the depth/stencil portion). Eventually that DRAM read returns and the queue entry gets the data below that quad poked into it.
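A minimal sketch of those two steps, under loud assumptions: `WAVE_QUADS`, `RopEntry` and `Dispatcher` are made-up names, there is a single ROP instead of one per screen region, and one read is recorded per quad purely for simplicity. It is a toy model, not a description of any particular GPU.

```python
from collections import deque
from dataclasses import dataclass

WAVE_QUADS = 16  # assumption: 64 lanes / 4 pixels per quad


@dataclass
class RopEntry:
    quad_id: int
    needs_dest_read: bool
    dest_data: object = None  # poked in when the DRAM read returns
    export: object = None     # filled in when the PS exports


class Dispatcher:
    """Hypothetical stand-in for what happens to each quad leaving the rasterizer."""

    def __init__(self):
        self.current_wave = []    # quads packed so far, not yet launched
        self.launched_waves = []  # waves handed to the PS
        self.rop_queue = deque()  # ROP entries in rasterizer order
        self.dest_reads = []      # quads whose destination read was started

    def on_quad(self, quad_id, blending):
        # 1. pack into the next free slot of the current wave, launch when full
        self.current_wave.append(quad_id)
        if len(self.current_wave) == WAVE_QUADS:
            self.flush()
        # 2. give the quad a ROP queue entry and start the read if required
        self.rop_queue.append(RopEntry(quad_id, needs_dest_read=blending))
        if blending:
            self.dest_reads.append(quad_id)

    def flush(self):
        # e.g. on shader change: launch a partially filled wave
        if self.current_wave:
            self.launched_waves.append(self.current_wave)
            self.current_wave = []
```

Note that the ROP entry exists before the wave containing the quad has even launched, which is the point of the first tweet.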
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen PS execution is decoupled from the ROP fetch for the most part. Not sure if the export is lockstep (i.e. the active PS needs to be at the head of the queue for the export to happen) or if there's some slack (i.e. an extra queue in between),
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen but for the ROP to "retire" a queue entry, it needs 1. the DRAM read (if it happened) to have finished, 2. the export data to be available (i.e. PS done), 3. the blend to have actually computed the result, at which point the new pixel data is written back.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen NB there is not actually a DRAM request per quad. There's _lots_ of coalescing going on to try and create long burst reads and writes where possible. But the ROP is the thing that guarantees in-order blend "retirement"; PS wave/warp execution is _not_ lockstep.
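The retirement rule can be sketched like this. `RopQueue` and its fields are hypothetical, and the destination value is simply looked up at retire time as a stand-in for the returned DRAM data; the only thing the sketch is meant to show is that entries leave from the head, in order, no matter which PS wave finishes first.

```python
from collections import deque


class RopQueue:
    """Toy in-order ROP queue: entries retire strictly from the head."""

    def __init__(self):
        self.queue = deque()   # entries in rasterizer order
        self.framebuffer = {}  # pixel -> value
        self.retired = []      # quad ids in retirement order

    def enqueue(self, quad_id, pixel, needs_read):
        entry = {
            "quad": quad_id,
            "pixel": pixel,
            "read_done": not needs_read,  # no read needed counts as done
            "export": None,               # set when the PS finishes
        }
        self.queue.append(entry)
        return entry

    def try_retire(self, blend):
        # Retire as many head entries as are complete; stop at the first
        # one that is still waiting on its read or on its PS export.
        while self.queue:
            head = self.queue[0]
            if not head["read_done"] or head["export"] is None:
                break
            old = self.framebuffer.get(head["pixel"], 0.0)
            self.framebuffer[head["pixel"]] = blend(old, head["export"])
            self.retired.append(head["quad"])
            self.queue.popleft()
```

If the second quad's shader finishes before the first, `try_retire` does nothing until the first is complete, so a non-commutative blend still sees the quads in submission order.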
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen To make it "properly" OoO while preserving semantics, you'd need a lot more tracking infrastructure a la OoO CPU load/store queues, which are actually _not_ proper queues, because they need to scan for ordering dependencies on every access.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen Doing this for a queue that needs to be deep enough to hide a DRAM access (still >=100 cycles even at GPU clock freqs) is decidedly icky, not to mention power-hungry, and not something you do unless you're desperate. :P
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen Specifically, the _really nice thing_ you can do if and only if your memory access pipe is completely in-order is to build the whole thing as a cache where the tags are 100+ (however deep your queue is) cycles ahead of the actual contents of the data array.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen On enqueue, you check the tags to see if you already "have" the data. If not, you send out a request for that "cache line" (that's the actual DRAM read) and mark the line as present _immediately_ and go on with the next request. The data will not be there until 100+ cycles later.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen The dequeue end, said 100 cycles later, is where the actual cache data array lives. Any memory returns get poked into the cache then (and it gets updated with the blend results). That data is immediately available for subsequent ROP ops that want to hit that same cache line.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen Eventually that line is evicted and the data is written back. Note that all the coalescing here happens basically implicitly purely based on "there are cache lines", and no loads or stores are ever bounced or re-played.
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen Really nice, really elegant, really efficient, but only works if the control logic on the enqueue knows for certain exactly what's going to be in said cache 100 cycles later. Hence the in-order requirement. :)
@matiasgoldberg @adamjmiles @kenpex @SebAaltonen (It's awesome! You don't even keep full addresses around! You do your tag lookup, it tells you the line index where the data is going to be in the cache when you need it as well as the number of the most recent outstanding DRAM request you depend on, and that's it.)
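A toy model of the tags-ahead-of-data idea, to make the moving parts concrete. Everything named here (`TagAheadCache`, `LINE_PIXELS`, `memory_return`) is invented for illustration. The sketch assumes an unbounded number of lines, so "eviction" only happens in a final `flush()`, and it does not model cycle counts; a real design has a fixed-size cache and has to order evictions through the same queue.

```python
from collections import deque

LINE_PIXELS = 4  # hypothetical "cache line" size in pixels


class TagAheadCache:
    def __init__(self, dram):
        self.dram = dram         # line address -> list of pixel values
        # Enqueue side: tags only, updated immediately.
        self.tags = {}           # line address -> (slot, request number)
        self.issued = 0          # DRAM reads issued so far
        self.requests = deque()  # reads in flight: (request number, slot, line)
        # Dequeue side: the data array, filled only when reads return.
        self.data = []           # slot -> pixel values (None until returned)
        self.slot_line = []      # slot -> line address, used for write-back
        self.returned = 0        # number of the latest returned read
        # In-order ops carry no full address, just slot + dependency.
        self.queue = deque()     # (slot, request number, offset, source value)

    def enqueue(self, pixel_addr, src):
        line, offset = divmod(pixel_addr, LINE_PIXELS)
        if line not in self.tags:
            # Miss: send the read and mark the line present right away.
            self.issued += 1
            slot = len(self.slot_line)
            self.slot_line.append(line)
            self.data.append(None)
            self.tags[line] = (slot, self.issued)
            self.requests.append((self.issued, slot, line))
        slot, req = self.tags[line]
        self.queue.append((slot, req, offset, src))

    def memory_return(self):
        # Reads come back in issue order; poke the data into its slot.
        if not self.requests:
            return False
        req, slot, line = self.requests.popleft()
        self.data[slot] = list(self.dram[line])
        self.returned = req
        return True

    def dequeue(self, blend):
        # Head op runs only once the read it depends on has returned.
        if not self.queue or self.queue[0][1] > self.returned:
            return False
        slot, _, offset, src = self.queue.popleft()
        self.data[slot][offset] = blend(self.data[slot][offset], src)
        return True

    def flush(self):
        # Stand-in for eviction: write every filled line back.
        for slot, line in enumerate(self.slot_line):
            if self.data[slot] is not None:
                self.dram[line] = self.data[slot]
```

Four ops touching two lines produce two reads in this model, and a second op on the same pixel sees the first op's blend result without any replay. That only works because ops are dequeued in exactly the order they were enqueued.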