1/11 Ever wondered how supercomputers multiply? Efficient multiplication is a critical part of traditional high performance computing as well as machine learning. Let's look at the #IBM Blue Gene/Q, which led the #Top500 and #Green500 in the early 2010s.
en.wikipedia.org/wiki/IBM_Blue_…
2/11 IBM open sourced the core of Blue Gene/Q (the @OpenPOWERorg A2I), so it offers a fantastic insight into a CPU that shipped at frequency and in volume. The #VHDL for the core is available at:
github.com/OpenPOWERFound…
3/11 We quickly see that the core uses a few well-established techniques for optimising multiplication. The first step uses Booth encoding:
en.wikipedia.org/wiki/Booth%27s…
4/11 This trick reduces the number of partial products by about half. Here we can see the VHDL for the Booth encoder:
github.com/openpower-core…
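To make the "about half" claim concrete, here is a minimal Python sketch of radix-4 (modified) Booth recoding. It is an illustration only, not a translation of the A2I VHDL; the function names and the 8 bit width are assumptions made for the example.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a `bits`-wide two's complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}, and there are bits/2 of them,
    so an N bit multiplier yields N/2 partial products instead of N.
    """
    assert bits % 2 == 0, "sketch assumes an even width"
    y = multiplier & ((1 << bits) - 1)  # two's complement bit pattern
    y <<= 1                             # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (y >> i) & 0b111      # original bits y[i+1], y[i], y[i-1]
        b_hi = (triplet >> 2) & 1
        b_mid = (triplet >> 1) & 1
        b_lo = triplet & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply a by b, where b fits in a signed `bits`-wide value."""
    digits = booth_radix4_digits(b, bits)
    # One partial product per Booth digit, weighted by 4**k.
    partial_products = [d * a * 4**k for k, d in enumerate(digits)]
    return sum(partial_products)
```

In hardware each digit only needs a shift, a negate, or a zero of the multiplicand, which is why the recoding is cheap compared with the partial products it saves.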
5/11 The next step is to sum these partial products together. Blue Gene/Q uses a Wallace tree multiplier for this:
en.wikipedia.org/wiki/Wallace_t…
6/11 Some ASCII art showing one of the steps of the Wallace tree multiplier in the VHDL is at:
github.com/openpower-core…

It shows 9 partial products being converted using three sets of adders, producing three sets of sum and carry bits.
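A minimal Python sketch of that single step, treating each partial product as a non-negative integer row. The names `carry_save_add` and `reduce_nine_rows` are hypothetical, and the real VHDL works on individual bit columns rather than whole integers.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compress three non-negative rows into a sum row and a carry row."""
    sum_row = a ^ b ^ c                             # per-bit sum, no carry propagation
    carry_row = ((a & b) | (a & c) | (b & c)) << 1  # per-bit carry, weighted one column up
    return sum_row, carry_row


def reduce_nine_rows(rows: list[int]) -> list[int]:
    """Reduce 9 rows to 6 using three independent sets of full adders."""
    if len(rows) != 9:
        raise ValueError("this sketch expects exactly 9 rows")
    out = []
    for i in range(0, 9, 3):
        out.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
    return out
```

The six output rows add up to the same total as the nine inputs, but no carry had to ripple across a row to get there.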
7/11 Wallace tree reduction is repeated until there are only two partial products left. These have to be added together and Blue Gene/Q uses a carry select adder for this:
en.wikipedia.org/wiki/Carry-sel…
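The repeated reduction in 7/11 can be sketched in Python as a loop that keeps applying 3:2 compression until two rows remain. This is a row-level simplification under my own naming (`compress_3_2`, `wallace_reduce`); a real Wallace tree works column by column and also uses half adders.

```python
def compress_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Full adders applied bitwise: three rows in, sum row and carry row out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(rows: list[int]) -> tuple[list[int], int]:
    """Reduce non-negative rows until at most two remain; also return the level count."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        next_rows = []
        grouped = len(rows) // 3 * 3
        for i in range(0, grouped, 3):
            next_rows.extend(compress_3_2(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[grouped:])  # rows without a full group pass through
        rows = next_rows
        levels += 1
    return rows, levels


def multiply_unsigned(a: int, b: int, bits: int = 9) -> int:
    """Unsigned multiply: one partial product per multiplier bit, then reduce."""
    partial_products = [(a << i) if (b >> i) & 1 else 0 for i in range(bits)]
    rows, _ = wallace_reduce(partial_products)
    return sum(rows)  # stands in for the final carry-propagating adder
```

Starting from 9 rows, the row count in this sketch goes 9, 6, 4, 3, 2, so only the last addition needs a real carry-propagating adder.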
8/11 You can see the final stage of the carry select adder in the VHDL at:
github.com/OpenPOWERFound…

The 64-bit adder is broken up into eight 8-bit chunks. Each chunk calculates the answer for both a carry in of 0 and a carry in of 1, and a mux is used to select which one is right.
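A behavioural Python sketch of that idea follows. The function name and defaults are assumptions, Python's `+` stands in for each 8 bit adder, and the loop only models the chain of muxes; in hardware all chunks compute both candidate sums at the same time.

```python
def carry_select_add(a: int, b: int, width: int = 64, chunk: int = 8) -> tuple[int, int]:
    """Add two unsigned `width`-bit values chunk by chunk; return (sum, carry_out)."""
    mask = (1 << chunk) - 1
    carry = 0
    result = 0
    for shift in range(0, width, chunk):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum_if_0 = x + y        # candidate result for carry in = 0
        sum_if_1 = x + y + 1    # candidate result for carry in = 1
        chosen = sum_if_1 if carry else sum_if_0  # the mux
        result |= (chosen & mask) << shift
        carry = chosen >> chunk  # selects the next chunk's candidate
    return result, carry
```

The carry now only has to pass through one mux per chunk rather than through every full adder in the chunk.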
9/11 Carry select adders are one of many techniques that prevent the carry chain from becoming a timing issue. A naive implementation of an adder (a ripple carry adder) propagates the carry through the entire adder, which for a 64-bit adder means 64 full adder delays.
10/11 This is a basic overview and there's much more I've left out. In the Wallace tree reduction step, instead of using only full adders (also called 3:2 compressors because they take 3 bits of input and produce 2 bits of output), Blue Gene/Q sometimes uses 4:2 compressors.
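For reference, a common textbook way to build a 4:2 compressor is from two chained full adders. The Python sketch below shows that construction at the single bit level; whether the A2I cells are built this way is an assumption on my part, and the names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor on single bits: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """4:2 compressor on single bits: returns (sum, carry, cout).

    Satisfies x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    cout depends only on x1..x3, never on cin, so chaining cout into the
    next column's cin does not create a ripple across the row.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout
```

The name "4:2" counts the four same-column inputs and the two outputs that stay in the tree; cin and cout are handed sideways between neighbouring columns.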
11/11 There's also evidence it might use a combination of custom cells and/or custom circuit design to help with layout density and timing.
