1/11 Ever wondered how supercomputers multiply? Efficient multiplication is a critical part of traditional high-performance computing as well as machine learning. Let's look at the #IBM Blue Gene/Q, which led the #Top500 and #Green500 in the early 2010s. en.wikipedia.org/wiki/IBM_Blue_…
2/11 IBM open sourced the core of Blue Gene/Q (the @OpenPOWERorg A2I), so it offers a fantastic insight into a CPU that taped out at frequency and in volume. The #VHDL for the core is available at: github.com/OpenPOWERFound…
3/11 We quickly see that it uses a few well-established techniques for optimising multiplication. The first step uses Booth encoding: en.wikipedia.org/wiki/Booth%27s…
4/11 This trick reduces the number of partial products by about half. Here we can see the VHDL for the Booth encoder: github.com/openpower-core…
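To see where the "about half" comes from, here is a minimal Python sketch. It assumes radix-4 (modified) Booth recoding, which is the common variant that halves the partial product count; the function names are made up for illustration and this is not a transcription of the A2I VHDL.

```python
def booth_radix4_digits(y, bits):
    """Recode a signed, `bits`-wide multiplier into digits in {-2, -1, 0, 1, 2}.

    Digit k has weight 4**k, so there are bits/2 digits instead of `bits` bits.
    """
    assert bits % 2 == 0, "this sketch assumes an even operand width"
    y &= (1 << bits) - 1              # two's complement bit pattern
    digits = []
    prev = 0                          # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits):
    """Multiply via Booth digits; returns (product, number of partial products)."""
    partial_products = [d * x * 4**k
                        for k, d in enumerate(booth_radix4_digits(y, bits))]
    return sum(partial_products), len(partial_products)
```

For an 8-bit multiplier this yields 4 partial products rather than 8, and each one is just 0, ±x or ±2x (a shift and an optional negate), which is cheap in hardware.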
5/11 The next step is to sum these partial products together. Blue Gene/Q uses a Wallace tree multiplier for this: en.wikipedia.org/wiki/Wallace_t…
6/11 Some ASCII art showing one of the steps of the Wallace tree multiplier in the VHDL is at: github.com/openpower-core…
It shows 9 partial products being reduced by three sets of adders, each set producing one row of sum bits and one row of carry bits (6 rows in total).
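A rough Python model of that one step, treating each partial product row as a non-negative integer. The helper names are hypothetical and the grouping is the textbook one, not necessarily the exact wiring in the VHDL.

```python
def compress_3_2(a, b, c):
    """A row of full adders: three rows in, one sum row and one carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move one column left
    return sum_row, carry_row


def reduce_layer(rows):
    """One Wallace layer: every full group of three rows becomes two rows."""
    out = []
    grouped = len(rows) - len(rows) % 3
    for i in range(0, grouped, 3):
        out.extend(compress_3_2(*rows[i:i + 3]))
    out.extend(rows[grouped:])        # leftover rows pass through unchanged
    return out
```

No carry ever ripples sideways inside a layer, so the delay of one layer is a single full adder regardless of operand width. The total of all rows is preserved at every step.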
7/11 Wallace tree reduction is repeated until there are only two partial products left. These have to be added together, and Blue Gene/Q uses a carry-select adder for this: en.wikipedia.org/wiki/Carry-sel…
8/11 The 64-bit adder is broken up into eight 8-bit chunks. Each chunk calculates the answer for both a carry-in of 0 and a carry-in of 1, and a mux is used to select which one is right.
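As a behavioural sketch in Python (the function name and defaults are mine, chosen to match the 64-bit, 8-bit-chunk description above; hardware computes both candidates for every chunk in parallel, the loop here only models the mux chain):

```python
def carry_select_add(a, b, width=64, chunk=8):
    """Add two unsigned `width`-bit values; returns (sum, carry_out)."""
    mask = (1 << chunk) - 1
    result = 0
    carry = 0
    for shift in range(0, width, chunk):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y                  # candidate assuming carry-in = 0
        sum1 = x + y + 1              # candidate assuming carry-in = 1
        chosen = sum1 if carry else sum0   # the mux, driven by the real carry
        result |= (chosen & mask) << shift
        carry = chosen >> chunk       # carry-out selects the next chunk's result
    return result, carry
```

The real carry only has to pass through one mux per chunk instead of eight full adders, at the cost of duplicating the adder logic in each chunk.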
9/11 Carry-select adders are one of many techniques that keep the carry chain from becoming a timing problem. A naive implementation of an adder (a ripple-carry adder) propagates the carry through the entire adder, which for a 64-bit adder means 64 full adder delays in the worst case.
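For comparison, a small Python sketch of the naive ripple-carry adder. The names are hypothetical, and the stage count is only a stand-in for the length of the carry path, not a timing model.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, width=64):
    """Returns (sum, carry_out, full adder stages the carry passes through)."""
    result, carry, stages = 0, 0, 0
    for i in range(width):
        # Bit i cannot settle until the carry from bit i-1 has arrived.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
        stages += 1
    return result, carry, stages
```

Adding 1 to an all-ones value is the classic worst case: the carry generated at bit 0 has to travel through every one of the 64 stages.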
10/11 This is only a basic overview and there's much more I've left out. For example, in the Wallace tree reduction step, instead of using only full adders (also called 3:2 compressors because each takes 3 input bits and produces 2 output bits), Blue Gene/Q sometimes uses 4:2 compressors.
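A 4:2 compressor can be modelled in Python as two cascaded 3:2 stages. This is the textbook construction with a made-up function name; real designs often use an optimised cell, so treat it as an assumption rather than what the A2I actually instantiates.

```python
def compress_4_2(a, b, c, d):
    """Reduce four rows (non-negative ints) to a sum row and a carry row."""
    # First 3:2 stage on a, b, c.
    s1 = a ^ b ^ c
    c1 = ((a & b) | (a & c) | (b & c)) << 1
    # Second 3:2 stage on s1, d and the shifted carry from the first stage.
    # In hardware that shifted carry is the cout -> cin link to the next column.
    sum_row = s1 ^ d ^ c1
    carry_row = ((s1 & d) | (s1 & c1) | (d & c1)) << 1
    return sum_row, carry_row
```

Four rows become two in one cell, which makes the tree more regular to lay out than a pure 3:2 tree.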
11/11 There's also evidence it might use a combination of custom cells and/or custom circuit design to help with layout density and timing.