Tiago Ramalho
Jun 16, 2022
The next big breakthrough in AI will come from hardware, not software.

Training giant models like PaLM already requires thousands of chips consuming several megawatts, and we will probably want to keep scaling these up by several more orders of magnitude. How can we do it? A 🧵
All computations done by a neural network are ultimately a series of floating point operations.

To do a floating point operation, two (or three, in the case of a fused multiply-add) numbers need to be loaded from memory into a circuit that performs the calculation, and the result needs to be stored back in memory.
This style of computing is called a von Neumann machine.
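As a minimal sketch of that load-compute-store loop (the function and variable names are mine, not from the thread), here is what computing one neuron's pre-activation looks like on a von Neumann machine: every multiply-accumulate has to fetch its operands from memory first.

```c
#include <stddef.h>

/* Minimal sketch of the von Neumann load-compute-store cycle.
 * Every multiply-accumulate fetches its operands from memory;
 * only the running sum stays in a register. */
float dot_product(const float *weights, const float *activations, size_t n)
{
    float acc = 0.0f;                 /* lives in a register            */
    for (size_t i = 0; i < n; i++) {
        float w = weights[i];         /* load from memory               */
        float x = activations[i];     /* load from memory               */
        acc += w * x;                 /* the actual floating point op   */
    }
    return acc;                       /* result written back by caller  */
}
```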
Loading those numbers from memory costs an extraordinary amount of energy compared to doing the actual computation itself (roughly 1000x).

Electrons need to flow across wires several centimetres long, dissipating energy as heat all along the way.
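To put rough numbers on that gap, here is a back-of-envelope sketch. The energy constants are assumed, commonly cited order-of-magnitude figures (a few picojoules per 32-bit floating point op versus hundreds of picojoules per off-chip DRAM access), not values from the thread.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, order-of-magnitude energy costs (not measured values) */
    const double pj_per_flop        = 4.0;    /* one 32-bit multiply-add    */
    const double pj_per_dram_access = 640.0;  /* one 32-bit off-chip access */

    /* A dot product of length n with no data reuse:
     * one FLOP per element, but two operand loads per element. */
    const double n          = 1e6;
    const double compute_pj = n * pj_per_flop;
    const double memory_pj  = 2.0 * n * pj_per_dram_access;

    printf("compute: %.2e pJ, memory: %.2e pJ, memory/compute: %.0fx\n",
           compute_pj, memory_pj, memory_pj / compute_pj);
    return 0;
}
```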
A big advantage of this architecture is that it allows us to do very general types of computations, which is what spurred the great IT revolution of the last few decades. Digitalization can be applied to basically anything we can think of.
But such a general architecture will necessarily not be the most optimized one for any specific type of computation.

If we know in advance which floating point numbers we want to multiply, it doesn't make much sense to store them in a large memory pool far, far away.
The first step to optimizing this computation is to reuse the same values as often as possible, as is done in GPUs, which can load a number once and reuse it across several computation units.
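A sketch of the reuse idea in plain C, standing in for what a GPU does with registers and shared memory (the tile size and names are illustrative): each value brought in from memory gets used many times before it is evicted.

```c
#define TILE 32  /* illustrative block size, chosen to fit fast on-chip memory */

/* Blocked matrix multiply C += A * B for n x n row-major matrices
 * (n assumed divisible by TILE). Each element of A is loaded once per
 * block and reused TILE times in the inner loop -- the same trick GPUs
 * play with shared memory and registers. */
void matmul_blocked(int n, const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < n; i0 += TILE)
        for (int j0 = 0; j0 < n; j0 += TILE)
            for (int k0 = 0; k0 < n; k0 += TILE)
                for (int i = i0; i < i0 + TILE; i++)
                    for (int k = k0; k < k0 + TILE; k++) {
                        float a = A[i * n + k];               /* loaded once...      */
                        for (int j = j0; j < j0 + TILE; j++)
                            C[i * n + j] += a * B[k * n + j]; /* ...reused TILE times */
                    }
}
```

The more often each loaded value is reused, the less memory traffic is needed per floating point operation.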
Systolic arrays are even more efficient, mapping the matrix-matrix multiplication directly onto hardware (as in TPUs and Tensor Cores) and allowing us to reuse parameters as much as possible.
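Here is a rough software model of the systolic timing, not actual TPU code: at clock cycle t, the processing element at position (i, j) receives exactly one operand pair from its neighbours, multiplies, and accumulates into a result that never leaves that element.

```c
#define N 4  /* toy array size; real systolic arrays are e.g. 128 x 128 */

/* Output-stationary systolic matmul C = A * B, modelled cycle by cycle.
 * In hardware, A streams in from the left and B from the top, skewed in
 * time so that the pair (A[i][k], B[k][j]) with k = t - i - j reaches
 * PE(i, j) at cycle t. Each PE does at most one multiply-accumulate per
 * cycle, using values handed over by its neighbours rather than fetched
 * from a distant memory. */
void systolic_matmul(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;

    for (int t = 0; t < 3 * N - 2; t++)        /* clock cycles */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;             /* which operand pair arrives now */
                if (k >= 0 && k < N)
                    C[i][j] += A[i][k] * B[k][j];
            }
}
```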
But these architectures still rely on keeping the parameters in an external memory pool. That means a lot of time is spent waiting for numbers to move back and forth between the compute chip and the memory chips.
To make memory transfers faster, we need to increase the memory clock speed, which increases energy consumption even more.

(See the HBM3 memory stacks sitting next to the H100 die: this chip consumes over 700 W.)
Some new startups (Graphcore, Cerebras) are designing processing chips with memory and compute mixed together on the same die. This reduces the memory transfer overhead to some extent.
However, our brains are way more efficient! The “parameters” in a biological neural network are represented by the strength of synaptic connections, and the computation is done directly on the neuron that receives the signal. Memory and computation happen in the same unit.
Some memory manufacturers are working on in-memory processing systems to truly mix computing and memory in a way that’s much closer to the neuronal processing model.
The tradeoff is that the computations that can be implemented are less general, and to make these chips profitable there needs to be a large, established use case for them.
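One way to picture compute-in-memory is an analog crossbar: the weights are stored as conductances in the memory cells themselves, input voltages drive the rows, and the current summed on each column is the output (Ohm's and Kirchhoff's laws do the multiply-accumulate). A hedged software model of that idea, with illustrative names:

```c
#define ROWS 4
#define COLS 3

/* Sketch of an analog crossbar computing a matrix-vector product in place.
 * conductance[i][j] is the stored weight (the memory cell), voltage[i] is
 * the input activation, and column_current[j] is the output. In a real
 * device the column sum happens physically, in one shot, with no weight
 * ever being shipped to a separate compute unit. */
void crossbar_mvm(const float conductance[ROWS][COLS],
                  const float voltage[ROWS],
                  float column_current[COLS])
{
    for (int j = 0; j < COLS; j++) {
        column_current[j] = 0.0f;
        for (int i = 0; i < ROWS; i++)
            column_current[j] += voltage[i] * conductance[i][j];  /* I += V * G */
    }
}
```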
At this point, neural network architectures are still evolving quickly, which makes it hard for these manufacturers to commit to a single design to mass-produce. So these kinds of architectures may take a while to mature.
Besides memory locality, there are other factors that could improve the efficiency of AI algorithms by orders of magnitude, such as sparse computation and analog processing. I'll cover those later. End of 🧵


Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(