adammaj
research @openai // on leave @penn

Apr 25, 2024, 11 tweets

I've spent the past ~2 weeks building a GPU from scratch with no prior experience. It was way harder than I expected.

Progress tracker in thread (coolest stuff at the end)👇

Step 1 ✅: Learning the fundamentals of GPU architectures

I started by trying to understand how modern GPUs function down to the architecture level.

This was already harder than I anticipated - GPUs are proprietary tech, so there are few detailed learning resources online.

I started out trying to understand the GPU software pattern by learning about NVIDIA's CUDA framework.

This helped me understand the Single Instruction, Multiple Data (SIMD) programming pattern used to write GPU programs called kernels.
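
As a rough illustration (plain Python, not the actual CUDA or tiny-gpu code), the SIMD idea boils down to this: every thread runs the same kernel code, and only its thread index differs.

```python
# Minimal illustration of the SIMD idea behind kernels: every "thread" runs
# the exact same function, and only its index differs.

def add_kernel(thread_idx, a, b, out):
    # Each thread handles exactly one element, selected by its index.
    out[thread_idx] = a[thread_idx] + b[thread_idx]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
out = [0] * 8

# On a GPU these 8 "threads" would run in parallel; here we just loop.
for thread_idx in range(8):
    add_kernel(thread_idx, a, b, out)

print(out)  # [9, 9, 9, 9, 9, 9, 9, 9]
```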

With this context, I dove into learning about the core elements of GPUs:
> Global Memory - external memory that stores data & programs; accessing it is a huge bottleneck & constraint on GPU programming
> Compute Cores - the main compute units that execute kernel code in different threads in parallel
> Layered Caches - caches to minimize global memory access
> Memory Controllers - handle throttling requests to global memory
> Dispatcher - the main control unit of the GPU that distributes threads to available resources for execution
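
To picture how the dispatcher and compute cores fit together, here's a toy Python sketch (the group sizes, names, and instant completion are simplifications for illustration, not the actual design):

```python
# Rough mental model of the dispatcher: carve a kernel launch into groups of
# threads and hand each group to the next free compute core.

from collections import deque

def dispatch(total_threads, threads_per_core, num_cores):
    groups = [
        list(range(start, min(start + threads_per_core, total_threads)))
        for start in range(0, total_threads, threads_per_core)
    ]
    free_cores = deque(range(num_cores))
    for group in groups:
        core = free_cores.popleft()          # grab a free core
        print(f"core {core} runs threads {group}")
        free_cores.append(core)              # toy model: core finishes instantly

dispatch(total_threads=8, threads_per_core=4, num_cores=2)
# core 0 runs threads [0, 1, 2, 3]
# core 1 runs threads [4, 5, 6, 7]
```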

And then within each compute core, I learned about the main units:
> Registers - dedicated space to store data for each thread.
> Local/Shared Memory - memory shared between threads to pass data around to each other
> Load-Store Unit (LSU) - used to store/load data from global memory
> Compute Units - ALUs, SFUs, specialized graphics hardware, etc. to perform computations on register values
> Scheduler - manages resources in each core and plans when instructions from different threads get executed - much of GPU complexity lies here.
> Fetcher - retrieves instructions from program memory
> Decoder - decodes instructions into control signals
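
Putting those units together, one compute core can be pictured roughly like this toy Python model (illustrative only - the real core is Verilog RTL): a shared fetcher/decoder and PC, with a separate register file per thread and the same instruction executed across all threads each step.

```python
# Toy model of one compute core: a shared PC and program, but a separate
# register file per thread, with every thread executing the same instruction.

class ToyCore:
    def __init__(self, num_threads, num_registers=16):
        self.pc = 0
        self.registers = [[0] * num_registers for _ in range(num_threads)]

    def step(self, program):
        op, rd, rs, rt = program[self.pc]        # fetch + decode once per core
        for regs in self.registers:              # ...but execute per thread
            if op == "ADD":                      # ALU work on register values
                regs[rd] = regs[rs] + regs[rt]
        self.pc += 1                             # advance to the next instruction

core = ToyCore(num_threads=4)
for t, regs in enumerate(core.registers):
    regs[1], regs[2] = t, 10                     # give each thread its own data
core.step([("ADD", 0, 1, 2)])                    # all 4 threads run the same ADD
print([regs[0] for regs in core.registers])      # [10, 11, 12, 13]
```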

This process gave me a good high-level understanding of the different units in modern GPUs.

But with so much complexity, I knew I had to cut down the GPU to the essentials for my own design or else my project would be extremely bloated.

Step 2 ✅: Creating my own GPU architecture

Next, I started to create my own GPU architecture based on what I learned.

My goal was to create a minimal GPU that highlights the core concepts of GPUs and removes unnecessary complexity, so others could learn about GPUs more easily.

Designing my own architecture was an incredible exercise in deciding what really matters.

I went through several iterations of my architecture throughout this process as I learned more by building.

I decided to highlight the following in my design:
> Parallelization - How is the SIMD pattern implemented in hardware?
> Memory Access - How do GPUs handle the challenges of accessing lots of data from slow & limited bandwidth memory?
> Resource Management - How do GPUs maximize resource utilization & efficiency?

I wanted to highlight the broader use-cases of GPUs for general-purpose parallel computing (GPGPU) & ML, so I decided to focus on the core functionality rather than graphics-specific hardware.

After many iterations, I finally landed on the following architecture that I implemented in my actual GPU (everything is in its simplest form here).

Step 3 ✅: Writing a custom assembly language for my GPU

One of the most critical elements was that my GPU could actually execute kernels written with the SIMD programming pattern.

In order to make this possible, I had to design my own Instruction Set Architecture (ISA) for my GPU that I could use to write kernels.

To enable this, I made my own small 11-instruction ISA inspired by the LC4 ISA, allowing me to write some simple matrix math kernels as a proof of concept.

I landed on the following instructions:
> NOP - Classic no-op instruction that just increments the PC
> BRnzp - Branching instruction using an NZP register to enable conditional statements and loops
> CMP - Comparison instruction to set the NZP register for later use by the BRnzp instruction
> ADD, SUB, DIV, MUL - Basic arithmetic instructions to enable simple tensor computations.
> STR/LDR - Store/load data in global data memory to access initial data and store results.
> CONST - Load constant values into registers for convenience
> RET - Signal that a thread has completed execution.

Below is the complete table of the ISA I came up with including the exact structure of each instruction.
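
The table itself lives in the image and the repo. As a hedged illustration of how compact such an ISA can be, here's a Python sketch of decoding a 16-bit instruction - the 4-bit opcode values and register field positions below are assumptions for illustration, not necessarily the real encoding.

```python
# Hedged sketch of decoding a 16-bit instruction: a 4-bit opcode in the top
# bits plus 4-bit register fields. The opcode values and field layout here
# are assumptions - the real encoding is defined in the ISA table in the repo.

OPCODES = {
    0b0000: "NOP",  0b0001: "BRnzp", 0b0010: "CMP",  0b0011: "ADD",
    0b0100: "SUB",  0b0101: "MUL",   0b0110: "DIV",  0b0111: "LDR",
    0b1000: "STR",  0b1001: "CONST", 0b1111: "RET",
}

def decode(instruction):
    opcode = (instruction >> 12) & 0xF   # top 4 bits select the operation
    rd     = (instruction >> 8)  & 0xF   # destination register
    rs     = (instruction >> 4)  & 0xF   # source register 1
    rt     = instruction         & 0xF   # source register 2 / immediate bits
    return OPCODES.get(opcode, "???"), rd, rs, rt

print(decode(0b0011_0001_0010_0011))     # ('ADD', 1, 2, 3)
```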

Step 4 ✅: Writing matrix math kernels using my ISA

Now that I had my own ISA, I wrote 2 matrix math kernels to run on my GPU.

Each kernel specifies the matrices to manipulate, the number of threads to launch, and code to execute in each thread.

My matrix addition kernel adds two 1x8 matrices using 8 threads and demonstrates the use of the SIMD pattern, some basic arithmetic instructions, and the load/store functionality.

My matrix multiplication kernel multiplies two 2x2 matrices using 4 threads and additionally demonstrates branching and loops.
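
Stripped of the assembly, the per-thread logic these two kernels implement is roughly the following Python (the real kernels express this with the 11-instruction ISA above; the exact thread indexing and register usage live in the repo):

```python
# Per-thread logic of the two kernels, written out in Python for clarity.

def matadd_thread(i, A, B, C):
    # 8 threads, one element of the 1x8 result each.
    C[i] = A[i] + B[i]

def matmul_thread(i, A, B, C, N=2):
    # 4 threads, one element of the 2x2 result each (row-major storage).
    row, col = i // N, i % N
    acc = 0
    for k in range(N):            # this loop is what CMP/BRnzp implement in the ISA
        acc += A[row * N + k] * B[k * N + col]
    C[i] = acc

A, B, C = [1, 2, 3, 4], [5, 6, 7, 8], [0] * 4
for i in range(4):                # the GPU runs these 4 "threads" in parallel
    matmul_thread(i, A, B, C)
print(C)  # [19, 22, 43, 50] == [[1,2],[3,4]] @ [[5,6],[7,8]], flattened
```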

Demonstrating matrix math functionality was critical since modern GPU use-cases in both graphics and machine learning revolve heavily around matrix computations (granted, with far more complex kernels).

Below are the kernels I wrote for matrix addition and multiplication.

Step 5 ✅: Building my GPU in Verilog & running my kernels

After designing everything I needed to, I finally started building my GPU design in Verilog.

This was by far the hardest part. I ran into so many issues, learned the hard way, and rewrote my code several times.

Rewrite 1:
I initially implemented global memory as SRAM (synchronous).

I ran into @realGeorgeHotz who gave me feedback that this defeats the entire purpose of building a GPU - the biggest design challenge of GPUs is managing the latencies of accessing async memory (DRAM) w/ limited bandwidth.

So, I ended up rebuilding my design using external async memory instead & eventually realized I also needed to add memory controllers.
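
Conceptually, the memory controller's job looks something like this Python sketch: many LSUs issue requests, but only a few can be in flight to slow external memory at once, so the rest wait in a queue (the queue depth and latency numbers here are made up).

```python
# Rough sketch of a memory controller: many LSUs issue requests, only a
# limited number can be in flight to slow external memory at once, and the
# rest wait in a queue. Queue depth and latency are illustrative.

from collections import deque

class ToyMemoryController:
    def __init__(self, max_in_flight=2, latency_cycles=4):
        self.queue = deque()          # requests waiting for a channel
        self.in_flight = []           # (remaining_cycles, request) pairs
        self.max_in_flight = max_in_flight
        self.latency_cycles = latency_cycles

    def request(self, lsu_id, address):
        self.queue.append((lsu_id, address))

    def tick(self):
        # Age in-flight requests; completed ones are returned to their LSU.
        done = [req for cycles, req in self.in_flight if cycles == 1]
        self.in_flight = [(c - 1, r) for c, r in self.in_flight if c > 1]
        # Issue queued requests while channels are free.
        while self.queue and len(self.in_flight) < self.max_in_flight:
            self.in_flight.append((self.latency_cycles, self.queue.popleft()))
        return done

mc = ToyMemoryController()
for lsu in range(4):
    mc.request(lsu, address=0x10 + lsu)          # 4 LSUs all ask at once
for cycle in range(10):
    for lsu_id, addr in mc.tick():
        print(f"cycle {cycle}: data for LSU {lsu_id} (addr {addr:#x}) ready")
```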

Rewrite 2:
I initially implemented my GPU with a warp scheduler (big mistake - far too complex and unnecessary for the goals of my project).

Again feedback from @realGeorgeHotz helped me realize that this was an unnecessary complexity.

The irony is that when I first got the feedback, I didn't have enough context to fully understand it. So I spent time trying to build out a warp scheduler, and only then realized why it was a bad idea lmao.

Rewrite 3:
I didn't implement scheduling within each compute core correctly the first time around, so I had to go back and design my compute core execution in stages to get the control flow right.
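
The fix boils down to a per-instruction state machine like the Python sketch below - the stage names and the single WAIT-on-memory stall are illustrative, not necessarily the exact states in the final design.

```python
# Sketch of staged execution inside a core: each instruction moves through a
# fixed sequence of stages, and the core only leaves WAIT once outstanding
# memory requests have returned. Stage names here are illustrative.

STAGES = ["FETCH", "DECODE", "REQUEST", "WAIT", "EXECUTE", "UPDATE"]

class ToyCoreControl:
    def __init__(self, memory_latency=3):
        self.pending_cycles = 0
        self.memory_latency = memory_latency

    def run_instruction(self, is_memory_op):
        for stage in STAGES:
            if stage == "REQUEST" and is_memory_op:
                self.pending_cycles = self.memory_latency  # LSU issues its request
            elif stage == "WAIT":
                while self.pending_cycles > 0:             # stall until data returns
                    self.pending_cycles -= 1
            print(stage)  # one control state at a time keeps the flow unambiguous

ToyCoreControl().run_instruction(is_memory_op=True)
# FETCH, DECODE, REQUEST, WAIT, EXECUTE, UPDATE
```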

But, despite the difficulty, this is the step where so much of my learning really sunk into deep intuitions.

By running into issues head on, I got a much more visceral feeling for the challenges that GPUs are built to work around.

> When I ran into memory issues, I really felt why managing access from bottlenecked memory is one of the biggest constraints of GPUs.

> I discovered the need for memory controllers from first principles when my design wouldn't work because multiple LSUs tried to access memory at once, and I realized I would need a request queue system.

> As I implemented simple approaches in my dispatcher/scheduler, I saw how more advanced scheduling & resource management strategies like pipelining could optimize performance.

Below is the execution flow of a single thread I built into my GPU in Verilog - it closely resembles a CPU in its execution.

After tons of redesigns, finally running my matrix addition & multiplication kernels, seeing everything work properly, and watching my GPU output the correct results was an incredible feeling.

Here's a video of me running the matrix addition kernel on my GPU, going through the execution trace of the GPU running, and then checking out the end state in data memory where the GPU has stored the final results.

You can see individual instructions, PCs, ALU processing, register values, etc. for each thread/core on each cycle within the execution trace.

Most importantly, you can see at the start the empty addresses of the resultant matrix, and then at the end the correct values being loaded into the result matrix in data memory!

Step 6 ✅: Converting my design into a full chip layout

With my complete Verilog design, the last step was to pass my design through the EDA flow to create a finalized chip layout.

I targeted the Skywater 130nm process node for my design (the same node I made a few designs for 2 weeks ago, including one I submitted for fabrication via @matthewvenn's Tiny Tapeout 6).

This step is the reality check of designing any chip.

You may have a design that works in theory and in simulation, but converting that design into a finalized chip layout with GDS files is the real barrier to shipping your design.

I ran into several issues in this process as well - my chip didn't pass some of the Design Rule Checks (DRCs) specified by the OpenLane EDA flow I was using, and I had to rework parts of my GPU to fix them.

After some work, I finally got a hardened version of my GPU layout with the necessary GDS files for submission (displayed below).

I may also build an adapter for this design and submit it for tapeout via Tiny Tapeout 7!

Here's me playing with a cool 3D visualization of my chip design - I can zoom into different parts of the chip, isolate different metal layers, and look at individual gates & structures in the design.

Finally - check out my tiny-gpu project for all the project details!

I built tiny-gpu to create a single resource for people to learn about how GPUs work from the ground up.

I built it with <15 files of fully documented Verilog, complete documentation on the architecture & ISA, working matrix addition/multiplication kernels, and full support for kernel simulation & execution traces for anyone who wants to play with it and learn.

I'm considering writing a blog post on understanding GPU functionality & architecture from the ground up if people are curious to learn about this topic.

github.com/adam-maj/tiny-…

Here's another link to the GitHub since the above link gets cut off for some reason:

github.com/adam-maj/tiny-…
