Pavan Jayasinha
May 4, 10 tweets, 10 min read
I implemented an LLM end-to-end in hardware, and ran it on an FPGA.

Zero Python. Zero CUDA. Just pure SystemVerilog.

All my progress + everything I learned from 200h of LLM chip design (demo at the end)👇
Before we dive into the project log, some context:

I made this as part of a lab project for the most cracked course at Waterloo: ECE 327. Nachiket Kapre (the Prof) designed this lab.

Because of it, I went from knowing ZERO Verilog to now squashing delta‑cycle races caused by #0 delays in fork/join_any constructs.

I thought this lab was so goated that I'd make a detailed thread describing how anyone with some RTL skills could build this from absolute scratch (including deciding on the architecture / model support itself).

DISCLAIMER: this thread will be detailed and likely not make sense for those with zero RTL experience.
Phase 0: Learning SystemVerilog

To implement anything in hw, you need to know a hardware description language (HDL). Since the course used SystemVerilog, I got cracking on hdlbits.01xz.net

Hdlbits taught me the core syntax & structure of verilog:
• What is a module and how to define it (module ... endmodule).
• Input/output/inout port declaration (input, output, inout).
• Data types and constants (logic, integer, parameter).
• Basic assignment (assign for continuous combinational logic, = blocking inside combinational always blocks, <= non-blocking inside clocked always blocks).

And then, to understand the general coding patterns for digital design, I had to learn:
• Behavioral & Structural Modeling: always blocks (especially @(posedge clk) for synchronous logic). Understanding blocking vs. non-blocking assignments in sequential circuits. Module instantiation (connecting ports by name or order). Hierarchical design (modules within modules).
• Synchronous Design: Designing logic triggered by a clock edge. Understanding registers and flip-flops (how they are inferred).
• Combinational Logic: Designing logic without clock dependencies.
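The blocking vs. non-blocking point is easiest to see with a swap. Here's a toy Python sketch of the update ordering only (not real simulator semantics; both function names are made up for illustration). Non-blocking samples every right-hand side first and then updates; blocking lets each statement see the result of the one before it.

```python
def clocked_swap_nonblocking(a, b):
    """Models `a <= b; b <= a;` inside one clocked block.

    All right-hand sides are sampled before any register updates,
    so the two registers really do swap.
    """
    next_a, next_b = b, a
    return next_a, next_b


def clocked_swap_blocking(a, b):
    """Models `a = b; b = a;` inside one clocked block.

    Each statement takes effect immediately, so the second statement
    reads the already-overwritten value of `a`.
    """
    a = b
    b = a
    return a, b
```

With inputs (1, 2), the non-blocking version returns (2, 1) while the blocking version returns (2, 2), which is why sequential logic sticks to <=.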

This process gave me a good high level understanding of verilog, but to complete this lab I needed to know these crucial techniques:
• Finite State Machine (FSM) design (state registers, next-state logic, output logic).
• Pipelining concepts (inserting registers to break critical paths).
• Working with vectors/arrays (e.g., logic [N-1:0] data;, connecting to array ports).
• Basic simulation constructs ($display, understanding testbench structure).
• Grasping how signal bundles (like AXI streams with tvalid/tready/tdata/tlast) work for handshaking, even if only connecting pre-made interfaces.
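For the AXI-Stream handshake, the one rule to internalize is: a beat transfers only on a cycle where tvalid and tready are both high. Below is a minimal cycle-by-cycle Python sketch of a single link, assuming a source that always has data to send and a sink whose tready follows a repeating pattern (`Beat`, `simulate_axis`, and `ready_pattern` are hypothetical names, not from the lab).

```python
from dataclasses import dataclass


@dataclass
class Beat:
    tdata: int
    tlast: bool


def simulate_axis(packet, ready_pattern):
    """Send `packet` (list of ints) over one AXI-Stream link.

    ready_pattern: list of bools giving the sink's tready per cycle
    (it repeats). Returns (received beats, cycles elapsed).
    """
    if not any(ready_pattern):
        raise ValueError("sink never asserts tready; transfer would hang")
    received = []
    idx = 0      # next word the source wants to send
    cycle = 0
    while idx < len(packet):
        tvalid = True  # assumption: source always has data available
        tready = ready_pattern[cycle % len(ready_pattern)]
        if tvalid and tready:
            # Handshake: the beat is consumed on this cycle.
            is_last = idx == len(packet) - 1
            received.append(Beat(packet[idx], is_last))
            idx += 1
        # Otherwise the source holds tdata stable and retries next cycle.
        cycle += 1
    return received, cycle
```

With a sink that is ready every other cycle, a 3-word packet takes 5 cycles, and only the final beat carries tlast.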

Essentially, a solid foundation in digital logic design principles implemented using Verilog syntax and common structural/behavioral modeling techniques proved essential.
Phase 1: Making high-level model arch decisions - BERT transformer + int quantization

To ensure the transformer weights fit on the FPGA and are edge-friendly, the Prof chose to focus on building the chip for the I-BERT architecture. BERT is one of the first large transformers, and the 'I' stands for integer-only: the model is quantized end-to-end to INT8, so inference needs no floating-point math and the memory footprint shrinks.

Making this decision required:
• Studying the standard floating-point BERT/LLM attention mechanism.
• Researching and selecting integer quantization techniques (like those in the I-BERT paper). This involves choosing fixed-point representations or integer approximations for all operations.
• Adapting algorithms like Softmax, GELU, and Layer Normalization to use integer-only math (this was done in arxiv.org/pdf/2101.01321).
• Defining the exact mathematical operations, intermediate data types (bit widths like 8b, 16b, 32b, 64b), and precision requirements for each step shown in the diagram (e.g., the requant steps). This means implementing Python functions that produce the bit-matching expected outputs for each set of expected inputs.

Image sources: [IBERT](https://arxiv.org/pdf/2101.01321); Prof Nachiket Kapre's IBERT py system diagram
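To make the "requant" idea concrete, here's a minimal Python sketch of one common way to requantize a wide accumulator back to INT8: multiply by an integer, shift right with rounding, then saturate. This is a generic scheme, not necessarily the exact one used in the lab or the paper; `requantize`, `multiplier`, and `shift` are hypothetical names and the constants are made up.

```python
def requantize(acc: int, multiplier: int, shift: int) -> int:
    """Scale a wide integer accumulator down to a signed 8-bit value.

    The real-valued scale is approximated as multiplier / 2**shift,
    so the hardware only needs a multiply, an add, and a shift.
    Assumes shift >= 1.
    """
    if shift < 1:
        raise ValueError("shift must be at least 1")
    # Adding half of the divisor before shifting rounds to nearest.
    # Python's >> on negative ints is an arithmetic (floor) shift,
    # matching a signed right shift in hardware.
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift
    # Saturate to the INT8 range instead of wrapping around.
    return max(-128, min(127, rounded))
```

A Python function like this is what the testbench compares against, so the rounding and saturation behavior has to match the RTL bit for bit.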
Phase 1.5: Choosing the systolic array microarchitecture (Google's TPU architecture) + AXI communication interface

With the model architecture and algorithms ironed out, we can decide on a compatible hardware architecture:
• Core acceleration strategy: we chose a systolic array microarchitecture for matrix multiplication due to its efficiency on FPGAs.
• Memory Strategy: we need to determine how large matrices (weights W, intermediate results Q, K, V, S, P, C) will be stored (likely external RAM banks on the FPGA) and how they will be accessed. We also need to plan how data is partitioned across memory banks.
• Interface Definition: we must choose standard interfaces for communication between modules (AXI Stream for high-throughput data, AXI Lite for control/configuration).

This phase was already completed for me since the prof had made these high level architectural decisions to ensure the design was tractable within a semester.

Image source: [Systolic Array](https://www.researchgate.net/figure/Architecture-of-a-systolic-array-based-DNN-accelerator-that-serves-as-a-baseline-for_fig1_325861610)
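If systolic arrays are new to you, a cycle-level model helps. Below is a minimal Python sketch of an output-stationary array: A values flow right, B values flow down, and each PE does one multiply-accumulate per cycle. The dataflow choice and the input skewing are my assumptions for illustration; the lab's actual array may move data differently.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle model of an n x m grid of PEs computing A @ B.

    A is n x k, B is k x m (lists of lists of ints).
    Row i of A enters the left edge delayed by i cycles;
    column j of B enters the top edge delayed by j cycles.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A value each PE forwards right
    b_reg = [[0] * m for _ in range(n)]  # B value each PE forwards down
    for t in range(k + n + m - 2):       # cycles until the last PE finishes
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:               # left edge: inject skewed A row
                    kk = t - i
                    a_in = A[i][kk] if 0 <= kk < k else 0
                else:                    # take the left neighbor's register
                    a_in = a_reg[i][j - 1]
                if i == 0:               # top edge: inject skewed B column
                    kk = t - j
                    b_in = B[kk][j] if 0 <= kk < k else 0
                else:                    # take the upper neighbor's register
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # the MAC
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        # All registers update together, like non-blocking assignments.
        a_reg, b_reg = new_a, new_b
    return acc
```

The skew is what makes it work: PE(i, j) sees A[i][t-i-j] and B[t-i-j][j] on the same cycle t, so matching terms of the dot product always meet.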
Phase 2 (70% of the work): Implementing core computations in verilog

Now that I knew the exact algorithms the hardware was supposed to implement, I needed to translate them into functionally equivalent Verilog modules:

1. Basic Arithmetic Blocks:
-> Designed and implemented the fundamental Verilog modules: acc, mac, max, div, exp, gelu, sqrt.
-> Developed thorough Verilog testbenches (_tb.sv) for each, comparing outputs against Python reference implementations (.py) from the previous phase. Performed rigorous functional simulation (e.g., using Xilinx Simulator xsim or Verilator).

2. Systolic Array:
-> Designed the Processing Element (pe.sv), focusing on the MAC operation and data forwarding logic. Tested it thoroughly.
-> Designed the systolic.sv module, using Verilog generate statements to create the N1xN2 grid and wire the PEs together, handling input injection and result shifting. Created a specific testbench for the systolic array, feeding it matrix data and verifying the output against numpy.matmul. Simulated extensively.

3. Transformer block primitives (softmax, layer_norm):
-> Designed softmax.sv and layer_norm.sv, integrating the primitive modules from step 1. Crucially, this meant designing the necessary buffering (e.g., using sreg.sv or fifo.sv) to handle latency differences between pipelined blocks (exp) and multi-cycle/reduction blocks (max, acc, div, sqrt). Developed testbenches and simulated, comparing against Python reference implementations again.
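The buffering problem is worth a quick sketch. In softmax, each element needs the row max subtracted, but the max is only known after the whole row has streamed past, so the elements have to wait in a delay line. Here's a minimal Python model of that idea; `ShiftRegister` is my behavioral guess at what a module like sreg.sv does, not its actual code.

```python
from collections import deque


class ShiftRegister:
    """Fixed-depth delay line: what goes in comes out `depth` steps later."""

    def __init__(self, depth, fill=0):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.regs = deque([fill] * depth, maxlen=depth)

    def step(self, din):
        """One clock edge: emit the oldest value, shift in `din`."""
        dout = self.regs[0]
        self.regs.append(din)  # a full maxlen deque drops the oldest entry
        return dout


def subtract_row_max(row):
    """Stream a row in while reducing the max, then drain it as x - max."""
    delay = ShiftRegister(depth=len(row))
    running_max = row[0]
    for x in row:                 # phase 1: row streams in, max reduces
        delay.step(x)
        running_max = max(running_max, x)
    # Phase 2: delayed elements come out aligned with the finished max.
    return [delay.step(0) - running_max for _ in row]
```

The depth of the delay line has to equal the latency of the slower path exactly; off by one cycle and every element pairs with the wrong value.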

Having a waveform analysis tool and $display monitor was crucial to solving a bunch of nasty bugs during this stage. Once all functional tests passed, I was sure that the Verilog modules were equivalent to the Python kernels assuming zero wire propagation delays.

Image: adapted from Prof Nachiket Kapre's System Diagrams
Phase 3: Implementing the memory controllers/address generators + AXI interface modules

With the computational verilog kernels complete, the memory modules are left. This is the infrastructure enabling smooth data movement between modules and memory banks.

1. Memory Interface Logic Tasks:
-> Design the memory bank wrappers
-> Design the address generators for the systolic array (mem_write_A, mem_write_B, mem_read_D). This is complex, involving cascaded counters to handle matrix partitioning and data interleaving/permuting. Getting these right requires standalone tests and really understanding the data layouts.

2. AXI Interface Modules Tasks:
-> Define or adopt standard AXI Stream interface definitions (axi_stream_if.sv).
-> Develop modules to bridge between memory banks and AXI streams (s2mm.sv - Stream to Memory Map, mm2s.sv - Memory Map to Stream). These handle the handshaking and data packing/unpacking.

3. Control Logic Tasks:
-> Design top-level control FSMs or logic (control.sv, attn_head_fsm.sv) to manage the overall flow: loading weights/biases, starting computations in sequence, handling valid/ready signals between stages.

Image: address generator diagrams from Prof Nachiket Kapre
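To show what "cascaded counters" means for an address generator, here's a minimal Python sketch that walks a row-major matrix tile by tile. Each counter bumps the next one when it wraps, exactly like chained hardware counters. The tiling scheme and the name `tiled_addresses` are assumptions for illustration; the lab's mem_write_A / mem_write_B / mem_read_D layouts are more involved.

```python
def tiled_addresses(rows, cols, tile):
    """Yield row-major addresses of a rows x cols matrix, tile by tile.

    Four cascaded counters, fastest first:
      c  : column within the tile
      r  : row within the tile
      tc : tile column
      tr : tile row
    """
    if rows % tile or cols % tile:
        raise ValueError("matrix dimensions must be multiples of tile")
    c = r = tc = tr = 0
    while True:
        yield (tr * tile + r) * cols + (tc * tile + c)
        c += 1
        if c == tile:                      # c wraps, carry into r
            c = 0
            r += 1
            if r == tile:                  # r wraps, carry into tc
                r = 0
                tc += 1
                if tc == cols // tile:     # tc wraps, carry into tr
                    tc = 0
                    tr += 1
                    if tr == rows // tile:  # last counter wraps: done
                        return
```

A standalone check like comparing this sequence against the RTL's address trace catches most layout bugs before they ever reach the systolic array.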
Phase 4: Stitching everything together (top-level assembly)

With all of the core verilog modules built, now comes stitching them together into higher level usable modules like mm (matmul), mm_gelu (matmul gelu fusion), attn_head (full self attention computation), and then instantiating these higher level modules to define the chip layout with necessary clock generators.

1. Sub-System Assembly:
-> Create wrapper modules like mm.sv (combining systolic with memory interfaces and address generators), mm_ln.sv (combining mm, requant, mat_add, layer_norm_top), mm_gelu.sv (combining mm, requant, gelu_top), attn_head.sv (integrating multiple mm instances, requant, softmax_top, concat).
-> Connect these sub-systems using the defined AXI interfaces. Pay close attention to matching data widths and AXI signal connections (tdata, tvalid, tready, tlast).

2. Top-Level Verilog Design (ibert_axi.v / _on_axis.v):
-> Instantiate all the major sub-systems (attn_head, mm_ln, mm_gelu, etc., depending on the final target).
-> Instantiate clock generators, reset logic.

3. Constraint Files (clock.xdc): Define clock frequencies and pin assignments for the target PYNQ board.
Phase 5: Design verification (Synthesis + Post-Implementation simulations)

Even though this is the last phase, in actual development, testing and design verification is done alongside developing each module. It's the most important stage to verify your HDL performs as expected.

Here's what design verification means:
- After writing the Verilog for a module, you (1) synthesize it, which means generating a circuit of gates (from an available set of primitives), and then (2) implement it, which means placing and routing these gates on a die or, in the case of an FPGA, laying out the LUTs and DSPs physically on the chip.
- After each stage, a new granularity of testing is unlocked: (1) functional verification is the fastest simulation and checks that you're generating the correct waveforms by running the Verilog module assuming zero propagation delay, (2) post-synthesis simulation checks that your waveforms still match the expected ones once you account for the delay models of each gate/component, and (3) post-implementation simulation is the slowest: it runs the final netlist after place-and-route, so it covers the entire chip including gate/component + routing delays.

So this is the workflow I had to go through for doing DV in this project:
1. Code Check (Lint): Catching Verilog errors early.
2. Basic Logic Check (Functional Simulation): Does it do the right math?
3. Hardware Translation Check (Synthesis/Implementation): Does it map correctly to the FPGA?
4. Post-Hardware Mapping Check (Post-Synth/Impl Simulation): Did the mapping break anything? Does it do the right math accounting for gate and routing delays?
5. Speed Check (Timing Analysis): Does the critical path meet timing at the target clock frequency?
6. Efficiency Check (Utilization Analysis): Does it use FPGA resources well?
7. Real-World Check (Board Deployment): Does it work on the actual chip?
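For the speed check in step 5, the core arithmetic is simple. Here's a simplified Python sketch, assuming a single clock and ignoring setup time, clock skew, and uncertainty (real timing reports account for all of these); `timing_summary` and the numbers in the note below are hypothetical.

```python
def timing_summary(clock_period_ns, critical_path_ns):
    """Simplified timing check for one clock domain.

    slack > 0 means the slowest register-to-register path finishes
    before the next clock edge; negative slack means timing fails.
    """
    slack_ns = clock_period_ns - critical_path_ns
    return {
        "slack_ns": slack_ns,
        "meets_timing": slack_ns >= 0,
        # Fastest clock this critical path could support, in MHz.
        "fmax_mhz": 1000.0 / critical_path_ns,
    }
```

For example, a 10 ns clock (100 MHz) with an 8 ns critical path leaves 2 ns of slack and could in principle run at up to 125 MHz. When slack goes negative, that's the cue to pipeline the offending path.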
Finally, here's the culmination of this lab project - feeding a prompt into the I-BERT model deployed natively on the FPGA running tokens in, tokens out!
