I've spent the past ~2 weeks trying to make a chip from scratch with no prior experience. It's been an incredible source of learning so far.
Progress tracker in thread (coolest stuff at the end)
Step 1: Learning the fundamentals of chip architecture
I started by learning how a chip works all the way from binary to C.
This part was critical.
In order to design a chip, you need a strong understanding of all the architecture fundamentals as you constantly work with logic, gates, memory, etc.
I reviewed the entire stack:
> Binary - using voltage to encode data
> Transistors - using semi-conductors to create a digital switch
> CMOS - using transistors to build the first energy efficient inverter
> Gates - using transistors to compute higher level boolean logic
> Combinatorial Logic - using gates to build boolean logic circuits
> Sequential Logic - using combinatorial logic to persist data
> Memory - using sequential logic to create storage systems for data
> CPU - combining memory & combinatorial logic to create the Von Neumann architecture, a Turing-complete machine (ignoring memory constraints)
> Machine Language - how instructions in program memory map to control signals across the CPU
> Assembly - how assembly maps directly to the CPU
> C - how C compiles down into assembly and then machine language
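Not from my original notes, but here's a tiny Python sketch of the intuition that made this stack click for me: everything above the transistor is just composition. Gates build from NAND, combinatorial logic (a half-adder) builds from gates, and sequential logic (a latch) adds stored state on top.

```python
# Toy sketch: composing the stack in Python. Gates from NAND, then
# combinatorial logic (a half-adder) and sequential logic (a gated D-latch).

def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))
def xor_(a, b): return and_(or_(a, b), nand(a, b))

def half_adder(a, b):
    """Combinatorial logic: outputs depend only on the current inputs."""
    return xor_(a, b), and_(a, b)   # (sum, carry)

class DLatch:
    """Sequential logic: output also depends on stored state."""
    def __init__(self):
        self.q = 0
    def tick(self, d, enable):
        if enable:
            self.q = d
        return self.q

print(half_adder(1, 1))                    # (0, 1) -> 1 + 1 = 0b10
latch = DLatch()
print(latch.tick(1, 1), latch.tick(0, 0))  # 1 1 -> the latch holds its value
```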
Coming from a software background, fully understanding the connection between each of these layers in depth unlocked so many intuitions for me.
Planning to write a post about the full stack of compute soon (including all these layers + the relevant layers of chip fabrication & design)
Step 2: Learning the fundamentals of chip fabrication
Next, I learned about how transistors are actually fabricated.
Chip design tools are all built around specific fabrication processes (called process nodes), so I needed to understand this to fully grasp the chip design flow.
I focused on learning about:
> Materials - There are a huge number of materials required for semiconductor fabrication, including semi-conductors, etchants, solvents, etc. each with specific qualities that merit their use.
> Wafer Preparation - Creating single-crystal silicon wafers (grown from purified polysilicon) and "growing" silicon dioxide layers on them
> Patterning - The 10 step process using layering (oxidation/layer deposition/metallization), photo-lithography, and etching to create the actual transistor patterns on the chip
> Packaging - Packaging the chips in protective covers to prevent corruption, create I/O interfaces, help w/ heat dissipation, etc.
> Contamination - It was interesting to learn how big a focus contamination control is, and how much more critical it becomes as transistor sizes shrink
Within each of these topics, there's so much depth to go into - each part of the process has many different approaches, each using different materials & machines.
I focused more on getting a broad understanding of the important parts of the process.
The most important intuition here is that chips are produced by defining the layout of different layers.
The design of these layers is what you produce as the output of the chip design (EDA) process.
Step 3: Starting electronic design automation by making a CMOS inverter, layer-by-layer
The CMOS inverter is the fundamental structure that enabled digital computation to take off because of its unique energy efficiency
Drawing each layer of the CMOS manually made the design of a transistor much clearer.
The common explanation of transistors in computer architecture is heavily over-simplified, whereas designing the actual transistor layer-by-layer forced me into the real implementation details.
Looking at the voltage and current diagram (right) vs. the equivalent for an individual nMOS transistor also made the power efficiency gains of the CMOS much more obvious.
In the picture below, each color specifies a different layer, each made with different materials/ions/etc. and created in a different step of the fabrication process. For example, the red poly-silicon layer is the actual GATE for the top nMOS and bottom pMOS transistors, and the light blue metal 1 layer forms the actual connections to input & output.
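To make the "complementary" part concrete, here's a small behavioral sketch of the inverter in Python (my own illustration, not a layout or circuit model): when the input is low the pMOS pulls the output up to VDD, when it's high the nMOS pulls it down to GND, and in either steady state there's no direct path from VDD to GND - which is exactly where the energy efficiency comes from.

```python
# Behavioral sketch of a CMOS inverter (illustration only, not a circuit model).
# Exactly one of the two transistors conducts in steady state, so there's no
# static current path between VDD and GND.

def cmos_inverter(v_in: int) -> dict:
    pmos_on = (v_in == 0)   # pMOS conducts when its gate is low -> pulls output to VDD
    nmos_on = (v_in == 1)   # nMOS conducts when its gate is high -> pulls output to GND
    return {
        "out": 1 if pmos_on else 0,
        "static_current_path": pmos_on and nmos_on,  # never True in steady state
    }

for v in (0, 1):
    print(v, cmos_inverter(v))
# 0 {'out': 1, 'static_current_path': False}
# 1 {'out': 0, 'static_current_path': False}
```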
Step 4: Creating my first full circuit in Verilog
This part was a cool unlock for me - my first experience with programming hardware using software.
I made my first circuit with the hardware description language (HDL) Verilog.
I made an RGB mixer circuit that converts signals from 3 rotating dials into pulses for 3 LEDs.
You can use HDLs to specify individual gates, but thankfully most foundries ship libraries of standard cells for customers to use.
A standard cell is just an arrangement of transistors for a common use (like an AND gate) that's heavily optimized for efficiency and designed for a specific foundry's fabrication process.
These standard cell libraries have almost all of the logic units you'd actually need for most designs, so you don't need to dip too much into the gate level.
I created this circuit with the standard cell library for the Skywater 130nm process node (a specific fabrication process from a foundry called Skywater).
I know code is mostly useless to look at, but wanted to include it for anyone curious. The timing diagram shows 3 different knobs being turned and the corresponding LEDs being turned on.
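The real design is in Verilog, but here's a rough Python behavioral model of the idea (names and structure are my own simplification, not the actual module): each dial nudges a duty-cycle register up or down, and a free-running counter compares against it to generate the PWM pulse for that LED.

```python
# Rough behavioral model of the RGB mixer idea (my own sketch, not the Verilog):
# each dial adjusts a duty cycle, and a free-running counter turns that duty
# cycle into a PWM signal for its LED.

class PwmChannel:
    def __init__(self, width: int = 8):
        self.max = (1 << width) - 1
        self.duty = 0
        self.counter = 0

    def turn_dial(self, delta: int):
        self.duty = max(0, min(self.max, self.duty + delta))

    def tick(self) -> int:
        """One clock cycle: returns 1 if the LED is on during this cycle."""
        self.counter = (self.counter + 1) & self.max
        return 1 if self.counter < self.duty else 0

red, green, blue = PwmChannel(), PwmChannel(), PwmChannel()
red.turn_dial(+64)                                   # turn the red knob a quarter of the way
brightness = sum(red.tick() for _ in range(256)) / 256
print(round(brightness, 2))                          # ~0.25 -> red LED at quarter brightness
```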
Step 5: Implementing simulation & formal verification for my circuit (disclaimer: this part might be boring, skip to the end if you want)
Since the cost of bugs in hardware is far higher than in software (you can't change anything once your design is fabricated), extensive testing and formal verification are a critical part of the design process.
Throughout the EDA flow, you use:
> Static timing analysis - make sure there are no timing errors because of how signals propagate through your circuit
> Bounded model checking & k-induction - make sure that it's impossible for your design to get into certain invalid states
> You also want to make sure that it's possible for your circuit to get into specific valid states
I implemented all of these steps to formally verify that my RGB mixer circuit & other designs were valid (proper expected state transitions occur)
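In the real flow this is done with dedicated SMT-based formal tools rather than by hand, but the core idea of bounded model checking is simple enough to hand-roll as an illustration (purely a sketch, not what the actual tools do internally): explore every state the design can reach within k steps and check that no invalid state ever shows up.

```python
# Hand-rolled illustration of bounded model checking (real flows use SMT-based
# formal tools; this just shows the idea). Explore all states reachable within
# k steps and verify that no "bad" state appears.

def bounded_model_check(initial_states, transitions, is_bad, k: int) -> bool:
    """Return True if no bad state is reachable within k steps."""
    frontier = set(initial_states)
    seen = set(frontier)
    for _ in range(k):
        if any(is_bad(s) for s in frontier):
            return False
        frontier = {t for s in frontier for t in transitions(s)} - seen
        seen |= frontier
    return not any(is_bad(s) for s in frontier)

# Toy example: a counter that wraps at 2, so the "invalid" value 3 is unreachable.
ok = bounded_model_check(
    initial_states=[0],
    transitions=lambda s: [(s + 1) % 3],
    is_bad=lambda s: s == 3,
    k=10,
)
print(ok)  # True -> the bad state can't be reached within 10 steps
```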
Step 6: Designing my first full chip layout
This was the coolest part of the process so far. I used OpenLane (an open-source EDA tool) to perform the entire synthesis, optimization, and layout process on my design and come up with a complete chip design.
Just seeing my Verilog code get turned into an actual chip layout and being able to go in and play with all of the layers and click into each gate was a sick unlock.
The OpenLane flow deals with all of the following:
> Simulation - running simulation to verify that your design passes all test cases
> Synthesis - convert HDL to a netlist that shows the connections between all the gates in your design
> Optimization - optimize area, performance, and power consumption of your design
> Layout - lays out all the standard cells on the physical chip
> Wiring - connects all the components together with proper wiring
> Verification - runs formal verification on your end design
> GDS - creates the final output files, called GDSII files, which specify the exact layers to be sent to a foundry for tape-out
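To make "netlist" a bit less abstract: here's roughly what synthesis conceptually produces, sketched as a Python structure (my own simplified illustration - the real output is a gate-level Verilog netlist/DEF, and the cell/pin names here are just examples in the SKY130 naming style).

```python
# Simplified picture of what a synthesized netlist contains (illustration only):
# standard-cell instances plus the nets wiring their pins together.

netlist = {
    "cells": {
        "u1": {"type": "sky130_fd_sc_hd__and2_1", "pins": {"A": "n_a", "B": "n_b", "X": "n_1"}},
        "u2": {"type": "sky130_fd_sc_hd__inv_1",  "pins": {"A": "n_1", "Y": "led_r"}},
    },
    "nets": ["n_a", "n_b", "n_1", "led_r"],
}

# Trivial sanity check: every pin must connect to a declared net.
for name, cell in netlist["cells"].items():
    for pin, net in cell["pins"].items():
        assert net in netlist["nets"], f"{name}.{pin} connects to undeclared net {net}"
print(f"{len(netlist['cells'])} cells, {len(netlist['nets'])} nets - all pins connected")
```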
Here's me playing with my design in the EDA tool
> I can zoom in and look at the individual cells and transistors
> I can selectively hide different metal layers to get a sense of how everything is connected
> I can view energy density, component density, etc. on my design
Step 7: Reverse-engineering and designing a GPU from scratch
My initial goal for my project was to build a minimal GPU. I didn't realize how hard that was going to be.
My expectation was that building a GPU was going to be similar to building a CPU, where there are a ton of learning resources online to figure out how to do it.
I was wrong.
Because GPU companies are all trying to keep their secrets from each other, most GPU architecture details are proprietary and closed source.
NVIDIA and AMD release high-level architecture overviews, but leave the details of how their GPUs work at a low level completely undocumented.
This makes things way more fun for me - I basically have a few high-level architecture docs + some attempts at making an open-source GPU design, and zero public learning resources about GPU architectures.
From this, I've been trying to reverse engineer the details of how a GPU architecture works (of course at a much simpler level) based on what I know they do + what has to be true.
Claude Opus has been a huge help here. I've been proposing my ideas for how each unit must work, and somehow (through inference from what it knows, or training on proprietary data) it guides me toward the right implementation approaches, which I can then confirm against open-source repos. If I search for some of these things publicly, nothing shows up - a testament to how well hidden the implementation details are.
So TL;DR - I'm still building a minimal GPU design. I'm also going to document how everything works & make a post about it so it's more clear for anyone else who gets curious.
Will be shipping this in the next few days, and will probably send a cut down version to be taped out on the Skywater 130nm process node.
Very excited for this project!
I posted my full learning plan earlier for anyone curious
I've spent the past ~6 weeks going through the entire history of robotics and understanding all the core research breakthroughs.
It has completely changed my beliefs about the future of humanoid robotics.
My process + biggest takeaways in thread (full resource at the end)
Step #1: Learning the fundamentals of robotics from papers
Over the past 2 years, $100Ms have been deployed into the robotics industry to fund the humanoid arms race.
From twitter hype, public sentiment, and recent demos, it seemed to me that fully autonomous general-purpose robotics was right around the corner (~2-3 years away).
With this in mind, I decided to learn the fundamentals of robotics directly from the primary source: the series of research papers that have gotten us to the current wave of humanoid robotics.
As I got farther into the research, I realized that it pointed to a very different future, and potentially much longer timelines, than what current narratives may suggest (discussed in detail in the full resource at the end).
I initially focused on learning about the following topics/trail of breakthroughs that led us to where we are today:
(the repository at the end of the thread covers all of these + more in greater detail)
Classical Robotics
> SLAM - Simultaneous localization and mapping systems allow robots to convert various sensor data into a 3D map of the environment and understand their place within it.
> Hierarchical Task Planning - Early robotic planning systems used hierarchical/symbolic models to organize the series of tasks to execute.
> Path Planning - Sampling based path-finding algorithms used to find best-effort paths that avoid collisions in high-dimensional environments.
> Forward Kinematics/Dynamics - Using physics models to predict where the robot will move given specific actuator inputs.
> Inverse Kinematics/Dynamics - Using physics models to predict how to control actuators given a target position.
> Contact Modeling - Modeling the precise friction and torque forces at contact points to understand when objects are fully controlled (form closure) and how robotic movement will manipulate objects.
> Zero-Moment Point (ZMP) - The point where all the forces around a robot's foot cancel each other, used in calculations to enable robot locomotion and balance.
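Forward kinematics is the easiest of these to see concretely. Here's a toy example (a planar 2-link arm, entirely my own illustration): given the joint angles, the end-effector position falls straight out of trigonometry - inverse kinematics is the much harder problem of going the other way.

```python
import math

# Toy forward kinematics for a planar 2-link arm (my own illustration):
# given joint angles, compute where the end effector ends up.

def forward_kinematics(theta1: float, theta2: float, l1: float = 1.0, l2: float = 1.0):
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

# Both joints at 90 degrees: the first link points straight up,
# the second folds back to the left.
print(forward_kinematics(math.pi / 2, math.pi / 2))  # ~(-1.0, 1.0)
```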
Deep Reinforcement Learning
> Deep Q-Networks - The earliest successful deep RL algorithms that first rose to popularity with the success of DQN on Atari.
> A3C/GAE - Allow RL systems to learn from long-horizon rewards, critical for most robotic tasks with delayed feedback.
> TRPO/PPO - Introduced an effective method for step-sizing parameter updates, which is especially critical for training RL policies.
> DDPG/SAC - Introduced more sample-efficient RL algorithms that could reuse data multiple times, valuable for training real-world robotic control policies where data collection is expensive
> Curiosity - Using curiosity as a reward signal to train RL systems in complex environments, which has proven valuable for recent robotic foundational models.
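PPO, for example, boils down to one clipped objective that keeps each policy update close to the previous policy. A minimal NumPy sketch of that loss (illustration only, not a full training loop):

```python
import numpy as np

# Minimal sketch of PPO's clipped surrogate objective (illustration only).
# ratio = pi_new(a|s) / pi_old(a|s); advantages would come from something like GAE.

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps: float = 0.2) -> float:
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate it to get a loss.
    return float(-np.mean(np.minimum(unclipped, clipped)))

new_logp = np.array([-0.9, -1.1, -0.3])
old_logp = np.array([-1.0, -1.0, -1.0])
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_loss(new_logp, old_logp, advantages))
```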
Simulation & Imitation Learning
> MuJoCo - Simulation software built specifically for the needs of robotics that introduced more accurate joint and contact modeling. Enabled a series of research breakthroughs in training robots in simulation.
> Domain Randomization - Randomizing objects, textures, lighting, and other environmental conditions in simulation to get robots to generalize to the complexity of real world environments.
> Dynamics Randomization - Randomizing the laws of physics in simulation to teach robots to treat the real world as just another random physics engine to generalize to, bridging the simulation-to-reality gap.
> Simulation Optimization - Optimizing the specific domain/dynamics randomization levels to enable robot policies to train quickly in simulation.
> Behavior Cloning - Learning control policies by trying to clone the behavior of a human demonstrator. This approach has now given rise to tele-operated robotics.
> Dataset Aggregation (DAgger) - Using a data-collection loop between RL algorithms and experts to create more robust datasets for imitation learning.
> Inverse Reinforcement Learning (IRL) - Learning RL policies from demonstrators by trying to infer their reward function (trying to understand what the demonstrator's goals are).
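The domain/dynamics randomization idea is simple enough to sketch in a few lines (the parameter names and ranges below are made up for illustration, not taken from any paper's code): every training episode, resample the simulator's physics so the policy can't overfit to one particular version of reality.

```python
import random

# Sketch of dynamics randomization (parameter names/ranges are illustrative):
# resample the physics every episode so the policy treats the real world as
# just one more sample from the training distribution.

def randomized_physics():
    return {
        "friction":     random.uniform(0.5, 1.5),   # surface friction coefficient
        "object_mass":  random.uniform(0.8, 1.2),   # mass multiplier around nominal
        "motor_torque": random.uniform(0.9, 1.1),   # actuator strength multiplier
        "latency_ms":   random.uniform(0.0, 40.0),  # control-loop delay
    }

def train(policy_update, env_step, episodes: int = 3, steps: int = 5):
    for ep in range(episodes):
        physics = randomized_physics()              # new physics every episode
        for t in range(steps):
            obs, reward = env_step(physics, t)
            policy_update(obs, reward)
        print(f"episode {ep}: friction={physics['friction']:.2f}")

# Stand-in environment and policy so the sketch runs end-to-end.
train(policy_update=lambda obs, r: None,
      env_step=lambda physics, t: (t * physics["friction"], 0.0))
```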
Generalization & Robotic Transformers
> End-to-end Learning - Researchers started to train robotic control policies with a single integrated visual and motor system, rather than with separate components. This started the end-to-end integration trend that has continued today.
> Tele-operation (BC-Z, ALoHa) - Cheaper tele-operation hardware and better training methods enabled the first capable robots trained with large volumes of data from human demonstrators operating robots. This is now the most widely used method for training frontier robotic control systems.
> Robotic Transformer (RT1) - The first successful transformer based robotics model that used a series of images and a text prompt passed to a transformer to output action tokens that could operate actuators.
> Grounding Language Models (SayCan) - Trained an LLM to understand a robot's capabilities, allowing it to carry out robotic task planning based on images and text that was grounded in reality.
> Action Chunking Transformer (ACT) - A transformer based robotic architecture that introduces "action chunking" allowing the model to predict the next series of actions instead of just a single time step. This enabled far smoother motor control.
> Vision-Language-Action Models (RT2) - Used modern vision-language models (VLMs) pre-trained on internet-scale data and fine-tuned them with robotic action data to achieve state of the art results. Arguably the most impactful milestone in recent robotics research.
> Cross-embodiment (PI0) - Training robotics models that work on a variety of different hardware systems, allowing the model to generalize beyond an individual robot to understand broad manipulation. Combined ACT + VLA + diffusion into a single SOTA model.
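Action chunking is the easiest of these to show in miniature (shapes and numbers below are purely illustrative, not the real ACT/PI0 models): instead of predicting one action per forward pass, the policy predicts a short sequence of future actions and executes the whole chunk before re-planning, which smooths the motion and cuts how often you need to run the model.

```python
import numpy as np

# Miniature illustration of action chunking (not the real ACT/PI0 code):
# the policy predicts a chunk of K future actions per inference call, and the
# controller executes the whole chunk before querying the policy again.

CHUNK, ACTION_DIM = 8, 6     # e.g. 8 future timesteps, 6 joint targets each

def fake_policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer policy: returns a (CHUNK, ACTION_DIM) chunk."""
    rng = np.random.default_rng(abs(int(observation.sum() * 1000)) % (2**32))
    return rng.normal(size=(CHUNK, ACTION_DIM))

def control_loop(steps: int = 32):
    obs = np.zeros(4)
    for step in range(steps):
        if step % CHUNK == 0:                 # only re-plan every CHUNK steps
            chunk = fake_policy(obs)
        action = chunk[step % CHUNK]
        obs = obs + 0.01 * action[:4]         # toy "environment" update
    print(f"{steps} actions executed with {steps // CHUNK} policy calls")

control_loop()   # 32 actions executed with 4 policy calls
```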
Note: I didn't focus on robotic hardware in this tweet because most recent progress in robotics has been on the software side, but the hardware aspects are covered in the full repository at the end of the thread.
Step #2: Building core intuitions from each paper
I initially focused on building the core intuitions for the innovations introduced by each paper.
First, I tried to build an intuition for the fundamental goals and challenges of the robotics problem.
I saw that at the simplest level, robots convert ideas into actions.
In order to accomplish this, robotic systems need to: 1. Observe and understand the state of their environment 2. Plan what actions they need to take to accomplish their goals 3. Know how to physically execute these actions with their hardware
These requirements cover the 3 essential functions of all robotic systems: 1. Perception 2. Planning 3. Control
The entire history of research breakthroughs in robotics falls into innovations in these key areas.
Though we might expect planning to be the most difficult of these problems since it requires complex high-level reasoning, it turns out that this is actually the easiest of the problems and is largely solved.
Meanwhile, robotic control is by far the hardest of these problems, due to the complexity of the real-world and the difficulty of predicting physical interactions.
In fact, developing generally-capable robotic control systems is currently the largest barrier to robotics progress.
With this in mind, I focused on understanding the phases of progress in the robotic control problem.
1. Classical Control
Initial classical approaches to control used manually programmed physics models.
These models constantly fell short of general capabilities due to the inability to factor in the massive number of un-modeled effects (like variable object textures & friction, external forces, and other variance).
This failure of classical control systems resembled the failure of early manual feature engineering approaches in machine learning, which were later replaced by deep learning.
In general, most classical approaches to robotics have long become obsolete, with the exception of algorithmic approaches to SLAM and locomotion (like ZMP), which are still heavily used in modern state-of-the-art robots.
2. Deep Reinforcement Learning
With deep reinforcement learning models surpassing human capabilities in games like Atari, Go, and Dota 2 in the 2010s, robotics researchers hoped to apply these advancements to robotics.
This progress was particularly promising for robotics because robotic control is essentially a reinforcement learning problem: the robot (agent) needs to learn to take actions in an environment (to control its actuators) to maximize reward (effectively executing planned actions).
The development of modern reinforcement learning algorithms like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) provided optimization methods with sufficient speed and sample efficiency to effectively train robots (often in simulation).
This brought a new wave of deep learning based robotic control that demonstrated impressive generalization capabilities that far surpassed classical control methods, especially in problems like locomotion and dexterous manipulation.
3. Simulation & Imitation Learning
At the same time as these innovations in deep RL algorithms, progress in simulation & imitation learning also enabled better robotic control policies.
Specifically, the development of the MuJoCo simulator built for robotics (with higher-accuracy physics computations) and methods like domain/dynamics randomization helped overcome the simulation-to-reality transfer problem, where control policies learn to do well in simulation by exploiting its inaccuracies and then fail when used in the real world.
Additionally, imitation learning methods like behavior cloning and inverse reinforcement learning enabled robotics control based on expert demonstrations.
4. Generalization & Robotic Transformers
Most recently, we have applied learnings from the success of training LLMs to robotics to yield a new wave of frontier robotics models.
Specifically, we have started to train large transformers on internet-scale data and tune them specifically to the robotics problem. This has led to state-of-the-art results.
For example, the vision-language-action (VLA) model proposed by RT2 took a vision-language model (multi-modal LLM) pre-trained on internet data and fine-tuned it to understand robotic control.
This system was able to use the high-level reasoning and visual understanding capabilities from the VLMs and apply it to robotics, effectively solving the robotic planning problem and adding impressive generalization capabilities to robotic control.
It's hard to overestimate how much value VLMs have brought to robotic planning and reasoning capabilities; this has been a major unlock on the path toward general-purpose robotics.
At this point, all frontier robotics models are using a combination of the VLA + ACT architecture with their own combination of internet and manually collected data, with pi0 representing the most impressive publicly released model today with frontier generalization capabilities.
I've spent the past ~4 weeks understanding how the energy industry will define our future.
I synthesized everything I learned into a single resource.
My biggest takeaways in thread + complete resource at the end
Part 1: Understanding the fundamentals of energy
My goal was to understand the full picture of energy, which is rarely covered in one place because of how big the industry is.
I thought this would be particularly interesting given the recent focus on energy as a primary constraint limiting progress in deep learning.
I wanted to go deep on:
1. the fundamentals of energy physics
2. how energy usage has shaped human civilization
3. modern energy usage
4. energy production
5. energy distribution
6. energy storage
7. how energy relates to geopolitics
8. current trends in the energy industry
9. how energy relates to the future of technology
I started by focusing on energy physics to understand:
> What actually is energy?
> How do energy exchanges define what we can do?
> How do the laws of thermodynamics limit our energy consumption?
I learned that energy is much more fundamental and intuitive than I expected:
> Energy is simply the ability to cause change in the universe
> All energy is a result of particles interacting with the 4 fundamental forces
> Energy exchange is really a change in the types of things carrying energy
> The amount of useful energy in the universe is always decreasing
> Energy is the fundamental constraint on economic growth, to a degree I didn't expect
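One concrete anchor for "useful energy is always decreasing" (my own addition - this is just standard thermodynamics, not something from the resource): the second law caps how much of any heat flow can ever be turned into useful work.

```latex
% Carnot limit: no heat engine operating between a hot reservoir at T_h and a
% cold reservoir at T_c can convert heat to work more efficiently than
\eta_{\max} = 1 - \frac{T_c}{T_h}
% e.g. a steam turbine with T_h = 800\,\mathrm{K}, T_c = 300\,\mathrm{K} is capped at
\eta_{\max} = 1 - \frac{300}{800} = 62.5\%
% and real plants fall well short of even that.
```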
This gave me a much clearer framework to contextualize the rest of my deep dive.
Part 2: Understanding humanity's energy needs
Next, I focused on understanding how our energy usage has changed over the course of human history.
This made it clear how critical energy has been to the development of modern civilization.
It became clear that all of human history can be framed as the result of a series of energy transitions:
> Fire - burning biomass allowed us to cook foods, expanding the energy sources that we could consume
> Agriculture - farming, irrigation, and animal labor allowed us to develop a consistent energy source for the first time. this let us settle down from being hunter gatherers and created an energy surplus for the first time in history. as a result, agrarian society emerged.
> Specialization - the energy surplus created by agriculture is what enabled people to spend their time on things other than looking for energy. people started to specialize, and trade emerged.
> Transportation - we harnessed energy from the wind (sailing) and animals (horses) to move around the world. this enabled broader trade that allowed market societies to emerge.
> Coal & the steam engine - coal provided the first abundant & dense energy source good enough to support the function of machines. this is what triggered the industrial revolution.
> Electricity - the creation of power plants and grids allowed the instant transport of energy, electrifying society. this enabled the light bulb, refrigerator, and eventually the semiconductor that led to the information age and modern computing
Each of these transitions can be viewed as the result of our increasing ability to: 1. capture more energy 2. do more with less energy
Next, I looked at the technologies that enable us to use energy to do everything we need in modern society.
It became clear that all energy usage can actually be reduced to just 4 fundamental energy needs: 1. Heating 2. Lighting 3. Movement 4. Computation
All other energy usage is actually made up of a combination of these needs.
Finally, I looked into global energy consumption to understand the needs that our modern energy systems have to serve.
The most useful stat on modern energy consumption is that humanity requires ~18 terawatts (TW) of power (energy in joules required per second) to function.
This provides valuable context to evaluate modern energy production methods based on their ability to fulfill this need.
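To make that number tangible, here's the back-of-the-envelope arithmetic (my own, assuming roughly 8 billion people):

```python
# Back-of-the-envelope scale of the ~18 TW figure (my own arithmetic).

power_tw = 18                          # terawatts of continuous power
seconds_per_year = 365.25 * 24 * 3600
world_population = 8e9                 # rough 2020s figure (assumption)

energy_per_year_j = power_tw * 1e12 * seconds_per_year
per_capita_kw = power_tw * 1e12 / world_population / 1e3

print(f"~{energy_per_year_j:.1e} J per year")   # ~5.7e+20 J, i.e. ~570 exajoules
print(f"~{per_capita_kw:.2f} kW per person")    # ~2.25 kW running continuously, per person
```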
I've spent the past ~3 weeks going through the entire history of deep learning and reimplementing all the core breakthroughs.
It has completely changed my beliefs about deep learning progress and where we're headed.
Progress tracker in thread (all resources at the end)
Step #1: Learning the fundamentals of deep learning from papers
I wanted to learn about the fundamentals of deep learning directly from the source of progress: the critical papers that have gotten us from simple feed-forward networks to models like GPT-4o.
I suspected that this would show me broader trends and intuitions that aren't obvious when learning about AI through popular courses, textbooks, or public narratives.
This approach turned out to be critical.
I focused on learning about the following trail of breakthroughs that led us to where we are today:
(the repo later in this thread includes my in-depth explanations of core intuitions, math, and implementations (when relevant) for each of these, for anyone curious)
Early Neural Networks & CNNs
> Backpropagation - The foundational algorithm that made it possible to train deep networks with gradient descent
> LeNet - An early convolutional neural net that showed signs of beating traditional ML models at digit recognition
> AlexNet - Completely changed the history of deep learning and brought new focus onto the field by beating the state-of-the-art for image classification. This is where the broader community started taking deep learning seriously.
> U-Net - An effective image-to-image architecture based on the CNN that's now used in all diffusion models
Optimization & Regularization
> Weight Decay - The earliest improvement to make models generalize by penalizing them for large weights
> ReLU - Game-changing activation function that enabled sparse representations in neural networks for the first time
> Residuals - Solved the vanishing and exploding gradient problems, enabling deeper networks
> Dropout - Solved regularization by forcing neurons to learn robust representations (via blocking the effects of random neurons during training)
> BatchNorm - Solved the "internal covariate shift" problem which also enabled deeper networks
> LayerNorm - Made BatchNorm usable for sequential models
> GELU - A modern activation function merging the value of ReLU & Dropout and used in most models today
> Adam - Added momentum and per-parameter adaptive learning rates to stochastic gradient descent to make models converge faster
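Adam in particular is easy to write down end-to-end. Here's a minimal NumPy sketch of the update rule (an illustration of the algorithm from the paper, not any framework's implementation):

```python
import numpy as np

# Minimal Adam update (illustrative): momentum (first moment) plus a
# per-parameter scale (second moment), with bias correction.

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2         # second moment (per-parameter scale)
    m_hat = m / (1 - b1**t)                 # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize a toy quadratic loss ||w||^2 to see it converge.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 101):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(w)   # both parameters head toward 0
```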
Sequence Modeling
> RNN - Introduced the idea of sequence-modeling, which started the path that led us to the transformer
> LSTM - Made RNNs actually useful by introducing "gated" memory to learn long-term relationships between inputs
> The Forget Gate - Added the ability for LSTMs to "learn to forget" which made them capable of processing long sequences of text
> Word2Vec (& Phrase2Vec) - Introduced the first popular text embedding models, starting the trend that led us to the creation of CLIP
> Encoder-Decoder & Seq2Seq - Powerful text models built on RNNs and LSTMs (for machine translation) that directly set the stage for the transformer
> Attention - The core inductive bias behind transformers. It was initially built on-top of RNN/LSTM based models. Hence, "attention is all you need" showed that you could remove everything else
> Mixture of Experts - The first effective implementation of "conditional computation" for neural networks that led to one of the advancements behind GPT-4
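Scaled dot-product attention itself is only a few lines, which was a surprise when I first saw it. A NumPy sketch of the core computation (single head, no masking, purely illustrative):

```python
import numpy as np

# Scaled dot-product attention, single head, no masking (illustrative sketch).
# Each output position is a weighted average of the values, where the weights
# come from how well that position's query matches every key.

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # (seq, d_v)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)   # (4, 8)
```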
Transformers
> Transformer - The critical paper that completely changed the history of deep learning again, introducing an architecture capable of learning complex relationships & (importantly) highly parallelizable in training.
> BERT (& RoBERTa) - The first model to successfully execute the pre-training & fine-tuning paradigm, showing us what transformers were capable of
> T5 - Introduced the idea of the general "text-to-text" learning task that now underlies all LLMs
> GPT-2 & GPT-3 - No explanation needed. Most interesting here was their hard bet on the scaling laws (before they were consensus) and being right.
> LoRA - An efficient method for fine-tuning models (which also showed us something interesting about the low intrinsic rank of weight updates)
> RLHF & InstructGPT - GPT-3 didn't really reach the mainstream until the creation of ChatGPT, enabled by the successful fine-tuning of an "assistant mode" introduced by these papers
> Vision Transformer - Introduced the ability for transformers to process images in "patches" which became critical for multi-modality
Image Generation
> GAN - The first effective approach to image synthesis, using the game-theoretic "adversarial" optimization of a generator and discriminator network
> VAE (& VQ-VAE, VQ-VAE-2) - Probabilistic approach to image synthesis that constrains the model to form low-dimensional representations of images, forcing the separation of high-level features and details
> Diffusion (& Denoising Diffusion, etc.) - Enabled the best current state-of-the-art image synthesis
> CLIP - The embedding model that first introduced the possibility for multi-modality, by compressing understanding of images and captions into a single representation space
> DALL-E (& DALL-E 2) - Building on VAEs, CLIP, and diffusion models to create state-of-the-art controlled image synthesis models
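For anyone who wants the 30-second version of diffusion before tackling the papers: the forward process just mixes an image with Gaussian noise according to a schedule, and the model is trained to predict that noise so it can be removed step by step. A NumPy sketch of the closed-form forward (noising) step, DDPM-style and purely illustrative:

```python
import numpy as np

# DDPM-style forward (noising) process, sketched in NumPy (illustrative).
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
# The model's whole job is to predict `noise` given x_t and t.

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def noise_image(x0: np.ndarray, t: int, rng):
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * noise
    return x_t, noise                     # (noisy image, training target)

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                      # stand-in "image"
x_mid, _ = noise_image(x0, t=500, rng=rng)
x_end, _ = noise_image(x0, t=999, rng=rng)
print(round(float(x_mid.mean()), 2), round(float(x_end.mean()), 2))
# by t=999, alpha_bar is ~0, so x_t is essentially pure noise
```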
Step #2: Building core intuitions from each paper
I started by trying to understand the core intuitions and math for each paper.
Going through the early CNN, optimization, and regularization papers, this process was straightforward.
Each of these papers builds directly on top of the core of DNNs, and shows empirically the approaches that solved specific problems with scaling neural networks. Assuming a strong fundamental understanding of backpropagation, they were mostly intuitive.
Specifically, the framework of thinking about each advancement in terms of how it affects gradient flow in a neural network was particularly effective.
The specific math behind RNNs & LSTMs was a bit more challenging (it took some time to fully understand how gradient flow is manipulated by the LSTM gates), but aside from that, the sequence modeling and transformer sections were also intuitive.
Many of the advancements in transformers after the original Attention Is All You Need paper are about modifying training objectives, small implementation details, and just scaling up the models.
However, when I got to the generative models section, I got hit with a completely new level of difficulty.
Getting through the papers for Variational Auto Encoders and Diffusion models was brutal. Diffusion alone took me a few days to fully wrap my head around all the math (especially the equations in the original thermo diffusion & denoising diffusion papers).
Because these models draw their inspiration from thermodynamics (Langevin dynamics), they deal with concepts far more complex than the rest of the deep learning papers.
It was painful getting through this part, but felt great at the end when I was finally able to grasp the math.
I've spent the past ~2 weeks building a GPU from scratch with no prior experience. It was way harder than I expected.
Progress tracker in thread (coolest stuff at the end)
Step 1: Learning the fundamentals of GPU architectures
I started by trying to understand how modern GPUs function down to the architecture level.
This was already harder than I anticipated - GPUs are proprietary tech, so there are few detailed learning resources online.
I started out trying to understand the GPU software pattern by learning about NVIDIA's CUDA framework.
This helped me understand the Single Instruction Multiple Data (SIMD) programming pattern used to write GPU programs called kernels.
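The kernel mental model is worth a tiny sketch. Here Python stands in for CUDA (purely illustrative, the names are made up): every thread runs exactly the same code and uses its thread index to pick which element of the data it works on.

```python
import numpy as np

# Python standing in for a CUDA-style kernel (illustration of the SIMD/SIMT
# pattern, not real GPU code): every "thread" runs the same function and uses
# its thread id to select its own slice of the data.

def vector_add_kernel(thread_id: int, a, b, out):
    # Same instructions for every thread; only the data index differs.
    if thread_id < len(out):          # guard against extra threads in the last block
        out[thread_id] = a[thread_id] + b[thread_id]

def launch(kernel, num_threads: int, *args):
    for tid in range(num_threads):    # a real GPU runs these in parallel
        kernel(tid, *args)

n = 10
a, b, out = np.arange(n), np.arange(n) * 10, np.zeros(n, dtype=int)
launch(vector_add_kernel, 16, a, b, out)   # launch 16 threads for 10 elements
print(out)                                 # [ 0 11 22 33 44 55 66 77 88 99]
```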
With this context, I dove into learning about the core elements of GPUs:
> Global Memory - external memory that stores data & programs; accessing it is a huge bottleneck & constraint on GPU programming
> Compute Cores - the main compute units that execute kernel code in different threads in parallel
> Layered Caches - caches to minimize global memory access
> Memory Controllers - handle throttling requests to global memory
> Dispatcher - the main control unit of the GPU that distributes threads to available resources for execution
And then within each compute core, I learned about the main units:
> Registers - dedicated space to store data for each thread.
> Local/Shared Memory - memory shared between threads to pass data around to each other
> Load-Store Unit (LSU) - used to store/load data from global memory
> Compute Units - ALUs, SFUs, specialized graphics hardware, etc. to perform computations on register values
> Scheduler - manages resources in each core and plans when instructions from different threads get executed - much of GPU complexity lies here.
> Fetcher - retrieves instructions from program memory
> Decoder - decodes instructions into control signals
This process gave me a good high-level understanding of the different units in modern GPUs.
But with so much complexity, I knew I had to cut down the GPU to the essentials for my own design or else my project would be extremely bloated.
Step 2: Creating my own GPU architecture
Next, I started to create my own GPU architecture based on what I learned.
My goal was to create a minimal GPU that highlights the core concepts of GPUs and removes unnecessary complexities, so others could learn about GPUs more easily.
Designing my own architecture was an incredible exercise in deciding what really matters.
I went through several iterations of my architecture throughout this process as I learned more by building.
I decided to highlight the following in my design:
> Parallelization - How is the SIMD pattern implemented in hardware?
> Memory Access - How do GPUs handle the challenges of accessing lots of data from slow & limited bandwidth memory?
> Resource Management - How do GPUs maximize resource utilization & efficiency?
I wanted to highlight the broader use cases of GPUs for general-purpose parallel computing (GPGPU) & ML, so I decided to focus on the core functionality rather than graphics-specific hardware.
After many iterations, I finally landed on the following architecture, which I implemented in my actual GPU (everything is in its simplest form here)