adammaj · Dec 18
I've spent the past ~6 weeks going through the entire history of robotics and understanding all the core research breakthroughs.

It has completely changed my beliefs about the future of humanoid robotics.

My process + biggest takeaways in thread (full resource at the end) 👇
Step #1 ✅: Learning the fundamentals of robotics from papers

Over the past 2 years, $100Ms have been deployed into the robotics industry to fund the humanoid arms race.

From Twitter hype, public sentiment, and recent demos, it seemed to me that fully autonomous, general-purpose robotics was right around the corner (~2-3 years away).

With this in mind, I decided to learn the fundamentals of robotics directly from the primary source: the series of research papers that have gotten us to the current wave of humanoid robotics.

As I got farther into the research, I realized that it pointed to a very different future, and potentially much longer timelines, than what current narratives may suggest (discussed in detail in the full resource at the end).

I initially focused on learning about the following topics/trail of breakthroughs that led us to where we are today:

(the repository at the end of the thread covers all of these + more in greater detail)

Classical Robotics
> SLAM - Simultaneous localization and mapping systems allow robots to convert various sensor data into a 3D map of the environment and to understand their place within it.
> Hierarchical Task Planning - Early robotic planning systems used hierarchical/symbolic models to organize the series of tasks to execute.
> Path Planning - Sampling-based path-finding algorithms used to find best-effort paths that avoid collisions in high-dimensional environments.
> Forward Kinematics/Dynamics - Using physics models to predict where the robot will move given specific actuator inputs.
> Inverse Kinematics/Dynamics - Using physics models to predict how to control actuators given a target position (see the sketch after this list).
> Contact Modeling - Modeling the precise friction and torque forces at contact points to understand when objects are fully controlled (form closure) and how robotic movement will manipulate objects.
> Zero-Moment Point (ZMP) - The point where all the forces around a robot's foot cancel each other, used in calculations to enable robot locomotion and balance.
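To make forward/inverse kinematics concrete, here's a minimal sketch for a hypothetical 2-link planar arm (link lengths, solver settings, and the target point are made up for illustration): forward kinematics maps joint angles to an end-effector position, and a simple Jacobian-based iteration solves the inverse problem numerically.

```python
import numpy as np

# Illustrative 2-link planar arm; link lengths are assumed values.
L1, L2 = 1.0, 0.8

def forward_kinematics(theta):
    """Map joint angles (theta1, theta2) to the end-effector position (x, y)."""
    t1, t2 = theta
    x = L1 * np.cos(t1) + L2 * np.cos(t1 + t2)
    y = L1 * np.sin(t1) + L2 * np.sin(t1 + t2)
    return np.array([x, y])

def jacobian(theta):
    """Partial derivatives of (x, y) with respect to the joint angles."""
    t1, t2 = theta
    return np.array([
        [-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
        [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)],
    ])

def inverse_kinematics(target, theta=np.array([0.3, 0.3]), iters=100, alpha=0.5):
    """Iteratively adjust joint angles so the end effector reaches `target`."""
    for _ in range(iters):
        error = target - forward_kinematics(theta)
        theta = theta + alpha * np.linalg.pinv(jacobian(theta)) @ error
    return theta

theta = inverse_kinematics(np.array([1.2, 0.6]))
print(forward_kinematics(theta))  # approximately [1.2, 0.6]
```

Real robots have many more joints, joint limits, and dynamics on top of this, but the shape of the problem (a forward model plus a numerical inversion) is the same.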

Deep Reinforcement Learning
> Deep Q-Networks - The earliest successful deep RL algorithms that first rose to popularity with the success of DQN on Atari.
> A3C/GAE - Allow RL systems to learn from long-horizon rewards, critical for most robotic tasks with delayed feedback.
> TRPO/PPO - Introduced an effective method for step-sizing parameter updates, which is especially critical for training RL policies (sketched below).
> DDPG/SAC - Introduced more sample-efficient RL algorithms that could reuse data multiple times, valuable for training real-world robotic control policies where data collection is expensive.
> Curiosity - Using curiosity as a reward signal to train RL systems in complex environments, which has proven valuable for recent robotic foundation models.
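To make the PPO idea concrete, here's a minimal sketch of just its clipped surrogate objective (PyTorch-style, omitting the value loss, entropy bonus, and training loop):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective in the style of the PPO paper (sketch only).

    logp_new / logp_old: log-probabilities of the taken actions under the current
    policy and the data-collecting policy; advantages: estimated advantages
    (e.g. from GAE). The clip keeps each update step small, which is the
    "step-sizing" idea mentioned above.
    """
    ratio = torch.exp(logp_new - logp_old)                         # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize surrogate -> minimize negative
```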

Simulation & Imitation Learning
> MuJoCo - Simulation software built specifically for the needs of robotics that introduced more accurate joint and contact modeling. Enabled a series of research breakthroughs in training robots in simulation.
> Domain Randomization - Randomizing objects, textures, lighting, and other environmental conditions in simulation to get robots to generalize to the complexity of real world environments.
> Dynamics Randomization - Randomizing the laws of physics in simulation so robots learn to treat the real world as just another random physics engine, bridging the simulation-to-reality gap (see the sketch after this list).
> Simulation Optimization - Optimizing the specific domain/dynamics randomization levels to enable robot policies to train quickly in simulation.
> Behavior Cloning - Learning control policies by trying to clone the behavior of a human demonstrator. This approach has now given rise to tele-operated robotics.
> Dataset Aggregation (DAgger) - Using a data-collection loop between the learned policy and experts to create more robust datasets for imitation learning.
> Inverse Reinforcement Learning (IRL) - Learning RL policies from demonstrators by trying to guess their reward function (trying to understand what the demonstrator's goals are).
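A rough sketch of what domain/dynamics randomization looks like in practice: before each simulated episode, sample the physical and visual parameters from broad ranges so the policy can't overfit to one simulator configuration. The parameter names, ranges, and the simulator API here are illustrative placeholders, not from any specific paper.

```python
import numpy as np

def randomize_episode(rng: np.random.Generator) -> dict:
    """Sample a new simulator configuration for one training episode (illustrative)."""
    return {
        # dynamics randomization: the "laws of physics" vary per episode
        "friction":        rng.uniform(0.5, 1.5),
        "object_mass":     rng.uniform(0.05, 0.5),          # kg
        "motor_gain":      rng.uniform(0.8, 1.2),
        "action_delay":    rng.integers(0, 3),               # control timesteps
        # domain randomization: the appearance of the scene varies per episode
        "light_intensity": rng.uniform(0.3, 1.0),
        "table_texture":   rng.integers(0, 100),              # index into a texture bank
        "camera_jitter":   rng.normal(0.0, 0.02, size=3),     # meters
    }

rng = np.random.default_rng(0)
for episode in range(3):
    cfg = randomize_episode(rng)
    # env.reset(**cfg)   # hypothetical simulator API
    print(cfg["friction"], cfg["object_mass"])
```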

Generalization & Robotic Transformers
> End-to-end Learning - Researchers started to train robotic control policies with a single integrated visual and motor system, rather than with separate components. This started the end-to-end integration trend that has continued today.
> Tele-operation (BC-Z, ALOHA) - Cheaper tele-operation hardware and better training methods enabled the first capable robots trained with large volumes of data from human demonstrators operating robots. This is now the most widely used method for training frontier robotic control systems.
> Robotic Transformer (RT1) - The first successful transformer-based robotics model, which passed a series of images and a text prompt to a transformer to output action tokens that could operate actuators.
> Grounding Language Models (SayCan) - Trained an LLM to understand a robot's capabilities, allowing it to carry out robotic task planning based on images and text that was grounded in reality.
> Action Chunking Transformer (ACT) - A transformer-based robotic architecture that introduced "action chunking," allowing the model to predict the next series of actions instead of just a single time step (sketched after this list). This enabled far smoother motor control.
> Vision-Language-Action Models (RT2) - Used modern vision-language models (VLMs) pre-trained on internet-scale data and fine-tuned them with robotic action data to achieve state-of-the-art results. Arguably the most impactful milestone in recent robotics research.
> Cross-embodiment (PI0) - Training robotics models that work on a variety of different hardware systems, allowing the model to generalize beyond an individual robot to understand broad manipulation. Combined ACT + VLA + diffusion into a single SOTA model.
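A rough sketch of the action-chunking control loop: instead of querying the policy once per control step, predict a chunk of the next k actions and execute them before querying again. The policy/env interfaces and chunk size here are hypothetical, and ACT's temporal ensembling of overlapping chunks is omitted.

```python
CHUNK_SIZE = 8  # k future actions predicted per policy query (illustrative)

def run_episode(policy, env, horizon=400):
    """Control loop with action chunking: one policy call per k control steps."""
    obs = env.reset()
    t = 0
    while t < horizon:
        # the policy maps the current observation to a (k, action_dim) chunk
        chunk = policy.predict_chunk(obs, k=CHUNK_SIZE)   # hypothetical API
        for action in chunk:                               # execute the whole chunk
            obs, done = env.step(action)                   # hypothetical env API
            t += 1
            if done or t >= horizon:
                return
```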

Note that I didn't focus on robotic hardware in this tweet because most recent progress in robotics has been on the software side; the hardware aspects are covered in the full repository at the end of the thread.
Step #2 ✅: Building core intuitions from each paper

I initially focused on building the core intuitions for the innovations introduced by each paper.

First, I tried to build an intuition for the fundamental goals and challenges of the robotics problem.

I saw that at the simplest level, robots convert ideas into actions.

In order to accomplish this, robotic systems need to:
1. Observe and understand the state of their environment
2. Plan what actions they need to take to accomplish their goals
3. Know how to physically execute these actions with their hardware

These requirements cover the 3 essential functions of all robotic systems:
1. Perception
2. Planning
3. Control

The entire history of research breakthroughs in robotics falls into innovations in these key areas.

Though we might expect planning to be the most difficult of these problems since it requires complex high-level reasoning, it turns out that this is actually the easiest of the problems and is largely solved.

Meanwhile, robotic control is by far the hardest of these problems, due to the complexity of the real-world and the difficulty of predicting physical interactions.

In fact, developing generally-capable robotic control systems is currently the largest barrier to robotics progress.

With this in mind, I focused on understanding the phases of progress in the robotic control problem.

1. Classical Control

Initial classical approaches to control used manually programmed physics models.

These models consistently fell short of general capabilities due to their inability to factor in the massive number of unmodeled effects (like variable object textures & friction, external forces, and other variance).

This failure of classical control systems resembled the failure of early manual feature engineering approaches in machine learning, which were later replaced by deep learning.

In general, most classical approaches to robotics have long since become obsolete, with the exception of algorithmic approaches to SLAM and locomotion (like ZMP), which are still heavily used in modern state-of-the-art robots.

2. Deep Reinforcement Learning

With deep reinforcement learning models surpassing human capabilities in games like Atari, Go, and Dota 2 in the 2010s, robotics researchers hoped to apply these advancements to robotics.

This progress was particularly promising for robotics because robotic control is essentially a reinforcement learning problem: the robot (agent) needs to learn to take actions in an environment (to control its actuators) to maximize reward (effectively executing planned actions).

The development of modern reinforcement learning algorithms like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) provided optimization methods with sufficient speed and sample efficiency to effectively train robots (often in simulation).

This brought a new wave of deep learning based robotic control that demonstrated impressive generalization capabilities that far surpassed classical control methods, especially in problems like locomotion and dexterous manipulation.

3. Simulation & Imitation Learning

At the same time as these innovations in deep RL algorithms, progress in simulation & imitation learning also enabled better robotic control policies.

Specifically, the development of the MuJoCo simulator built for robotics (with higher-accuracy physics computation) and methods like domain/dynamics randomization helped overcome the simulation-to-reality transfer problem, where control policies learn to do well in simulation by exploiting its inaccuracies and then fail when used in the real world.

Additionally, imitation learning methods like behavior cloning and inverse reinforcement learning enabled robotics control based on expert demonstrations.
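Behavior cloning in particular reduces to ordinary supervised learning: regress the expert's action from the observation. A minimal PyTorch-style sketch with placeholder network shapes and random stand-in data (not any specific paper's setup):

```python
import torch
import torch.nn as nn

# Illustrative dimensions only.
obs_dim, act_dim = 32, 7
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_update(obs_batch, expert_actions):
    """One gradient step: make the policy imitate the demonstrated action."""
    pred = policy(obs_batch)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with random stand-in data
obs = torch.randn(64, obs_dim)
act = torch.randn(64, act_dim)
print(bc_update(obs, act))
```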

4. Generalization & Robotic Transformers

Most recently, we have applied learnings from the success of training LLMs to robotics to yield a new wave of frontier robotics models.

Specifically, we have started to train large transformers on internet-scale data and tune them specifically to the robotics problem. This has led to state-of-the-art results.

For example, the vision-language-action (VLA) model proposed by RT2 used an open source vision-language model (multi-modal LLM) pre-trained on internet data and fine-tuned it to understand robotic control.

This system was able to take the high-level reasoning and visual understanding capabilities of the VLM and apply them to robotics, effectively solving the robotic planning problem and adding impressive generalization capabilities to robotic control.
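One detail worth making concrete: VLAs like RT-2 emit actions as discrete tokens by binning each continuous action dimension (256 uniform bins in RT-2's setup), so the language model can output robot actions the same way it outputs text. A rough sketch of just that discretization step (the bounds and example values are illustrative):

```python
import numpy as np

NUM_BINS = 256  # per-dimension discretization, RT-2 style

def actions_to_tokens(action, low, high):
    """Map a continuous action vector into per-dimension bin indices (tokens)."""
    normalized = (np.clip(action, low, high) - low) / (high - low)   # -> [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

def tokens_to_actions(tokens, low, high):
    """Invert the mapping: bin index -> bin-center continuous value."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

low  = np.array([-1.0, -1.0, -1.0])   # illustrative per-dimension bounds
high = np.array([ 1.0,  1.0,  1.0])
tokens = actions_to_tokens(np.array([0.1, -0.7, 0.95]), low, high)
print(tokens, tokens_to_actions(tokens, low, high))
```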

It's hard to overestimate how much value VLMs have brought to robotic planning and reasoning capabilities; this has been a major unlock on the path toward general-purpose robotics.

At this point, all frontier robotics models use some combination of the VLA + ACT architectures with their own mix of internet and manually collected data, with pi0 representing the most impressive publicly released model today, showing frontier generalization capabilities.
Step #3 ✅: Reframing the future of humanoid robotics (the most interesting part)

With all this context, it became much more clear to me what we need to accomplish in order to get to the goal of fully-autonomous general-purpose robotics.

Robotic perception and planning are largely solved problems - though there is plenty of room for improvement, modern robots demonstrate capabilities and generalization sufficient for many real-world tasks.

Meanwhile, achieving generalization in robotic control is currently the largest barrier to progress.

Current state-of-the-art robotic control systems show generalization to new objects, environments, and instructions, but they show very little ability to generalize to new manipulation skills.

This is no small problem: most real-world tasks require complex multi-step motor routines (often under-appreciated by humans because motor tasks are second-nature to us), so being unable to generalize to new manipulation skills means being unable to perform most tasks that aren't explicitly in the dataset.

Luckily, we know from recent progress in deep learning that we can just scale up our models to improve generalization.

However, applying scaling laws to robotics looks very different than with LLMs:
> In LLMs, we had the entire internet's worth of data to train on.
> Once we realized scaling laws work, we already had sufficient data and compute necessary to scale up parameters and get better models.
> In robotics, we have plenty of room to scale up compute and parameters.
> However, we are lacking the data to train on.

In fact, the current scale of data being used to train robotics models is likely orders of magnitude too small to achieve full generalization (covered more in depth in the full repository).

So how do we generate the necessary amount of data?

There are 3 approaches that are currently viable:
1. Internet Data - Repurpose current internet data for robotics training. This is hard because training usually requires data from the same camera angles, joints, and actuators as the robot being trained.
2. Simulation - Training in simulation offers massive parallelization and access to an unbounded amount of data. However, simulations currently lack the complexity that real-world data affords.
3. Real-World Data - The best current approach to collecting data with sufficient complexity is to gather it directly from the real world via tele-operation. This is exactly why most companies have opted to take this approach.

Figuring out a strategy to collect enough tele-operation data to achieve general-purpose robotics is no easy task.

It will involve addressing the following challenges (each addressed more in the full repository):
1. Economic Self-Sufficiency - In order to sustainably collect enough data to achieve generalization through tele-operation, current robotics companies will need to find a way to make data collection economically self-sufficient (using some combination of labor arbitrage + doing jobs that humans can't/won't do).
2. Data Signal - Many robotics companies are opting to collect this data through deployments in industrial settings. However, these deployments also need to collect data with sufficient signal about the real world, covering a variety of environments and scenarios. This means that robots must be deployed in contexts with sufficient variance.
3. Capital - Given the realities of current datasets and the scale of data necessary for true generalization, this approach will likely be highly capital intensive, requiring many years of spending to reach data scales necessary for truly autonomous general-purpose humanoids. This has many implications for how robotics companies must approach their data-collection (will these companies be able to sustain their burn for 5-10 years on venture capital alone?)

So getting to the promised state of humanoid robotics will require:
1. Constructing entire hardware supply chains and manufacturing processes
2. Collecting large amounts of data
3. Likely burning through capital for a long time (maybe more than 5-10 years) before really good autonomous robots are ready for the world

It is also worth noting that the internet-scale datasets that we used to create frontier generative models were all created by network effects that played out over decades and created trillions of dollars of value for the world. This made it economically feasible to generate a dataset at such a scale.

If we had tried today to directly spend capital to create a similar dataset to train LLMs, it seems farfetched that we would be able to replicate something of comparable quality to the internet.

But this is similar to what we're trying to do for robotics today, and the robotic manipulation problem appears to be orders of magnitude more complex than learning human language.

With all this context, we now have a more grounded perspective of current robotics capabilities, the challenges in the way of reaching humanoid robotics, and a more realistic timeline for when we might achieve this technology.

This is not meant to provide a pessimistic perspective on robotics, but rather is meant to provide a perspective grounded in what current research frontiers suggest.

There's still much to cover about this topic (this tweet is already way too long), but I just tried to cover the high-level details here.

For those curious to learn about this topic in-depth, I would highly recommend reading through the full post below!

In there, I go into far more depth about:
> The specific details of the research breakthroughs that have gotten us to modern robotics
> How much data do we need to achieve general-purpose robotics?
> What is the correct strategy to collect this data?
> How long will it take to collect this data?
> Who is most likely to win the humanoid arms race?
> What does the robotics problem teach us about the human brain and our biology?
> etc.
For those curious, here's the complete list of papers/links I used to learn about the history of robotics progress:

Perception
> SLAM - ieeexplore.ieee.org/stamp/stamp.js…
> SIFT - cs.ubc.ca/~lowe/papers/i…
> ORB-SLAM - arxiv.org/pdf/1502.00956
> DROID-SLAM - arxiv.org/pdf/2108.10869

Planning
> A-star - ai.stanford.edu/~nilsson/Onlin…
> PRM - cs.cmu.edu/~motionplannin…
> RRT - msl.cs.illinois.edu/~lavalle/paper…
> CHOMP - ri.cmu.edu/pub_files/2009…
> TrajOpt - roboticsproceedings.org/rss09/p31.pdf
> STRIPS - ai.stanford.edu/~nilsson/Onlin…
> Max-Q - arxiv.org/pdf/cs/9905014
> PDDL - arxiv.org/pdf/1106.4561
> ASP - cs.utexas.edu/~vl/papers/wia…
> Clingo - arxiv.org/pdf/1405.3694

Reinforcement Learning
> MDP - arxiv.org/pdf/cs/9605103
> Atari - arxiv.org/pdf/1312.5602
> A3C - arxiv.org/pdf/1602.01783
> TRPO - arxiv.org/pdf/1502.05477
> GAE - arxiv.org/pdf/1506.02438
> PPO - arxiv.org/abs/1707.06347
> DDPG - arxiv.org/pdf/1509.02971
> SAC - arxiv.org/pdf/1801.01290
> Curiosity - arxiv.org/pdf/1808.04355

Simulation
> MuJoCo - homes.cs.washington.edu/~todorov/paper…
> Domain Randomization - arxiv.org/pdf/1703.06907
> Dynamics Randomization - arxiv.org/pdf/1710.06537
> OpenAI Dexterous Manipulation - arxiv.org/pdf/1808.00177
> Simulation Optimization - arxiv.org/pdf/1810.05687

Imitation Learning
> ALVINN - proceedings.neurips.cc/paper/1988/fil…
> DAgger - arxiv.org/pdf/1011.0686
> IRL - ai.stanford.edu/~ang/papers/ic…
> GAIL - arxiv.org/pdf/1606.03476
> MAML - arxiv.org/pdf/1703.03400
> One-Shot - arxiv.org/pdf/1703.07326

Locomotion
> ZMP - researchgate.net/publication/22…
> Preview Control - researchgate.net/publication/40…
> Biped - arxiv.org/pdf/2401.16889
> Quadruped - arxiv.org/pdf/2010.11251

Generalization
> E2E - arxiv.org/pdf/1504.00702
> BC-Z - arxiv.org/pdf/2202.02005
> SayCan - arxiv.org/pdf/2204.01691
> RT1 - arxiv.org/pdf/2212.06817
> ACT - arxiv.org/pdf/2304.13705
> VLA - arxiv.org/pdf/2307.15818
> Pi0 - physicalintelligence.company/download/pi0.p…
My final output (complete synthesis + predictions)

Here's the repository of my full effort with:
> a complete in-depth synthesis of the research breakthroughs in robotics
> my perspective on the future of humanoids

github.com/adam-maj/robot…
Finally, I want to thank @BainCapVC (especially @kevinzhang, @RonMiasnik) for supporting me on my deep dives!


More from @MajmudarAdam

Oct 16
I’ve spent the past ~4 weeks understanding how the energy industry will define our future.

I synthesized everything I learned into a single resource.

My biggest takeaways in thread + complete resource at the end 👇
Part 1 ⚛: Understanding the fundamentals of energy

My goal was to understand the full picture of energy, which is rarely covered in one place because of how big the industry is.

I thought this would be particularly interesting given the recent focus on energy as a primary constraint limiting progress in deep learning.

I wanted to go deep on:
1. the fundamentals of energy physics
2. how energy usage has shaped human civilization
3. modern energy usage
4. energy production
5. energy distribution
6. energy storage
7. how energy relates to geopolitics
8. current trends in the energy industry
9. how energy relates to the future of technology

I started by focusing on energy physics to understand:
> What actually is energy?
> How do energy exchanges define what we can do?
> How do the laws of thermodynamics limit our energy consumption?

I learned that energy is much more fundamental and intuitive than I expected:
> Energy is simply the ability to cause change in the universe
> All energy is a result of particles interacting with the 4 fundamental forces
> Energy exchange is really a change in the types of things carrying energy
> The amount of useful energy in the universe is always decreasing
> Energy is the fundamental constraint on economic growth, to a degree I didn’t expect

This gave me a much clearer framework to contextualize the rest of my deep dive.
Part 2 🔥: Understanding humanity's energy needs

Next, I focused on understanding how our energy usage has changed over the course of human history.

This made it clear how critical energy has been to the development of modern civilization.

It became clear that all of human history can be framed as the result of a series of energy transitions:
> Fire - burning biomass allowed us to cook foods, expanding the energy sources that we could consume
> Agriculture - farming, irrigation, and animal labor allowed us to develop a consistent energy source for the first time. this let us settle down from being hunter gatherers and created an energy surplus for the first time in history. as a result, agrarian society emerged.
> Specialization - the energy surplus created by agriculture is what enabled people to spend their time on things other than looking for energy. people started to specialize, and trade emerged.
> Transportation - we harnessed energy from the wind (sailing) and animals (horses) to move around the world. this enabled broader trade that allowed market societies to emerge.
> Coal & the steam engine - coal provided the first abundant & dense energy source good enough to support the function of machines. this is what triggered the industrial revolution.
> Electricity - the creation of power plants and grids allowed the instant transport of energy, electrifying society. this enabled the light bulb, refrigerator, and eventually the semiconductor that led to the information age and modern computing

Each of these transitions can be viewed as the result of our increasing ability to:
1. capture more energy
2. do more with less energy

Next, I looked at the technologies that enable us to use energy to do everything we need in modern society.

It became clear that all energy usage can actually be reduced to just 4 fundamental energy needs:
1. Heating
2. Lighting
3. Movement
4. Computation

All other energy usage is actually made up of a combination of these needs.

Finally, I looked into global energy consumption to understand the needs that our modern energy systems have to serve.

The most useful stat on modern energy consumption is that humanity requires ~18 terawatts (TW) of power (energy in joules required per second) to function.

This provides valuable context to evaluate modern energy production methods based on their ability to fulfill this need.
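To make the units concrete, here's a quick back-of-the-envelope conversion of ~18 TW of continuous power into annual energy (simple arithmetic on the number above, nothing external):

```python
# Rough conversion of ~18 TW of average power into annual energy.
power_tw = 18
seconds_per_year = 365.25 * 24 * 3600            # ~3.16e7 s
joules_per_year = power_tw * 1e12 * seconds_per_year
twh_per_year = joules_per_year / 3.6e15          # 1 TWh = 3.6e15 J
print(f"{joules_per_year:.2e} J ≈ {twh_per_year:,.0f} TWh per year")
# ~5.7e20 J, i.e. on the order of 160,000 TWh of energy per year
```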
May 25
I've spent the past ~3 weeks going through the entire history of deep learning and reimplementing all the core breakthroughs.

It has completely changed my beliefs about deep learning progress and where we're headed.

Progress tracker in thread (all resources at the end) 👇
Step #1 ✅: Learning the fundamentals of deep learning from papers

I wanted to learn about the fundamentals of deep learning directly from the source of progress: the critical papers that have gotten us from simple feed-forward networks to models like GPT-4o.

I suspected that this would show me broader trends and intuitions that aren't obvious when learning about AI through popular courses, textbooks, or public narratives.

This approach turned out to be critical.

I focused on learning about the following trail of breakthroughs that led us to where we are today:

(the repo later in this thread includes my in-depth explanations of core intuitions, math, and implementations (when relevant) for each of these, for anyone curious)

Early Neural Networks & CNNs
> Backpropagation - The foundational algorithm that enabled deep learning and gradient descent
> LeNet - An early convolutional neural net that showed signs of beating traditional ML models at digit recognition
> AlexNet - Completely changed the history of deep learning and brought new focus onto the field by beating the state-of-the-art for image classification. This is where the broader community started taking deep learning seriously.
> U-Net - An effective image-to-image architecture based on the CNN that's now used in all diffusion models

Optimization & Regularization
> Weight Decay - The earliest improvement to make models generalize by penalizing them for large weights
> ReLU - Game-changing activation function that enabled sparse representations in neural networks for the first time
> Residuals - Solved the vanishing and exploding gradient problems, enabling deeper networks
> Dropout - Solved regularization by forcing neurons to learn robust representations (via blocking the effects of random neurons during training)
> BatchNorm - Solved the "internal covariate shift" problem which also enabled deeper networks
> LayerNorm - Made BatchNorm usable for sequential models
> GELU - A modern activation function merging the value of ReLU & Dropout and used in most models today
> Adam - Combined momentum with adaptive per-parameter learning rates on top of stochastic gradient descent to make models converge faster (sketched below)
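For reference, the full Adam update fits in a few lines; a generic numpy sketch with the standard default hyperparameters (not tied to any particular framework):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step (sketch): momentum (m) plus per-parameter scaling (v); t starts at 1."""
    m = b1 * m + (1 - b1) * grad            # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2       # second moment (adaptive scale)
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```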

Sequence Modeling
> RNN - Introduced the idea of sequence-modeling, which started the path that led us to the transformer
> LSTM - Made RNNs actually useful by introducing "gated" memory to learn long-term relationships between inputs
> The Forget Gate - Added the ability for LSTMs to "learn to forget" which made them capable of processing long sequences of text
> Word2Vec (& Phrase2Vec) - Introduced the first popular text embedding models, starting the trend that led us to the creation of CLIP
> Encoder-Decoder & Seq2Seq - Powerful text models built on RNNs and LSTMs (for machine translation) that directly set the stage for the transformer
> Attention - The core inductive bias behind transformers (sketched after this list). It was initially built on top of RNN/LSTM-based models; "Attention Is All You Need" later showed that you could remove everything else
> Mixture of Experts - The first effective implementation of "conditional computation" for neural networks that led to one of the advancements behind GPT-4
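Since attention is the core operation behind everything in the next section, here's a minimal single-head sketch in numpy (no masking, batching, or multi-head machinery):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (single head, no mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # weighted mix of the values

# toy shapes: sequence of 5 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)
```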

Transformers
> Transformer - The critical paper that completely changed the history of deep learning again, introducing an architecture capable of learning complex relationships & (importantly) highly parallelizable in training.
> BERT (& RoBERTa) - The first model to successfully execute the pre-training & fine-tuning paradigm, showing us what transformers were capable of
> T5 - Introduced the idea of the general "text-to-text" learning task that now underlies all LLMs
> GPT-2 & GPT-3 - No explanation needed. Most interesting here was their hard bet on the scaling laws (before they were consensus) and being right.
> LoRA - An efficient method for fine-tuning models by learning low-rank weight updates
> RLHF & InstructGPT - GPT-3 didn't really reach the mainstream until the creation of ChatGPT, enabled by the successful fine-tuning of an "assistant mode" introduced by these papers
> Vision Transformer - Introduced the ability for transformers to process images in "patches" which became critical for multi-modality

Image Generation
> GAN - The first effective approach to image synthesis, using the game-theoretic "adversarial" optimization of a generator and discriminator network
> VAE (& VQ-VAE, VQ-VAE-2) - Probabilistic approach to image synthesis that constrains the model to form low-dimensional representations of images, forcing the separation of high-level features and details
> Diffusion (& Denoising Diffusion, etc.) - Enabled the best current state-of-the-art image synthesis
> CLIP - The embedding model that first introduced the possibility for multi-modality, by compressing understanding of images and captions into a single representation space
> DALL-E (& DALL-E 2) - Building on VAEs, CLIP, and diffusion models to create state-of-the-art controlled image synthesis models
Step #2 ✅: Building core intuitions from each paper

I started by trying to understand the core intuitions and math for each paper.

Going through the early CNN, optimization, and regularization papers, this process was straightforward.

Each of these papers builds directly on top of the core of DNNs, and shows empirically the approaches that solved specific problems with scaling neural networks. Assuming a strong fundamental understanding of backpropagation, they were mostly intuitive.

Specifically, the framework of thinking about each advancement in terms of how it affects gradient flow in a neural network was particularly effective.

The specific math behind RNNs & LSTMs was a bit more challenging (it took some time to fully understand how gradient flow is manipulated by LSTM gates), but aside from that, the sequence modeling and transformer sections were also intuitive.

Many of the advancements in transformers after the original Attention Is All You Need paper are about modifying training objectives, small implementation details, and just scaling up the models.

However, when I got to the generative models section, I got hit with a completely new level of difficulty.

Getting through the papers for Variational Autoencoders and Diffusion models was brutal. Diffusion alone took me a few days to fully wrap my head around all the math (especially the equations in the original thermo diffusion & denoising diffusion papers).

Because these models draw their inspiration from thermodynamics (Langevin dynamics), they deal with concepts far more complex than the rest of the history of deep learning.
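For reference, the piece of the math that makes training tractable, the closed-form forward (noising) process, fits in a few lines. This is a generic DDPM-style sketch with the usual linear-beta schedule and placeholder shapes, not code from any of the papers above:

```python
import numpy as np

# Forward process of a DDPM-style diffusion model, in closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
# The denoising network is then trained to predict eps from (x_t, t).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Sample a noised x_t given a clean input x_0 at timestep t."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps   # eps is the regression target for the denoiser

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32, 3))        # stand-in "image"
xt, eps = q_sample(x0, t=500, rng=rng)
```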

It was painful getting through this part, but it felt great at the end when I was finally able to grasp the math.
Apr 25
I've spent the past ~2 weeks building a GPU from scratch with no prior experience. It was way harder than I expected.

Progress tracker in thread (coolest stuff at the end) 👇
Step 1 ✅: Learning the fundamentals of GPU architectures

I started by trying to understand how modern GPUs function down to the architecture level.

This was already harder than I anticipated - GPUs are proprietary tech, so there are few detailed learning resources online.

I started out trying to understand the GPU software pattern by learning about NVIDIA's CUDA framework.

This helped me understand the Single Instruction, Multiple Data (SIMD) programming pattern used to write GPU programs, called kernels.
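To illustrate the pattern, here's a toy emulation of a SIMD/SIMT-style kernel in plain Python (not CUDA, and heavily simplified): every "thread" runs the same function and uses its block/thread indices to pick which element it works on, while a real GPU runs these threads in parallel.

```python
# Toy emulation of the SIMD kernel pattern: same code, different data per thread.
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    i = block_idx * block_dim + thread_idx    # global thread id -> element index
    if i < len(a):                            # guard against out-of-range threads
        c[i] = a[i] + b[i]

n, block_dim = 10, 4
a = list(range(n)); b = [10 * x for x in a]; c = [0] * n
num_blocks = (n + block_dim - 1) // block_dim
for block_idx in range(num_blocks):           # a real GPU runs these in parallel
    for thread_idx in range(block_dim):
        vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c)
print(c)   # [0, 11, 22, ..., 99]
```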

With this context, I dove into learning about the core elements of GPUs:
> Global Memory - external memory that stores data & programs; accessing it is a huge bottleneck & constraint on GPU programming
> Compute Cores - the main compute units that execute kernel code in different threads in parallel
> Layered Caches - caches to minimize global memory access
> Memory Controllers - handle throttling requests to global memory
> Dispatcher - the main control unit of the GPU that distributes threads to available resources for execution

And then within each compute core, I learned about the main units:
> Registers - dedicated space to store data for each thread.
> Local/Shared Memory - memory shared between threads to pass data around to each other
> Load-Store Unit (LSU) - used to store/load data from global memory
> Compute Units - ALUs, SFUs, specialized graphics hardware, etc. to perform computations on register values
> Scheduler - manages resources in each core and plans when instructions from different threads get executed - much of GPU complexity lies here.
> Fetcher - retrieves instructions from program memory
> Decoder - decode instructions into control signals

This process gave me a good high-level understanding of the different units in modern GPUs.

But with so much complexity, I knew I had to cut down the GPU to the essentials for my own design or else my project would be extremely bloated.
Step 2 ✅: Creating my own GPU architecture

Next, I started to create my own GPU architecture based on what I learned.

My goal was to create a minimal GPU that highlights the core concepts of GPUs and removes unnecessary complexity so others could learn about GPUs more easily.

Designing my own architecture was an incredible exercise in deciding what really matters.

I went through several iterations of my architecture throughout this process as I learned more by building.

I decided to highlight the following in my design:
> Parallelization - How is the SIMD pattern implemented in hardware?
> Memory Access - How do GPUs handle the challenges of accessing lots of data from slow & limited bandwidth memory?
> Resource Management - How do GPUs maximize resource utilization & efficiency?

I wanted to highlight the broader use-cases of GPUs for general-purpose parallel computing (GPGPUs) & ML, so I decided to focus on the core functionality rather than graphics-specific hardware.

After many iterations, I finally landed on the following architecture, which I implemented in my actual GPU (everything is in its simplest form here).
Apr 11
I've spent the past ~2 weeks trying to make a chip from scratch with no prior experience. It's been an incredible source of learning so far.

Progress tracker in thread (coolest stuff at the end) 👇
Step 1 ✅: Learning the fundamentals of chip architecture

I started by learning how a chip works all the way from binary to C.

This part was critical.

In order to design a chip, you need a strong understanding of all the architecture fundamentals as you constantly work with logic, gates, memory, etc.

I reviewed the entire stack:
> Binary - using voltage to encode data
> Transistors - using semiconductors to create a digital switch
> CMOS - using transistors to build the first energy-efficient inverter
> Gates - using transistors to compute higher level boolean logic
> Combinatorial Logic - using gates to build boolean logic circuits
> Sequential Logic - using combinatorial logic to persist data
> Memory - using sequential logic to create storage systems for data
> CPU - combining memory & combinatorial logic to create the Von Neumann architecture, the first Turing-complete machine (ignoring memory constraints)
> Machine Language - how instructions in program memory map to control signals across the CPU
> Assembly - how assembly maps directly to the CPU
> C - how C compiles down into assembly and then machine language

Coming from a software background, fully understanding the connection between each of these layers in depth unlocked so many intuitions for me.

Planning to write a post about the full stack of compute soon (including all these layers + the relevant layers of chip fabrication & design).
Step 2 ✅: Learning the fundamentals of chip fabrication

Next, I learned about how transistors are actually fabricated.

Chip design tools are all built around specific fabrication processes (called process nodes), so I needed to understand this to fully grasp the chip design flow.

I focused on learning about:
> Materials - There are a huge number of materials required for semiconductor fabrication, including semiconductors, etchants, solvents, etc., each with specific qualities that merit their use.
> Wafer Preparation - Creating silicon wafers with poly-silicon crystals and "growing" silicon dioxide layers on them
> Patterning - The 10-step process using layering (oxidation/layer deposition/metallization), photo-lithography, and etching to create the actual transistor patterns on the chip
> Packaging - Packaging the chips in protective covers to prevent corruption, create I/O interfaces, help w/ heat dissipation, etc.
> Contamination - It was interesting learning how big a focus contamination control is and how critical it becomes as transistor sizes decrease

Within each of these topics, there's so much depth to go into - each part of the process has many different approaches, each using different materials & machines.

I focused more on getting a broad understanding of the important parts of the process.

The most important intuition here is that chips are produced by defining the layout of different layers.

The design of these layers is what you produce as the output of the chip design (EDA) process.
