adammaj
research @openai // ex: founding eng @thirdweb, residence @baincapvc // cs + neuro (on leave) @Penn
Dec 18, 2024 • 7 tweets • 21 min read
I've spent the past ~6 weeks going through the entire history of robotics and understanding all the core research breakthroughs.

It has completely changed my beliefs about the future of humanoid robotics.

My process + biggest takeaways in thread (full resource at the end) 👇

Step #1 ✅: Learning the fundamentals of robotics from papers

Over the past 2 years, $100Ms have been deployed into the robotics industry to fund the humanoid arms race.

From twitter hype, public sentiment, and recent demos, it seemed to me that fully autonomous general-purpose robotics was right around the corner (~2-3 years away).

With this in mind, I decided to learn the fundamentals of robotics directly from the primary source: the series of research papers that have gotten us to the current wave of humanoid robotics.

As I got farther into the research, I realized that it pointed to a very different future, and potentially much longer timelines, than what current narratives may suggest (discussed in detail in the full resource at the end).

I initially focused on learning about the following topics/trail of breakthroughs that led us to where we are today:

(the repository at the end of the thread covers all of these + more in greater detail)

Classical Robotics
> SLAM - Simultaneous localization and mapping systems allow a robot to convert various sensor data into a 3D map of the environment and to understand its place within it.
> Hierarchical Task Planning - Early robotic planning systems used hierarchical/symbolic models to organize the series of tasks to execute.
> Path Planning - Sampling based path-finding algorithms used to find best-effort paths that avoid collisions in high-dimensional environments.
> Forward Kinematics/Dynamics - Using physics models to predict where the robot will move given specific actuator inputs.
> Inverse Kinematics/Dynamics - Using physics models to predict how to control actuators given a target position (a minimal two-link sketch follows this list).
> Contact Modeling - Modeling the precise friction and torque forces at contact points to understand when objects are fully controlled (form closure) and how robotic movement will manipulate objects.
> Zero-Moment Point (ZMP) - The point where all the forces around a robot's foot cancel each other, used in calculations to enable robot locomotion and balance.
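
To make the forward/inverse kinematics idea concrete, here's a minimal sketch for a hypothetical 2-link planar arm (the link lengths and the elbow-down solution branch are my own illustrative choices, not from the resource):

```python
import numpy as np

def forward_kinematics(theta1, theta2, l1=1.0, l2=0.8):
    """End-effector (x, y) of a 2-link planar arm given joint angles in radians."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

def inverse_kinematics(x, y, l1=1.0, l2=0.8):
    """One analytic solution for joint angles that reach (x, y), assuming it's reachable."""
    c2 = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    c2 = np.clip(c2, -1.0, 1.0)   # guard against small numerical drift
    theta2 = np.arccos(c2)        # "elbow-down" branch; -theta2 is the other solution
    theta1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(theta2), l1 + l2 * np.cos(theta2))
    return theta1, theta2

# round-trip check: IK then FK should recover the target point
print(forward_kinematics(*inverse_kinematics(1.2, 0.5)))  # ~(1.2, 0.5)
```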

Deep Reinforcement Learning
> Deep Q-Networks - The earliest successful deep RL algorithms that first rose to popularity with the success of DQN on Atari.
> A3C/GAE - Allow RL systems to learn from long-horizon rewards, critical for most robotic tasks with delayed feedback.
> TRPO/PPO - Introduced an effective method for step-sizing parameter updates, which is especially critical for training RL policies (PPO's clipped objective is sketched after this list).
> DDPG/SAC - Introduced more sample-efficient RL algorithms that could reuse data multiple times, valuable for training real-world robotic control policies where data collection is expensive.
> Curiosity - Using curiosity as a reward signal to train RL systems in complex environments, which has proven valuable for recent robotic foundational models.
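
As a concrete (if simplified) illustration of why PPO's step-sizing matters, here's a sketch of its clipped surrogate loss; the tensors and clip value below are toy stand-ins:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: cap how far the policy ratio can move per update."""
    ratio = torch.exp(new_logp - old_logp)                       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # minimize the negative objective

# toy usage with made-up log-probs and advantages
old_logp = torch.tensor([-1.0, -0.7, -2.3])
new_logp = old_logp + torch.tensor([0.1, -0.2, 0.3])
advantages = torch.tensor([0.5, -1.2, 2.0])
print(ppo_clip_loss(new_logp, old_logp, advantages))
```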

Simulation & Imitation Learning
> MuJoCo - Simulation software built specifically for the needs of robotics that introduced more accurate joint and contact modeling. Enabled a series of research breakthroughs in training robots in simulation.
> Domain Randomization - Randomizing objects, textures, lighting, and other environmental conditions in simulation to get robots to generalize to the complexity of real-world environments (a toy example follows this list).
> Dynamics Randomization - Randomizing the laws of physics in simulation to teach robots to treat the real world as just another random physics engine to generalize to, bridging the simulation-to-reality gap.
> Simulation Optimization - Optimizing the specific domain/dynamics randomization levels to enable robot policies to train quickly in simulation.
> Behavior Cloning - Learning control policies by trying to clone the behavior of a human demonstrator. This approach has now given rise to tele-operated robotics.
> Dataset Aggregation (DAgger) - Using a data-collection loop between RL algorithms and experts to create more robust datasets for imitation learning.
> Inverse Reinforcement Learning (IRL) - Learning RL policies from demonstrators by trying to infer their reward function (i.e. trying to understand what the demonstrator's goals are).
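
A toy sketch of what domain + dynamics randomization looks like in practice; the parameter names and ranges here are hypothetical, not taken from any specific paper:

```python
import random

def sample_randomized_env_params():
    """Sample one randomized training environment: visuals (domain randomization)
    plus physics (dynamics randomization)."""
    return {
        # domain randomization: visuals the policy should learn to ignore
        "light_intensity": random.uniform(0.3, 1.5),
        "table_texture": random.choice(["wood", "metal", "checker", "noise"]),
        "camera_jitter_deg": random.uniform(-5.0, 5.0),
        # dynamics randomization: physics the policy should learn to adapt to
        "friction": random.uniform(0.5, 1.2),
        "object_mass_scale": random.uniform(0.8, 1.3),
        "actuator_delay_ms": random.uniform(0.0, 40.0),
    }

# each training episode gets its own "random physics engine"
for episode in range(3):
    params = sample_randomized_env_params()
    print(episode, params)
    # env = make_sim_env(**params)   # hypothetical simulator constructor
    # rollout_policy(env)            # hypothetical training rollout
```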

Generalization & Robotic Transformers
> End-to-end Learning - Researchers started to train robotic control policies with a single integrated visual and motor system, rather than with separate components. This started the end-to-end integration trend that has continued today.
> Tele-operation (BC-Z, ALOHA) - Cheaper tele-operation hardware and better training methods enabled the first capable robots trained with large volumes of data from human demonstrators operating robots. This is now the most widely used method for training frontier robotic control systems.
> Robotic Transformer (RT1) - The first successful transformer based robotics model that used a series of images and a text prompt passed to a transformer to output action tokens that could operate actuators.
> Grounding Language Models (SayCan) - Trained an LLM to understand a robot's capabilities, allowing it to carry out robotic task planning based on images and text that was grounded in reality.
> Action Chunking Transformer (ACT) - A transformer-based robotic architecture that introduced "action chunking," allowing the model to predict the next series of actions instead of just a single time step. This enabled far smoother motor control (a minimal chunked-policy sketch follows this list).
> Vision-Language-Action Models (RT2) - Used modern vision-language models (VLMs) pre-trained on internet-scale data and fine-tuned them with robotic action data to achieve state of the art results. Arguably the most impactful milestone in recent robotics research.
> Cross-embodiment (PI0) - Training robotics models that work on a variety of different hardware systems, allowing the model to generalize beyond an individual robot to understand broad manipulation. Combined ACT + VLA + diffusion into a single SOTA model.
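
Here's a minimal sketch of the action-chunking idea (not ACT's full architecture): from a single observation embedding, predict the next k low-level actions at once. The dimensions below are illustrative guesses:

```python
import torch
import torch.nn as nn

class ChunkedPolicyHead(nn.Module):
    """Action chunking in miniature: predict the next `chunk_size` actions from one
    observation embedding, instead of a single timestep's action."""
    def __init__(self, obs_dim=512, action_dim=7, chunk_size=16):
        super().__init__()
        self.chunk_size, self.action_dim = chunk_size, action_dim
        self.head = nn.Linear(obs_dim, chunk_size * action_dim)

    def forward(self, obs_embedding):                  # (batch, obs_dim)
        out = self.head(obs_embedding)                 # (batch, chunk_size * action_dim)
        return out.view(-1, self.chunk_size, self.action_dim)

policy = ChunkedPolicyHead()
obs = torch.randn(1, 512)        # stand-in for a transformer encoder's output
action_chunk = policy(obs)       # (1, 16, 7): 16 future joint commands, executed before re-planning
print(action_chunk.shape)
```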

Note that I didn't focus on robotic hardware in this thread because most recent progress in robotics has been on the software side; the hardware aspects are covered in the full repository at the end of the thread.
Oct 16, 2024 • 11 tweets • 12 min read
I’ve spent the past ~4 weeks understanding how the energy industry will define our future.

I synthesized everything I learned into a single resource.

My biggest takeaways in thread + complete resource at the end 👇

Part 1 ⚛: Understanding the fundamentals of energy

My goal was to understand the full picture of energy, which is rarely covered in one place because of how big the industry is.

I thought this would be particularly interesting given the recent focus on energy as a primary constraint limiting progress in deep learning.

I wanted to go deep on:
1. the fundamentals of energy physics
2. how energy usage has shaped human civilization
3. modern energy usage
4. energy production
5. energy distribution
6. energy storage
7. how energy relates to geopolitics
8. current trends in the energy industry
9. how energy relates to the future of technology

I started by focusing on energy physics to understand:
> What actually is energy?
> How do energy exchanges define what we can do?
> How do the laws of thermodynamics limit our energy consumption?

I learned that energy is much more fundamental and intuitive than I expected:
> Energy is simply the ability to cause change in the universe
> All energy is a result of particles interacting with the 4 fundamental forces
> Energy exchange is really a change in the types of things carrying energy
> The amount of useful energy in the universe is always decreasing (see the worked bound after this list)
> Energy is the fundamental constraint on economic growth, to a degree I didn’t expect
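
One way to make the thermodynamics points concrete (my illustration, not from the resource): the Carnot bound from the second law caps how much heat any engine can ever turn into useful work.

```latex
% Maximum fraction of heat drawn at temperature T_H that can become work,
% when waste heat is rejected to a reservoir at T_C:
\eta_{\max} = 1 - \frac{T_C}{T_H}
% e.g. T_H = 1500\,\mathrm{K}, T_C = 300\,\mathrm{K} \Rightarrow \eta_{\max} = 0.8,
% and real power plants land well below this bound.
```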

This gave me a much clearer framework to contextualize the rest of my deep dive.
May 25, 2024 • 8 tweets • 20 min read
I've spent the past ~3 weeks going through the entire history of deep learning and reimplementing all the core breakthroughs.

It has completely changed my beliefs about deep learning progress and where we're headed.

Progress tracker in thread (all resources at the end) 👇

Step #1 ✅: Learning the fundamentals of deep learning from papers

I wanted to learn about the fundamentals of deep learning directly from the source of progress: the critical papers that have gotten us from simple feed-forward networks to models like GPT-4o.

I suspected that this would show me broader trends and intuitions that aren't obvious when learning about AI through popular courses, textbooks, or public narratives.

This approach turned out to be critical.

I focused on learning about the following trail of breakthroughs that led us to where we are today:

(the repo later in this thread includes my in-depth explanations of core intuitions, math, and implementations (when relevant) for each of these, for anyone curious)

Early Neural Networks & CNNs
> Backpropagation - The foundational algorithm that made gradient descent work for deep networks (a tiny hand-rolled example follows this list)
> LeNet - An early convolutional neural net that showed signs of beating traditional ML models at digit recognition
> AlexNet - Completely changed the history of deep learning and brought new focus onto the field by beating the state-of-the-art for image classification. This is where the broader community started taking deep learning seriously.
> U-Net - An effective image-to-image architecture based on the CNN that's now used in all diffusion models
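
To ground the backprop item above, here's a tiny hand-rolled example: a one-hidden-layer net fit to y = 2x, with the gradients written out manually via the chain rule (the sizes and learning rate are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 1))
Y = 2 * X                               # toy target: y = 2x

W1 = rng.normal(0, 0.5, (1, 8))         # input -> hidden
W2 = rng.normal(0, 0.5, (8, 1))         # hidden -> output
lr = 0.1
for step in range(500):
    h = np.maximum(0, X @ W1)           # forward: ReLU hidden layer
    pred = h @ W2                       # forward: linear output
    err = pred - Y
    loss = (err ** 2).mean()

    # backward: the chain rule applied layer by layer -- this is all backprop is
    d_pred = 2 * err / len(X)
    dW2 = h.T @ d_pred
    d_h = (d_pred @ W2.T) * (h > 0)     # gradient only flows through active ReLUs
    dW1 = X.T @ d_h

    W1 -= lr * dW1                      # gradient descent updates
    W2 -= lr * dW2

print(float(loss))                      # should end up small
```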

Optimization & Regularization
> Weight Decay - The earliest improvement to make models generalize by penalizing them for large weights
> ReLU - Game-changing activation function that enabled sparse representations in neural networks for the first time
> Residuals - Solved the vanishing and exploding gradient problems, enabling deeper networks
> Dropout - Solved regularization by forcing neurons to learn robust representations (via blocking the effects of random neurons during training)
> BatchNorm - Solved the "internal covariate shift" problem which also enabled deeper networks
> LayerNorm - Made BatchNorm usable for sequential models
> GELU - A modern activation function merging the value of ReLU & Dropout and used in most models today
> Adam - Added momentum and per-parameter adaptive step sizes to stochastic gradient descent to make models converge faster (the update rule is sketched after this list)
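
A sketch of the Adam update itself, applied to a toy 1-D problem (the hyperparameters are the usual defaults, but this is just an illustration):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient (m) plus a per-parameter
    adaptive step size from the running squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction for the first few steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# toy usage: minimize f(x) = x^2 starting from x = 5
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
print(round(x, 4))                      # ends up close to 0
```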

Sequence Modeling
> RNN - Introduced the idea of sequence-modeling, which started the path that led us to the transformer
> LSTM - Made RNNs actually useful by introducing "gated" memory to learn long-term relationships between inputs
> The Forget Gate - Added the ability for LSTMs to "learn to forget" which made them capable of processing long sequences of text
> Word2Vec (& Phrase2Vec) - Introduced the first popular text embedding models, starting the trend that led us to the creation of CLIP
> Encoder-Decoder & Seq2Seq - Powerful text models built on RNNs and LSTMs (for machine translation) that directly set the stage for the transformer
> Attention - The core inductive bias behind transformers. It was initially built on top of RNN/LSTM-based models; "Attention Is All You Need" later showed you could remove everything else (scaled dot-product attention is sketched after this list)
> Mixture of Experts - The first effective implementation of "conditional computation" for neural networks that led to one of the advancements behind GPT-4
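
Since attention is the hinge of this whole section, here's scaled dot-product attention in a few lines of numpy (the dimensions are toy values, and in a real transformer Q, K, V come from learned projections of the input):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes the values V, weighted by
    how well its query matches every key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over keys
    return weights @ V

seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
print(attention(x, x, x).shape)   # self-attention: Q, K, V all derived from the same sequence
```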

Transformers
> Transformer - The critical paper that completely changed the history of deep learning again, introducing an architecture that could learn complex relationships and (importantly) was highly parallelizable in training.
> BERT (& RoBERTa) - The first model to successfully execute the pre-training & fine-tuning paradigm, showing us what transformers were capable of
> T5 - Introduced the idea of the general "text-to-text" learning task that now underlies all LLMs
> GPT-2 & GPT-3 - No explanation needed. Most interesting here was their hard bet on the scaling laws (before they were consensus) and being right.
> LoRA - An efficient method for fine-tuning models (which also showed us something interesting about the low intrinsic rank of the weight updates needed for fine-tuning)
> RLHF & InstructGPT - GPT-3 didn't really reach the mainstream until the creation of ChatGPT, enabled by the successful fine-tuning of an "assistant mode" introduced by these papers
> Vision Transformer - Introduced the ability for transformers to process images in "patches" which became critical for multi-modality

Image Generation
> GAN - The first effective approach to image synthesis, using the game-theoretic "adversarial" optimization of a generator and discriminator network
> VAE (& VQ-VAE, VQ-VAE-2) - Probabilistic approach to image synthesis that constrains the model to form low-dimensional representations of images, forcing the separation of high-level features and details
> Diffusion (& Denoising Diffusion, etc.) - Enabled the current state of the art in image synthesis (the forward noising process is sketched after this list)
> CLIP - The embedding model that first introduced the possibility for multi-modality, by compressing understanding of images and captions into a single representation space
> DALL-E (& DALL-E 2) - Built on VAEs, CLIP, and diffusion models to create state-of-the-art controlled image synthesis models
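
To make the diffusion item concrete, here's the forward (noising) half of a DDPM-style process; the schedule values are the commonly used linear defaults, and the "image" is a random stand-in:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)        # cumulative signal-retention factor

def noise_sample(x0, t, rng=np.random.default_rng(0)):
    """Jump straight to timestep t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * noise
    return x_t, noise

x0 = np.random.rand(32, 32)                 # stand-in for a normalized image
x_t, eps = noise_sample(x0, t=500)
# training: a network predicts `eps` from (x_t, t); sampling runs this chain in reverse
print(x_t.shape)
```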
Apr 25, 2024 • 11 tweets • 9 min read
I've spent the past ~2 weeks building a GPU from scratch with no prior experience. It was way harder than I expected.

Progress tracker in thread (coolest stuff at the end)πŸ‘‡Image Step 1 βœ…: Learning the fundamentals of GPU architectures

I started by trying to understand how modern GPUs function down to the architecture level.

This was already harder than I anticipated - GPUs are proprietary tech, so there are few detailed learning resources online.

I started out trying to understand the GPU software pattern by learning about NVIDIA's CUDA framework.

This helped me understand the Single Instruction, Multiple Data (SIMD) programming pattern used to write GPU programs called kernels (a toy sketch of the pattern is below).
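
A toy Python sketch of that pattern (just to illustrate the indexing, not real GPU code): every "thread" runs the same kernel and only its block/thread index differs, mirroring CUDA's blockIdx/threadIdx arithmetic:

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Every thread runs this same code; its index decides which element it owns
    (mirroring i = blockIdx.x * blockDim.x + threadIdx.x in CUDA)."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):                               # bounds guard, just like real kernels
        out[i] = a[i] + b[i]

n, block_dim = 10, 4
a, b, out = list(range(n)), [10] * n, [0] * n
num_blocks = (n + block_dim - 1) // block_dim      # enough blocks to cover all n elements
for block in range(num_blocks):                    # real hardware runs these in parallel
    for thread in range(block_dim):
        vector_add_kernel(block, thread, block_dim, a, b, out)
print(out)                                         # [10, 11, 12, ..., 19]
```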

With this context, I dove into learning about the core elements of GPUs:
> Global Memory - external memory that stores data & programs; accessing it is a huge bottleneck & constraint on GPU programming
> Compute Cores - the main compute units that execute kernel code in different threads in parallel
> Layered Caches - caches to minimize global memory access
> Memory Controllers - handle throttling of requests to global memory
> Dispatcher - the main control unit of the GPU that distributes threads to available resources for execution

And then within each compute core, I learned about the main units:
> Registers - dedicated space to store data for each thread.
> Local/Shared Memory - memory shared between threads to pass data around to each other
> Load-Store Unit (LSU) - used to store/load data from global memory
> Compute Units - ALUs, SFUs, specialized graphics hardware, etc. to perform computations on register values
> Scheduler - manages resources in each core and plans when instructions from different threads get executed; much of a GPU's complexity lies here.
> Fetcher - retrieves instructions from program memory
> Decoder - decode instructions into control signals

This process gave me a good high-level understanding of the different units in modern GPUs.

But with so much complexity, I knew I had to cut down the GPU to the essentials for my own design or else my project would be extremely bloated.
Apr 11, 2024 • 10 tweets • 8 min read
I've spent the past ~2 weeks trying to make a chip from scratch with no prior experience. It's been an incredible source of learning so far.

Progress tracker in thread (coolest stuff at the end) 👇

Step 1 ✅: Learning the fundamentals of chip architecture

I started by learning how a chip works all the way from binary to C.

This part was critical.

In order to design a chip, you need a strong understanding of all the architecture fundamentals as you constantly work with logic, gates, memory, etc.

I reviewed the entire stack:
> Binary - using voltage to encode data
> Transistors - using semiconductors to create a digital switch
> CMOS - using transistors to build the first energy-efficient inverter
> Gates - using transistors to compute higher level boolean logic
> Combinatorial Logic - using gates to build boolean logic circuits
> Sequential Logic - using combinatorial logic with feedback to persist data (a toy example of both follows this list)
> Memory - using sequential logic to create storage systems for data
> CPU - combining memory & combinatorial logic to create the von Neumann architecture, which is Turing complete (ignoring memory constraints)
> Machine Language - how instructions in program memory map to control signals across the CPU
> Assembly - how assembly maps directly to the CPU
> C - how C compiles down into assembly and then machine language
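
A toy example of the combinatorial-vs-sequential split (my own illustration in Python rather than an HDL): gates built from NAND are pure functions of their inputs, while a D flip-flop holds state across clock edges.

```python
# Combinatorial logic: outputs depend only on the current inputs (everything built from NAND).
def nand(a, b): return 0 if (a and b) else 1
def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))

# Sequential logic: a D flip-flop persists one bit across clock edges --
# the building block that memory is made of.
class DFlipFlop:
    def __init__(self):
        self.q = 0                 # stored bit
    def clock(self, d):
        self.q = d                 # capture the input on the clock edge
        return self.q

print(and_(1, 1), or_(0, 0), not_(1))   # 1 0 0
ff = DFlipFlop()
ff.clock(1)
print(ff.q)                             # stays 1 until the next clock edge changes it
```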

Coming from a software background, fully understanding the connection between each of these layers in depth unlocked so many intuitions for me.

Planning to write a post about the full stack of compute soon (including all these layers + the relevant layers of chip fabrication & design)