Charles 🎉 Frye
Feb 25, 2022 · 10 tweets
There's been some back-and-forth about this paper on getting gradients without doing backpropagation, so I took a minute to write up an analysis of what breaks and how it might be fixed.

tl;dr: the estimated gradients are _really_ noisy! like wow

charlesfrye.github.io/pdfs/SNR-Forwa…
The main result I claim is an extension of Thm 1 in the paper. They prove that the _expected value_ of the gradient estimate is the true gradient, and I worked out the _variance_ of the estimate.

It's big! Each entry has variance roughly equal to the entire true gradient's squared norm 😬
(Sketch of the proof: nothing is correlated, everything has mean 0 and is symmetric around the origin, so the only relevant terms are chi-squared r.v.s with known variances, scaled by the gradient's norm. Gaussians are fun!)
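The claim is easy to check numerically. Here's a minimal NumPy sketch -- the dimension, the sample count, and the i.i.d.-normal stand-in for a real gradient are my choices for illustration, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
true_grad = rng.normal(size=d)  # stand-in for a real gradient

# forward gradient estimate: g_hat = (grad . v) v, with probe v ~ N(0, I_d)
n = 100_000
v = rng.normal(size=(n, d))
g_hat = (v @ true_grad)[:, None] * v

# unbiased: averaging the estimates recovers the true gradient
mean_err = np.abs(g_hat.mean(axis=0) - true_grad).max()

# noisy: each entry's variance is ||grad||^2 + grad_i^2 -- dominated by
# the *whole* gradient's squared norm, whichever entry you look at
emp_var = g_hat.var(axis=0)
pred_var = np.dot(true_grad, true_grad) + true_grad ** 2
```

The `true_grad ** 2` correction term is the exact constant I get; in high dimension it's negligible next to the squared norm, so "each entry is as noisy as the whole gradient is big" is the takeaway.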
Informally, we say that "noisy gradients" are bad and slow down learning.

So I looked at the "signal to noise ratio" between the true gradient value and the variance of the estimate.

It's bad! If you're scaling your gradients properly, it gets worse as you add parameters.
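You can see that scaling in a few lines (using an i.i.d.-normal stand-in gradient, a toy assumption rather than real model gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
snrs = {}
for d in (10, 100, 1_000, 10_000):
    g = rng.normal(size=d)           # stand-in gradient with d parameters
    noise_std = np.linalg.norm(g)    # per-entry noise std of the estimate
    snrs[d] = np.mean(np.abs(g) / noise_std)
```

The per-entry SNR |g_i| / ||g|| shrinks like 1/sqrt(d): ten times the parameters buys you roughly a third of the signal.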
(FYI, I sanity-checked my result by pulling gradients from a PyTorch MNIST example and checking the true gradient's squared norm against the average variance of each entry, which should be equal. And they were super close!)
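For concreteness, one way to form a single forward-gradient sample in PyTorch is `torch.func.jvp`, which computes the directional derivative in one forward pass. The toy quadratic loss (chosen so the true gradient is known to be 2w) and the sizes are mine, not the paper's:

```python
import torch

def loss(w):
    return (w ** 2).sum()   # toy loss with known gradient: 2 * w

torch.manual_seed(0)
w = torch.randn(1_000)
v = torch.randn(1_000)      # random probe direction

# one forward pass gives the directional derivative grad . v ...
_, dir_deriv = torch.func.jvp(loss, (w,), (v,))
# ... and scaling the probe by it gives the forward-gradient estimate
fwd_grad = dir_deriv * v
```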
I give some intuitions for the variance, and for the general distribution of the forward gradients (g), based on product distributions and large random vectors.
In that paragraph I mention some simulations (related to the sanity check above). I didn't include the plots, but here they are! The alignment between the forward grad and the true gradient is all over the place -- and way worse than randomness from minibatch effects.
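The alignment story also follows from the formula: the cosine similarity between the estimate (g·v)v and g works out to |g·v| / (||v|| ||g||), which is never negative but concentrates near zero as the dimension grows. A quick simulation with an i.i.d.-normal stand-in gradient (my toy setup, not the minibatch experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2_000
g = rng.normal(size=d)

v = rng.normal(size=(2_000, d))   # one probe per row
g_hat = (v @ g)[:, None] * v
cos = (g_hat @ g) / (np.linalg.norm(g_hat, axis=1) * np.linalg.norm(g))
# alignment is nonnegative but tiny: roughly |z| / sqrt(d) for z ~ N(0, 1)
```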
More could've been said about the weaknesses of FG in the paper, but I don't think it's a useless idea.

So I wrote some suggestions. For example, if you already have a good prior about the gradient direction, maybe you could sample from it instead of a unit normal?
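As a toy illustration of that suggestion -- my own sketch, not anything from the paper or the note; the "oracle" prior and the blend weight are made up, and the tilted probes stay unbiased here only because the prior happens to be exact -- probes drawn near the gradient direction cut the estimator's error several-fold:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
g = rng.normal(size=d)
u = g / np.linalg.norm(g)   # "oracle" prior: the true gradient direction

def mse(draw_v, n=20_000):
    """Mean squared error of the forward-gradient estimate (g . v) v."""
    v = draw_v(n)
    g_hat = (v @ g)[:, None] * v
    return ((g_hat - g) ** 2).sum(axis=1).mean()

iso = lambda n: rng.normal(size=(n, d))           # standard unit-normal probes
a = 0.9  # how strongly to tilt probes toward the prior (made-up value)
tilted = lambda n: np.sqrt(1 - a**2) * rng.normal(size=(n, d)) + a * u
```

As the tilt a → 1 the probe is exactly the gradient direction and the estimate becomes exact; the open question is what happens with realistic, imperfect priors.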
@theshawwn i saw you expressing interest in the forward gradient stuff and reasonable skepticism about the value of MNIST experiments

this is a fairly rigorous argument that the gradient noise is too high for fwd grads, as is, to work in large models
For more details, especially on the derivation of the variance, see this short note I wrote up: charlesfrye.github.io/pdfs/SNR-Forwa…

