There's been some back-and-forth about this paper on getting gradients without doing backpropagation, so I took a minute to write up an analysis of what breaks and how it might be fixed.

tl;dr: the estimated gradients are _really_ noisy! like wow

charlesfrye.github.io/pdfs/SNR-Forwa…
The main result I claim is an extension of Thm 1 in the paper. They prove that the _expected value_ of the gradient estimate is the true gradient, and I worked out the _variance_ of the estimate.
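A minimal sketch of the estimator under discussion, assuming the usual forward-gradient form g = (v · ∇f) v with v drawn from a standard normal. In practice v · ∇f would come from forward-mode autodiff without ever forming ∇f; here `true_grad` is a made-up fixed vector so the average of many samples can be compared against it.

```python
import random


def forward_gradient(grad, rng):
    """One forward-gradient sample: g = (v . grad) * v, with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in grad]
    # Directional derivative along v (what a JVP would return in practice).
    directional = sum(vi * gi for vi, gi in zip(v, grad))
    return [directional * vi for vi in v]


def mean_estimate(grad, n_samples, seed=0):
    """Average many forward-gradient samples, entry by entry."""
    rng = random.Random(seed)
    total = [0.0] * len(grad)
    for _ in range(n_samples):
        g = forward_gradient(grad, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


true_grad = [1.0, -2.0, 0.5]  # made-up gradient, for illustration only
estimate = mean_estimate(true_grad, 50_000)
print("true gradient:  ", true_grad)
print("mean of samples:", estimate)
```

The mean should land close to `true_grad`, but only after tens of thousands of samples, which is already a hint about the variance.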

It's big! Each entry's variance is at least the entire true gradient's squared norm 😬 [image]
(Sketch of the proof: nothing is correlated, everything has 0 mean and is symmetric around the origin, the only relevant terms are chi-squared r.v.s with known variances that get scaled by the gradient norms. gaussians are fun!)
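One way to fill in that sketch, as a reconstruction rather than the derivation in the linked note (which may organize it differently). It assumes g = (v · ∇f) v with v ~ N(0, I), writes d for ∇f, and uses only the Gaussian moments E[v_i v_j] = δ_ij and E[v_i^4] = 3.

```latex
\begin{align*}
g_i &= v_i \sum_j v_j d_j \\
\mathbb{E}[g_i] &= \sum_j d_j \, \mathbb{E}[v_i v_j] = d_i \\
\mathbb{E}[g_i^2] &= \sum_{j,k} d_j d_k \, \mathbb{E}[v_i^2 v_j v_k]
  = d_i^2 \, \mathbb{E}[v_i^4] + \sum_{j \neq i} d_j^2 \, \mathbb{E}[v_i^2]\,\mathbb{E}[v_j^2]
  = 3 d_i^2 + \sum_{j \neq i} d_j^2 \\
\operatorname{Var}(g_i) &= \mathbb{E}[g_i^2] - d_i^2 = \lVert d \rVert^2 + d_i^2
\end{align*}
```

The cross terms with j ≠ k vanish because odd Gaussian moments are zero. With many parameters, d_i^2 is tiny next to the squared norm, so each entry's variance is roughly the squared norm of the whole gradient.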
Informally, we say that "noisy gradients" are bad and slow down learning.

So I looked at the "signal to noise ratio" between the true gradient value and the variance of the estimate.

It's bad! If you're scaling your gradients properly, it gets worse as you add parameters. [image]
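A rough illustration of that scaling. Two assumptions here are not taken from the thread: the per-entry variance is taken to be ‖∇f‖² + (∇f_i)² (what the Gaussian moments give), and "signal" is defined as the squared true entry. The gradients are made up: unit norm, all entries equal.

```python
def snr_per_entry(grad):
    """Squared true entry divided by that entry's forward-gradient variance.

    Assumes Var(g_i) = ||grad||^2 + grad_i^2 for v ~ N(0, I).
    """
    sq_norm = sum(g * g for g in grad)
    return [g * g / (sq_norm + g * g) for g in grad]


for n in (10, 1_000, 100_000):
    grad = [n ** -0.5] * n  # hypothetical unit-norm gradient with equal entries
    print(f"n = {n:>7}  SNR per entry = {snr_per_entry(grad)[0]:.2e}")
```

For this equal-entry case the ratio works out to 1/(n + 1), so holding the gradient norm fixed while adding parameters drives the per-entry SNR toward zero.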
(FYI, I sanity-checked my result by pulling gradients from a PyTorch MNIST example and checking the true gradient's squared norm against the average variance of each entry, which should be nearly equal. And they were super close!) [image]
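A stand-in for that kind of sanity check, not the PyTorch MNIST version: it uses a made-up 50-entry gradient and plain Monte Carlo, then compares the average per-entry variance of the forward gradient against the true gradient's squared norm.

```python
import random


def entry_variances(grad, n_samples, seed=0):
    """Empirical variance of each entry of g = (v . grad) * v, v ~ N(0, I)."""
    rng = random.Random(seed)
    n = len(grad)
    sums = [0.0] * n
    sums_sq = [0.0] * n
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        directional = sum(vi * gi for vi, gi in zip(v, grad))
        for i in range(n):
            g_i = directional * v[i]
            sums[i] += g_i
            sums_sq[i] += g_i * g_i
    return [sums_sq[i] / n_samples - (sums[i] / n_samples) ** 2 for i in range(n)]


grad_rng = random.Random(1)
grad = [grad_rng.uniform(-1.0, 1.0) for _ in range(50)]  # made-up gradient
variances = entry_variances(grad, 10_000)
sq_norm = sum(g * g for g in grad)
mean_variance = sum(variances) / len(variances)
print(f"squared norm of true gradient: {sq_norm:.3f}")
print(f"average per-entry variance:    {mean_variance:.3f}")
```

The two numbers should agree up to sampling noise and a small (1 + 1/n) factor, which shrinks as the number of entries n grows.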
I give some intuitions for the variance, and for the general distribution of the forward gradients (g), based on product distributions and large random vectors. [image]
In that paragraph I mention some simulations (related to the sanity check above). I didn't include the plots, but here they are! The alignment between the forward grad and the true gradient is all over the place -- and way worse than randomness from minibatch effects. [image]
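One way to look at alignment, as a sketch and not the simulation behind those plots: draw forward gradients for a made-up 1000-entry gradient and measure the cosine similarity between each sample and the true gradient. The choice of cosine similarity as the alignment measure is an assumption.

```python
import math
import random


def cosine_to_true_gradient(grad, rng):
    """Cosine similarity between one forward-gradient sample and grad."""
    v = [rng.gauss(0.0, 1.0) for _ in grad]
    directional = sum(vi * gi for vi, gi in zip(v, grad))
    g = [directional * vi for vi in v]
    dot = sum(a * b for a, b in zip(g, grad))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_grad = math.sqrt(sum(b * b for b in grad))
    return dot / (norm_g * norm_grad)


rng = random.Random(0)
grad = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # made-up gradient
cosines = [cosine_to_true_gradient(grad, rng) for _ in range(200)]
print(f"min  cosine: {min(cosines):.4f}")
print(f"mean cosine: {sum(cosines) / len(cosines):.4f}")
print(f"max  cosine: {max(cosines):.4f}")
```

Because g · ∇f = (v · ∇f)², a single sample never points against the true gradient, but in high dimensions it is close to orthogonal to it, so most of each step is spent on directions unrelated to the descent direction.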
The paper could've said more about the weaknesses of forward gradients (FG), but I don't think it's a useless idea.

So I wrote some suggestions. For example, if you already have a good prior about the gradient direction, maybe you could sample from it instead of a unit normal? [image]
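One possible reading of that suggestion, sketched under loud assumptions: the "prior" is a single guessed direction `prior_dir`, and v is drawn from N(0, I + boost · u uᵀ), where u is that direction normalized. Both `boost` and this covariance form are hypothetical choices, not anything from the paper or the note. Note that with a non-identity covariance Σ the estimator's mean becomes Σ∇f instead of ∇f, so it is no longer unbiased without a correction.

```python
import math
import random


def sample_direction(prior_dir, boost, rng):
    """Draw v ~ N(0, I + boost * u u^T), with u = prior_dir / ||prior_dir||."""
    norm = math.sqrt(sum(p * p for p in prior_dir))
    u = [p / norm for p in prior_dir]
    z = [rng.gauss(0.0, 1.0) for _ in u]
    # (I + c u u^T) z has covariance I + boost * u u^T when (1 + c)^2 = 1 + boost.
    c = math.sqrt(1.0 + boost) - 1.0
    uz = sum(ui * zi for ui, zi in zip(u, z))
    return [zi + c * uz * ui for zi, ui in zip(z, u)]


def mean_alignment(grad, prior_dir, boost, n_samples, seed=0):
    """Average cosine between g = (v . grad) * v and grad."""
    rng = random.Random(seed)
    norm_grad = math.sqrt(sum(g * g for g in grad))
    total = 0.0
    for _ in range(n_samples):
        v = sample_direction(prior_dir, boost, rng)
        directional = sum(vi * gi for vi, gi in zip(v, grad))
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        # cos(g, grad) simplifies to |v . grad| / (||v|| * ||grad||).
        total += abs(directional) / (norm_v * norm_grad)
    return total / n_samples


grad_rng = random.Random(1)
grad = [grad_rng.uniform(-1.0, 1.0) for _ in range(100)]  # made-up gradient
baseline = mean_alignment(grad, grad, boost=0.0, n_samples=300)
boosted = mean_alignment(grad, grad, boost=200.0, n_samples=300)
print(f"alignment, standard normal:  {baseline:.3f}")
print(f"alignment, perfect prior:    {boosted:.3f}")
```

This uses the true gradient itself as the prior, which is the best case; a real prior (say, a previous step's gradient) would sit somewhere between the two numbers.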
@theshawwn i saw you expressing interest in the forward gradient stuff and reasonable skepticism about the value of MNIST experiments

this is a fairly rigorous argument that the gradient noise is too high for fwd grads, as is, to work in large models
For more details, especially on the derivation of the variance, see this short note I wrote up: charlesfrye.github.io/pdfs/SNR-Forwa…
