Charles 🎉 Frye

@charles_irl

Feb 25, 2022 • 10 tweets • 4 min read • Read on X

https://twitter.com/arankomatsuzaki/status/1494488254304989228

There's been some back-and-forth about this paper on getting gradients without doing backpropagation, so I took a minute to write up an analysis on what breaks and how it might be fixed.

tl;dr: the estimated gradients are _really_ noisy! like wow

charlesfrye.github.io/pdfs/SNR-Forwa…

The main result I claim is an extension of Thm 1 in the paper. They prove that the _expected value_ of the gradient estimate is the true gradient, and I worked out the _variance_ of the estimate.

It's big! Each entry has variance roughly equal to the entire true gradient's squared norm😬

(Sketch of the proof: nothing is correlated, everything has 0 mean and is symmetric around the origin, the only relevant terms are chi-squared r.v.s with known variances that get scaled by the gradient norms. gaussians are fun!)
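A reconstruction of that sketch, written out. This is my own working under the usual forward-gradient setup and is not copied from the linked note, which may present it differently. Assumed notation: $\nabla \in \mathbb{R}^n$ is the true gradient, $v \sim \mathcal{N}(0, I_n)$ is the random direction, and $g = (\nabla^\top v)\, v$ is the forward gradient.

```latex
\begin{align*}
g_i &= \sum_{j} \nabla_j v_j v_i \\
\mathbb{E}[g_i] &= \sum_{j} \nabla_j \,\mathbb{E}[v_j v_i] = \nabla_i \\
\mathbb{E}[g_i^2] &= \sum_{j,k} \nabla_j \nabla_k \,\mathbb{E}[v_j v_k v_i^2]
  = \sum_{j \neq i} \nabla_j^2 + 3\,\nabla_i^2
  = \lVert \nabla \rVert^2 + 2\,\nabla_i^2 \\
\operatorname{Var}(g_i) &= \mathbb{E}[g_i^2] - \nabla_i^2
  = \lVert \nabla \rVert^2 + \nabla_i^2
\end{align*}
```

The cross terms with $j \neq k$ vanish because the entries of $v$ are independent with zero mean, and the factor of 3 is the fourth moment of a standard normal. For large $n$ the $\nabla_i^2$ term is small next to $\lVert \nabla \rVert^2$, which is why every entry's variance is about the whole squared norm.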

Informally, we say that "noisy gradients" are bad and slow down learning.

So I looked at the "signal to noise ratio" between the true gradient value and the variance of the estimate.

It's bad! If you're scaling your gradients properly, it gets worse as you add parameters.
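To make the scaling concrete, here is one way to write it down. The definition of SNR below (squared true entry over the variance of its estimate) is an assumption of mine and may not match the note's exact definition; the variance formula is the one for $g = (\nabla^\top v)\, v$ with $v \sim \mathcal{N}(0, I_n)$.

```latex
\begin{align*}
\mathrm{SNR}_i &= \frac{\nabla_i^2}{\operatorname{Var}(g_i)}
  = \frac{\nabla_i^2}{\lVert \nabla \rVert^2 + \nabla_i^2} \\
\text{if } \nabla_i^2 \approx \tfrac{1}{n}\lVert \nabla \rVert^2 \text{ for all } i:\quad
\mathrm{SNR}_i &\approx \frac{1}{n + 1}
\end{align*}
```

Under that assumption of comparably sized entries, the per-entry SNR shrinks like $1/n$ as the parameter count $n$ grows.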

(FYI, I sanity-checked my result by pulling gradients from a PyTorch MNIST example and checking the true gradient's squared norm against the average variance of each entry, which should be nearly equal. And they were super close!)
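A minimal sketch of the same kind of sanity check, using only the Python standard library. It is not the PyTorch MNIST check described above: `true_grad` is a hypothetical random vector standing in for a real gradient, and the function names are made up for illustration. It samples forward gradients g = (grad . v) v with v drawn from a unit normal and compares the average per-entry variance to the prediction ||grad||^2 * (1 + 1/n).

```python
import random


def forward_gradient(grad, rng):
    """One forward-gradient sample g = (grad . v) v with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in grad]
    directional = sum(g_j * v_j for g_j, v_j in zip(grad, v))
    return [directional * v_i for v_i in v]


def empirical_mean_and_variance(grad, n_samples, seed=0):
    """Per-entry mean and variance of the forward gradient over n_samples draws."""
    rng = random.Random(seed)
    total = [0.0] * len(grad)
    total_sq = [0.0] * len(grad)
    for _ in range(n_samples):
        for i, g_i in enumerate(forward_gradient(grad, rng)):
            total[i] += g_i
            total_sq[i] += g_i * g_i
    mean = [t / n_samples for t in total]
    var = [ts / n_samples - m * m for ts, m in zip(total_sq, mean)]
    return mean, var


# Hypothetical stand-in for a real gradient (assumption: 16 parameters).
setup_rng = random.Random(1)
true_grad = [setup_rng.gauss(0.0, 1.0) for _ in range(16)]
sq_norm = sum(g * g for g in true_grad)

mean, var = empirical_mean_and_variance(true_grad, n_samples=20000)
avg_var = sum(var) / len(var)
# Averaging Var(g_i) = ||grad||^2 + grad_i^2 over i gives ||grad||^2 * (1 + 1/n).
predicted_avg = sq_norm * (1.0 + 1.0 / len(true_grad))

print("squared norm of true gradient:", sq_norm)
print("average per-entry variance:   ", avg_var)
print("predicted average variance:   ", predicted_avg)
```

With a real model you would swap `true_grad` for a flattened gradient; pure-Python loops get slow beyond a few thousand parameters, so a vectorized version would be the practical choice there.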

I give some intuitions for the variance, and for the general distribution of the forward gradients (g), based on product distributions and large random vectors.

In that paragraph I mention some simulations (related to the sanity check above). I didn't include the plots, but here they are! The alignment between the forward grad and the true gradient is all over the place -- and way worse than randomness from minibatch effects.
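A small simulation in the same spirit, as a sketch rather than the simulation behind those plots. It measures the cosine similarity between single forward-gradient samples and a hypothetical random "true" gradient at a few parameter counts; the function names, sample counts, and sizes are my own choices. Because g = (grad . v) v, the cosine works out to |cos(v, grad)|, so it is never negative but is typically small when there are many parameters.

```python
import math
import random


def alignment(grad, rng):
    """Cosine similarity between one forward-gradient sample and grad."""
    v = [rng.gauss(0.0, 1.0) for _ in grad]
    directional = sum(g_j * v_j for g_j, v_j in zip(grad, v))
    fwd = [directional * v_i for v_i in v]
    dot = sum(f * g for f, g in zip(fwd, grad))
    norm_fwd = math.sqrt(sum(f * f for f in fwd))
    norm_grad = math.sqrt(sum(g * g for g in grad))
    return dot / (norm_fwd * norm_grad)


def alignment_stats(n_params, n_samples=500, seed=0):
    """Mean, min, and max alignment over n_samples forward-gradient draws."""
    rng = random.Random(seed)
    # Hypothetical stand-in for a real gradient.
    grad = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
    cosines = [alignment(grad, rng) for _ in range(n_samples)]
    return sum(cosines) / n_samples, min(cosines), max(cosines)


results = {n: alignment_stats(n) for n in (10, 100, 1000)}
for n, (mean_cos, lowest, highest) in results.items():
    print(f"n={n}: mean={mean_cos:.3f} min={lowest:.3f} max={highest:.3f}")
```

The spread between min and max at small n, and the drop in the mean as n grows, is the "all over the place" behavior in miniature.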

More could've been said about the weaknesses of FG in the paper, but I don't think it's a useless idea.

So I wrote some suggestions. For example, if you already have a good prior about the gradient direction, maybe you could sample from it instead of a unit normal?
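One consequence worth keeping in mind if you try that, sketched under my own assumptions rather than taken from the note: suppose the direction is drawn as $v \sim \mathcal{N}(\mu, \Sigma)$ (a hypothetical parameterization of the "prior") and the estimator is still $g = (\nabla^\top v)\, v$.

```latex
\begin{align*}
\mathbb{E}[g] &= \mathbb{E}[v v^\top]\, \nabla
  = \left(\Sigma + \mu \mu^\top\right) \nabla
\end{align*}
```

So the estimate is unbiased only when $\Sigma + \mu\mu^\top = I_n$; otherwise its expectation is a linearly transformed gradient, which may or may not be acceptable depending on how the prior is chosen.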

@theshawwn i saw you expressing interest in the forward gradient stuff and reasonable skepticism about the value of MNIST experiments

this is a fairly rigorous argument that the gradient noise is too high for fwd grads, as is, to work in large models

For more details, especially on the derivation of the variance, see this short note I wrote up: charlesfrye.github.io/pdfs/SNR-Forwa…
