Ethan
trying to feel the magic. Cofounder/directing research at @leonardoai_ (now at @canva).
Mar 29 · 6 tweets · 2 min read
Compared to most optimizer research, Muon comes a bit out of left field. I thought I'd share some notes on what might be happening under the hood, as it doesn't appear to be traditional preconditioning. Instead, my guess is that it lies in amplifying noise and maintaining relativity.

Weight distributions of trained models move quite a lot from initialization. Many weights don't stray far from the init range, but a sizable portion deviate pretty far.
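For context (my own sketch, not from the thread): the core of Muon's update is to orthogonalize the momentum matrix, pushing all of its singular values toward 1 via an odd-polynomial Newton–Schulz iteration. That is one concrete mechanism by which small, noisy update directions get amplified relative to the dominant ones. The function name is mine; the quintic coefficients follow the public Muon implementation, so treat this as illustrative rather than definitive.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Approximate the orthogonal factor U V^T of G's SVD using the
    # quintic Newton-Schulz iteration from the public Muon code.
    # Singular values of the output cluster near 1 rather than
    # matching G's original spectrum.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize first
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # odd polynomial acts on singular values only
    return X.T if transposed else X
```

Because the iteration is an odd polynomial in the singular values, it leaves the singular vectors untouched: only the relative scaling of directions changes.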
Feb 1, 2024 · 17 tweets · 7 min read
OP is correct that the SD VAE deviates from typical behavior.

But there are several things wrong with their line of reasoning, and the alarm-sounding is really unnecessary. I did some investigations in this thread to show you can rest assured: it's really not a big deal.

First of all, the irregularity of the VAE is mostly intentional. The KL term typically allows for more navigable latent spaces and more semantic compression: it ensures that nearby points map to similar images. In the extreme, the VAE can itself actually be a generative model.
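For reference (my sketch, not from the thread): the KL term being discussed is the standard VAE regularizer, the closed-form KL divergence between the encoder's diagonal Gaussian N(mu, sigma²) and the standard normal prior N(0, I). Pulling every latent toward the same prior is what makes nearby latent points decode to similar images. The function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    # latent dimensions. sigma^2 = exp(log_var). This is the
    # regularizer added to the reconstruction loss in a VAE.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The term is zero exactly when the posterior matches the prior (mu = 0, sigma = 1) and grows as the encoder drifts away from it, which is why weakening its weight (as SD does) yields a less "typical", higher-variance latent space.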