Ethan
trying to feel the magic. Cofounder/directing research at @leonardoai_ (now at @canva).
Mar 29 · 6 tweets · 2 min read
Compared to most optimizer research, Muon comes a bit out of left field. I thought I'd share some notes on what might be happening under the hood, as it doesn't appear to be traditional preconditioning. Instead, my guess is that it lies in amplifying noise and maintaining relativity.

Weight distributions of trained models move quite a lot from initialization. Many weights don't stray far from the init range, but a sizable portion deviate pretty far.
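For context (my own sketch, not from the thread): the core of Muon's update is to orthogonalize the momentum matrix, pushing all of its singular values toward 1 via an odd-polynomial Newton–Schulz iteration. That is one concrete mechanism by which small, noisy update directions get amplified relative to the dominant ones. The function name is mine; the quintic coefficients follow the public Muon implementation, so treat this as illustrative rather than definitive.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Approximate the orthogonal factor U V^T of G's SVD using the
    # quintic Newton-Schulz iteration from the public Muon code.
    # Singular values of the output cluster near 1 rather than
    # matching G's original spectrum.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize first
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # odd polynomial acts on singular values only
    return X.T if transposed else X
```

Because the iteration is an odd polynomial in the singular values, it leaves the singular vectors untouched: only the relative scaling of directions changes.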
Feb 1, 2024 · 17 tweets · 7 min read
OP is correct that the SD VAE deviates from typical behavior.

But there are several things wrong with their line of reasoning, and the alarm-sounding is really unnecessary. I did some investigations in this thread to show you can rest assured: it's really not a big deal.

First of all, the irregularity of the VAE is mostly intentional. The KL term typically allows for more navigable latent spaces and more semantic compression: it ensures that nearby points map to similar images. In the extreme, the VAE can itself actually be a generative model.
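For reference (my sketch, not from the thread): the KL term being discussed is the standard VAE regularizer, the closed-form KL divergence between the encoder's diagonal Gaussian N(mu, sigma²) and the standard normal prior N(0, I). Pulling every latent toward the same prior is what makes nearby latent points decode to similar images. The function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    # latent dimensions. sigma^2 = exp(log_var). This is the
    # regularizer added to the reconstruction loss in a VAE.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The term is zero exactly when the posterior matches the prior (mu = 0, sigma = 1) and grows as the encoder drifts away from it, which is why weakening its weight (as SD does) yields a less "typical", higher-variance latent space.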