Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Keenan Crane

@keenanisalive

Sep 6 • 14 tweets • 4 min read • Read on X

“Everyone knows” what an autoencoder is… but there's an important complementary picture missing from most introductory material.

In short: we emphasize how autoencoders are implemented—but not always what they represent (and some of the implications of that representation).🧵

A similar thing happens when (many) people learn linear algebra:

They confuse the representation (matrices) with the objects represented by those matrices (linear maps… or is it a quadratic form?)

With autoencoders, the first (and last) picture we see often looks like this one: a network architecture diagram, where inputs get “compressed”, then decoded.

If we're lucky, someone bothers to draw arrows that illustrate the main point: we want the output to look like the input!

This picture is great if you want to simply close your eyes and implement something.

But suppose your implementation doesn't work—or you want to squeeze more performance out of your data.

Is there another picture that helps you think about what's going on?

(Obviously: yes!)

Here's a way of visualizing the maps *defined by* an autoencoder.

The encoder f maps high-dimensional data x to low-dimensional latents z. The decoder tries to map z back to x. We *always* learn a k-dimensional submanifold M, which is reliable only where we have many samples z.

In regions where we don't have many samples, the decoder g isn't reliable: we're basically extrapolating (i.e., guessing) what the true data manifold looks like.

The diagram suggests this idea by “cutting off” the manifold—but in reality there’s no clear, hard cutoff.

Another thing made clear by this picture is that, no matter what the true dimension of the data might be, the manifold M predicted by the decoder generically has the same dimension as the latent space: it's the image of R^k under g.

So, the latent dimension is itself a prior.

It should also be clear that, unless the reconstruction loss is exactly zero, the learned manifold M only approximates (rather than interpolates) the given data. For instance, x does not sit on M, even though x̂ does.

(If M does interpolate all xᵢ, you're probably overfitting)

Finally, a natural question raised by this picture is: how do I sample/generate new latents z? For a “vanilla” autoencoder, there's no simple a priori description of the high-density regions.

This situation motivates *variational* autoencoders (which are a whole other story…).

Personally, I find both of these diagrams a little bit crowded—here's a simpler “representation” diagram, with fewer annotations (that might anyway be better explained in accompanying text).

Likewise, here's a simpler “implementation” diagram, that still retains the most important idea of an *auto*-encoder, namely, that you're comparing the output against *itself*.

If you want to use or repurpose these diagrams, the source files (as PDF) can be found at

cs.cmu.edu/~kmcrane/Autoe…

Of course, there will be those who say that the representation diagram is “obvious,” and “that's what everyone has in their head anyway.”

If so… good for you! If not, I hope this alternative picture provides some useful insight as you hack in this space.

[End 🧵]

https://x.com/keenanisalive/status/1961559905850028341

P.S. I should also mention that these diagrams were significantly improved via feedback from many folks from here and elsewhere.

Hopefully they account for some of the gripes—if not, I'm ready for the next batch! 😉

https://x.com/keenanisalive/status/1961559905850028341

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @keenanisalive

Keenan Crane

@keenanisalive

Aug 29

I can't* fathom why the top picture, and not the bottom picture, is the standard diagram for an autoencoder.

The whole idea of an autoencoder is that you complete a round trip and seek cycle consistency—why lay out the network linearly?

*Of course I do in reality know why people use this diagram: it fits into a common visual language used for neural networks.

But it misses some critical features (like cycle consistency). And often adds other nutty stuff—like drawing functions as complete bipartite graphs!

Like, yeah, I know that for a function from ℝⁿ to ℝᵐ each output can depend on any of the inputs. That's how a function works!

Maybe you can use some of the space in that diagram to help me understand what those particular functions mean, or aim to do?

Read 6 tweets

Keenan Crane

@keenanisalive

May 21

Fun new paper at #SIGGRAPH2025:

What if instead of two 6-sided dice, you could roll a single "funky-shaped" die that gives the same statistics (e.g, 7 is twice as likely as 4 or 10).

Or make fair dice in any shape—e.g., dragons rather than cubes?

That's exactly what we do! 1/n

Here's the paper, which is an industry-funded collaboration between my PhD student Hossein Baktash at @SCSatCMU, @nmwsharp at @nvidia, and Qingnan Zhou & @_AlecJacobson at @AdobeResearch.

2/n cs.cmu.edu/~kmcrane/Proje…

In a nutshell, we show that the resting poses & statistics of a rigid body are easily computed from its geometry, without any dynamical simulation.

This simple geometric model enables us differentiate through dice designs, & optimize their shapes to match target statistics. 3/n

Read 5 tweets

Keenan Crane

@keenanisalive

Apr 17

Here's a nice "proof without words":

The sum of the squares of several positive values can never be bigger than the square of their sum.

This picture helps make sense of how ℓ₁ and ℓ₂ norms regularize and sparsify solutions (resp.). [1/n]

These pictures are often batting around in my brain when I think about optimization/learning problems, but can take some time to communicate to students, etc. So, I thought I'd make some visualizations. [2/n]

Suppose we minimize the squared length of a vector x, equal to the sum of squares of its components.

To avoid the trivial solution x=0, we'll also require that the components sum to a nonzero value.

Equivalently: minimize the ℓ₂ norm ‖x‖₂, subject to ‖x‖₁=1. [3/n]

Read 13 tweets

Keenan Crane

@keenanisalive

Apr 7

We often use discretization to approximate continuous laws of physics, but it also goes the other way:

You can use continuous equations to approximate the behavior of discrete systems!

Here we'll see how electrical circuits can be modeled using the Laplace equation Δφ=0. [1/n]

The Laplacian Δ is central to numerous (continuous) physical equations like the heat equation, the wave equation, and so on.

I have a whole video about it here: [2/n]

The discrete or graph Laplacian L is typically viewed as a numerical approximation of Δ, giving the difference between the value ui at a node of a graph, and a weighted average of uj at all neighbors j:

(Lu)_i := Σ_j w_ij (ui - uj)

Here w_ij are edge weights. [3/n]

Read 23 tweets

Keenan Crane

@keenanisalive

Dec 9, 2024

Entropy is one of those formulas that many of us learn, swallow whole, and even use regularly without really understanding.

(E.g., where does that “log” come from? Are there other possible formulas?)

Yet there's an intuitive & almost inevitable way to arrive at this expression.

When I first heard about entropy, there was a lot of stuff about "free states" and "disorder." Or about the number of bits needed to communicate a message.

These are ultimately important connections—but it's not clear it's the best starting point for the formula itself.

A better starting point is the idea of "surprise."

In particular, suppose an event occurs with probability p. E.g., if the bus shows up on time about 15% of the time, p = 0.15.

How *surprising*, then, is an event with probability p? Let's call this quantity S(p).

Read 17 tweets

Keenan Crane

@keenanisalive

Nov 14, 2024

We often think of an "equilibrium" as something standing still, like a scale in perfect balance.

But many equilibria are dynamic, like a flowing river which is never changing—yet never standing still.

These dynamic equilibria are nicely described by so-called "detailed balance"

In simple terms, detailed balance says that if you have less "stuff" at point x, and more "stuff" at point y, then to maintain a dynamic equilibrium, the fraction of stuff that moves from x to y needs to be bigger than the fraction that moves from y to x.

Detailed balance is also the starting point for algorithms that efficiently generate samples of a given distribution, called "Markov chain Monte Carlo" algorithms.

The idea is to "design" a random procedure that has the target distribution as the equilibrium distribution.

Read 5 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Keenan Crane

Try unrolling a thread yourself!

More from @keenanisalive

Keenan Crane

Keenan Crane

Keenan Crane

Keenan Crane

Keenan Crane

Keenan Crane

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!