What do polar coordinates, the polar factorization of matrices, & the Helmholtz decomposition of a vector field have in common?
They’re all implied by Brenier’s Theorem: a cornerstone of Optimal Transport theory. It’s a fundamental decomposition result & really deserves to be better known.
1/7
Brenier's Thm ('91): A non-degenerate vector field
u: Ω ⊂ ℝⁿ → ℝⁿ has a unique decomposition
u = ∇ϕ∘s
where ϕ is a convex potential on Ω, and s is measure-preserving (it pushes the reference measure on Ω to itself).
Here s is a multi-dimensional “rearrangement” (a sort in 1D)
2/7
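Here's a tiny illustrative sketch (my own, not from the thread) of what the factorization looks like in 1-D on a uniform grid: the sorted values are nondecreasing, so they play the role of ∇ϕ (a monotone map is the gradient of a convex function in 1-D), and the rank permutation plays the role of the measure-preserving s.

```python
import numpy as np

# Illustrative 1-D discretization of u = (grad phi) o s on a uniform grid.
# The sorted values are nondecreasing (a discrete "gradient of a convex phi"),
# and the rank permutation is measure-preserving (a relabeling of equal-mass
# grid points). All names here are mine, chosen for illustration.
rng = np.random.default_rng(0)
u = rng.normal(size=8)              # values of the "vector field" on the grid

grad_phi = np.sort(u)               # monotone part: nondecreasing values
s = np.argsort(np.argsort(u))       # rank permutation: the rearrangement

assert np.allclose(u, grad_phi[s])  # u = grad_phi o s, pointwise on the grid
```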
In optimal transport, Brenier's thm implies existence, uniqueness & monotonicity of the OT map w.r.t. the L₂ cost between two given densities p(x) & q(y). Let
u: ℝⁿ → ℝⁿ & c(x,y) = ‖x − y‖².
Then the optimal map taking p to q is u = ∇ϕ, where ϕ is convex.
Brenier proved a weaker result in '87 in a short note in French, but later, in '91, published the definitive version in Communications on Pure and Applied Mathematics.
* It's a wonderful paper, well worth reading both on its own merits & to learn the special cases *
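Going back to the OT statement above, here's a hedged toy illustration in 1-D with empirical distributions (my own example, with illustrative sample sizes and distributions): for squared cost, the optimal map pairs sorted samples with sorted samples, and the resulting map is monotone, i.e. the derivative of a convex ϕ.

```python
import numpy as np

# Toy 1-D example: with c(x,y) = |x - y|^2 and equal numbers of samples from
# p and q, the optimal map sends the i-th smallest x to the i-th smallest y.
# The map is nondecreasing, so in 1-D it is the derivative of a convex phi.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)                        # samples from p
y = rng.gamma(shape=2.0, scale=1.0, size=1000)   # samples from q

x_sorted, y_sorted = np.sort(x), np.sort(y)
T = np.interp(x, x_sorted, y_sorted)             # Monge map T evaluated at the samples of p

# Monotonicity check: T is nondecreasing along increasing x
assert np.all(np.diff(T[np.argsort(x)]) >= 0)
```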
The choice of nonlinear activation functions in neural networks can be tricky and important.
That's because iterating (i.e. repeatedly composing) even simple nonlinear functions can lead to unstable, or outright chaotic, behavior, with something as simple as a quadratic.
1/n
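A quick sketch of that claim about quadratics (an illustrative example of my own, using the logistic map with a = 3.9): two nearby starting points end up completely decorrelated after a few dozen iterations.

```python
import numpy as np

# Iterating a simple quadratic, the logistic map f(x) = a*x*(1-x), shows
# sensitive dependence on initial conditions for a = 3.9 (illustrative choice).
def iterate(f, x, n):
    xs = [x]
    for _ in range(n):
        xs.append(f(xs[-1]))
    return np.array(xs)

f = lambda x: 3.9 * x * (1 - x)
a = iterate(f, 0.500000, 50)
b = iterate(f, 0.500001, 50)   # perturb the 6th decimal place
print(np.abs(a - b)[-5:])      # differences are O(1): the orbits have diverged
```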
Some activations are more well-behaved than others. Take ReLU for example:
r(x) = max{0,x}
its iterates are completely benign: r⁽ⁿ⁾(x) = r(x), so we don't have to worry.
Most other activations like soft-plus are less benign, but still change gently with composition.
2/n
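The ReLU claim is just idempotence, easy to check numerically (a minimal sketch):

```python
import numpy as np

# ReLU is idempotent: composing it with itself any number of times changes nothing.
r = lambda x: np.maximum(0.0, x)
x = np.linspace(-3, 3, 7)
assert np.allclose(r(r(r(x))), r(x))   # r^(n) = r for any n >= 1
```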
Soft-plus:
s(x) = log(eˣ + 1)
has a special property: its n-fold self-composition is really simple
s⁽ⁿ⁾(x) = log(eˣ + n)
With each iteration, s⁽ⁿ⁾(x) changes gently for all x.
This form is rare -- most activations don't have nice closed-form iterates like this
3/n
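The closed form is easy to verify numerically (a minimal sketch; the iteration count n = 5 is just an illustrative choice):

```python
import numpy as np

# Check the closed form for the n-fold self-composition of softplus:
# s(x) = log(exp(x) + 1)  and  s^(n)(x) = log(exp(x) + n).
s = lambda x: np.log(np.exp(x) + 1.0)

x = np.linspace(-4, 4, 9)
z, n = x.copy(), 5
for _ in range(n):
    z = s(z)                                   # iterate n times
assert np.allclose(z, np.log(np.exp(x) + n))   # matches log(e^x + n)
```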
Tweedie's formula is super important in diffusion models & is also one of the cornerstones of empirical Bayes methods.
Given how easy it is to derive, it's surprising how recently it was discovered (the 1950s). It was published a while later, after Tweedie wrote to Robbins about it.
1/n
The MMSE denoiser is known to be the conditional mean f̂(y) = 𝔼(x|y). In this case, we can write the expression for this conditional mean explicitly:
2/n
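For concreteness, here is one standard way to write that conditional mean, assuming additive Gaussian noise y = x + n with n ~ N(0, σ²I) (the Gaussian assumption is mine, matching the usual diffusion/denoising setting):

```latex
% Bayes' rule writes the conditional mean as a ratio whose denominator is p(y):
\mathbb{E}(x \mid y)
  = \frac{\int x\, p(y \mid x)\, p(x)\, dx}{\int p(y \mid x)\, p(x)\, dx}
  = \frac{\int x\, p(y \mid x)\, p(x)\, dx}{p(y)} .
% With Gaussian noise, p(y|x) \propto \exp(-\|y-x\|^2 / 2\sigma^2); differentiating
% p(y) under the integral sign gives Tweedie's formula:
\mathbb{E}(x \mid y) = y + \sigma^2\, \nabla_y \log p(y).
```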
Note that the normalizing term in the denominator is the marginal density of y.
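And a quick numerical sanity check of Tweedie's formula for a scalar Gaussian prior (my own toy example; all numbers are illustrative): the conjugate-Gaussian posterior mean matches y + σ² d/dy log p(y).

```python
import numpy as np

# Scalar toy check of Tweedie's formula, assuming x ~ N(mu0, tau^2) and
# y = x + noise with noise ~ N(0, sigma^2).
mu0, tau, sigma = 1.0, 2.0, 0.5
y = np.linspace(-3, 5, 9)

# Conjugate-Gaussian posterior mean E[x|y]:
post_mean = (mu0 * sigma**2 + y * tau**2) / (tau**2 + sigma**2)

# Tweedie: y + sigma^2 * d/dy log p(y), where the marginal is p(y) = N(mu0, tau^2 + sigma^2)
score = -(y - mu0) / (tau**2 + sigma**2)     # d/dy log p(y) for the Gaussian marginal
tweedie = y + sigma**2 * score

assert np.allclose(post_mean, tweedie)
```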
Images aren’t arbitrary collections of pixels; even small ones have complicated structure. That’s why it’s hard to generate images well. Let me give you an idea:
3×3 grayscale patches of natural images, represented as points in ℝ⁹, lie approximately on a 2-D manifold: the Klein bottle!
1/4
Images can be thought of as vectors in a high-dimensional space. It has long been hypothesized that images live on low-dim manifolds (hence manifold learning). It’s a reasonable assumption: images of the world are not arbitrary. The low-dim structure arises from physical constraints & laws
2/4
But this doesn’t mean the “low-dimensional” manifold has a simple or intuitive structure, even for tiny images. This classic paper by Gunnar Carlsson gives a lovely overview of the structure of data generally (and images in particular). Worthwhile reading.
Michael Jordan gave a short, excellent, and provocative talk recently in Paris - here are a few key ideas
- It's all just machine learning (ML) - the AI moniker is hype
- The late Dave Rumelhart should've received a Nobel prize for his early ideas on making backprop work
1/n
The "Silicon Valley Fever Dream" is that data will create knowledge, which will lead to super intelligence, and a bunch of people will get very rich.....
2/n
.... yet the true value of technologies like LLMs is that we're getting the benefit of interacting with the collective knowledge of many, many individuals - it's not that we will produce one single uber-intelligent being
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?
I wrote a thread about this late last year. I'll repeat it here and include a link to the slides at the end of the thread.
1/n
In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. Classic KR is based on interpolation kernels that are a function of the positions (x) of the samples, not of their values (y).
2/n
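A minimal sketch of that classic (Nadaraya-Watson) form, with an illustrative Gaussian kernel and bandwidth h (the function name, data, and h = 0.3 are my choices): note the weights depend only on the x's.

```python
import numpy as np

# Classic kernel regression: weights depend only on sample positions x_i,
# through a fixed bandwidth h, never on the observed values y_i.
def nadaraya_watson(x_query, x, y, h=0.3):
    # Gaussian kernel weights between each query point and each sample position
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)          # weighted average of the y's

x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.2 * np.random.default_rng(0).normal(size=x.size)
x_query = np.linspace(0, 2 * np.pi, 200)
y_hat = nadaraya_watson(x_query, x, y)      # smooth pointwise fit
```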
Instead of a fixed smoothing parameter h, we can adjust it dynamically based on the local density of samples near the point of interest. This accounts for variations in the spatial distribution of the samples, but still doesn't take the values of the samples into account.
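One hedged sketch of such an adaptive variant (k-nearest-neighbor bandwidths; the choice k = 7 and the Gaussian kernel are illustrative assumptions, not from the thread): h shrinks where samples are dense and grows where they are sparse, but the weights still ignore the y values.

```python
import numpy as np

# Adaptive-bandwidth kernel regression: the bandwidth at each query point is
# the distance to its k-th nearest sample, so smoothing adapts to local
# sample density while the weights still depend only on positions.
def adaptive_kernel_regression(x_query, x, y, k=7):
    d = np.abs(x_query[:, None] - x[None, :])   # |x_query - x_i|
    h = np.sort(d, axis=1)[:, k]                # local bandwidth per query point
    w = np.exp(-0.5 * (d / h[:, None]) ** 2)    # weights ignore the y values
    return (w @ y) / w.sum(axis=1)
```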
Years ago, when my wife and I were planning to buy a home, my dad stunned me with a quick mental calculation of the loan payments.
I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th century Iran.
1/4
The origin of the formula my dad knew is a mystery, but I know it has been used in the bazaars of Iran (and elsewhere) for as long as anyone can remember.
It has an advantage: it's very easy to compute on an abacus. The exact compounding formula is much more complicated.
2/4
I figured out how the two formulae relate: the historical formula is the Taylor series of the exact formula around r = 0.
But the crazy thing is that the old Persian formula predates Taylor by hundreds (maybe thousands) of years, having been passed down for generations.
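For illustration, here is what that kind of expansion looks like for the standard amortization payment (P = principal, n = number of payments, r = per-period rate); whether this is exactly the historical rule is an assumption on my part, not something stated above.

```latex
% Exact per-period payment for a loan of principal P, n payments, rate r:
M_{\text{exact}} = P\,\frac{r(1+r)^n}{(1+r)^n - 1}
% First-order Taylor expansion around r = 0 (an abacus-friendly approximation):
M_{\text{exact}} \approx \frac{P}{n} + P\,r\,\frac{n+1}{2n}
% i.e. repay the principal in equal parts, plus interest on roughly half of it.
```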