Peyman Milanfar
Distinguished Scientist at Google. Computational Imaging, Machine Learning, and Vision. Tweets = personal opinions. May change or disappear over time.
Aug 22 5 tweets 4 min read
Yesterday at the @madebygoogle event we launched "Pro Res Zoom" on the Pixel 10 Pro series. I wanted to share a little more detail, some examples, and use cases. The feature enables a combined optical + digital zoom up to 100x magnification. It builds on our 5x optical tele camera.

1/n Shooting at magnifications well above 30x requires that the 5x optical capture be adapted and optimized for such conditions, yielding a high-quality crop that's fed to our upscaler. The upscaler is a large enough model to understand some semantic context and try to minimize distortions.

2/n [example images]
Jul 14 5 tweets 2 min read
Receiver Operating Characteristic (ROC) got its name in WWII from Radar, invented to detect enemy aircraft and ships.

I find it much more intuitive than precision/recall. ROC curves show true positive rate vs false positive rate, parametrized by a detection threshold.

1/n ROC curves show the performance tradeoffs in a binary hypothesis test like this:

H₁: signal present
H₀: signal absent

From a data vector x, we can write the ROC directly in terms of x. But typically a test statistic T(x) is computed and compared to a threshold γ

2/n Image
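A minimal numerical sketch of tracing such an ROC curve by sweeping the threshold γ (the Gaussian test statistic below is an illustrative assumption, not from the thread):

```python
import numpy as np

# Illustrative setup: the test statistic T(x) is Gaussian under both hypotheses,
# with a mean shift when the signal is present.
rng = np.random.default_rng(0)
T_H0 = rng.normal(0.0, 1.0, 100_000)   # T under H0 (signal absent)
T_H1 = rng.normal(1.5, 1.0, 100_000)   # T under H1 (signal present)

# Sweep the detection threshold gamma and record (FPR, TPR) pairs.
gammas = np.linspace(-4, 6, 200)
fpr = [(T_H0 > g).mean() for g in gammas]   # false positive rate
tpr = [(T_H1 > g).mean() for g in gammas]   # true positive rate
# Each threshold gives one point; plotting tpr vs fpr traces out the ROC curve.
```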
Apr 30 7 tweets 2 min read
The choice of nonlinear activation functions in neural networks can be tricky and important.

That's because iterating (i.e. repeatedly composing) even simple nonlinear functions can lead to unstable or even chaotic behavior - even with something as simple as a quadratic.

1/n Image Some activations are more well-behaved than others. Take ReLU for example:

r(x) = max{0,x}

its iterates are completely benign: r⁽ⁿ⁾(x) = r(x), so we don't have to worry.

Most other activations like soft-plus are less benign, but still change gently with composition.

2/n
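A quick numerical sketch of the contrast (the logistic-map quadratic below is just an illustrative choice of an unstable nonlinearity):

```python
import numpy as np

def iterate(f, x0, n):
    """Repeatedly compose f with itself, starting from x0."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

relu = lambda x: np.maximum(0.0, x)
quad = lambda x: 3.9 * x * (1.0 - x)   # logistic map: a simple quadratic, chaotic here

x0 = np.array([0.2, 0.2000001])        # two nearly identical inputs
print(iterate(relu, x0, 50))           # identical outputs: relu^(n) = relu, benign
print(iterate(quad, x0, 50))           # wildly different: sensitive dependence (chaos)
```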
Mar 18 8 tweets 3 min read
Tweedie's formula is super important in diffusion models & is also one of the cornerstones of empirical Bayes methods.

Given how easy it is to derive, it's surprising how recently it was discovered ('50s). It was published a while later, after Tweedie wrote to Robbins about it.

1/n Image The MMSE denoiser is known to be the conditional mean f̂(y) = 𝔼(x|y). In this case, we can write the expression for this conditional mean explicitly:

2/n Image
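For reference, a reconstruction of the standard Gaussian-noise statement of Tweedie's formula in the thread's notation (the original detail is in the image; this is the usual form, with y = x + noise of variance σ²):

```latex
% Tweedie's formula: for y = x + e, with e ~ N(0, sigma^2 I),
% the MMSE denoiser (posterior mean) needs only the marginal density p(y):
\[
\hat f(y) \;=\; \mathbb{E}[x \mid y] \;=\; y \;+\; \sigma^{2}\,\nabla_{y}\log p(y).
\]
```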
Feb 16 4 tweets 2 min read
Images aren’t arbitrary collections of pixels - they have complicated structure, even small ones. That’s why it’s hard to generate images well. Let me give you an idea:

3×3 gray images represented as points in ℝ⁹ lie approximately on a 2-D manifold: the Klein bottle!

1/4 Image Images can be thought of as vectors in high-dim. It’s long been hypothesized that images live on low-dim manifolds (hence manifold learning). It’s a reasonable assumption: images of the world are not arbitrary. The low-dim structure arises due to physical constraints & laws

2/4 Image
Feb 12 4 tweets 2 min read
The Kalman Filter was once a core topic in EECS curricula. Given its relevance to ML, RL, Ctrl/Robotics, I'm surprised that most researchers don't know much about it - and end up rediscovering it. The Kalman Filter seems messy & complicated, but the intuition behind it is invaluable

1/4 Image I once had to explain the Kalman Filter in layperson terms in a legal matter with no maths. No problem - I thought. Yet despite being taught the subject by one of the greats (A.S. Willsky) & having taught the subject myself, I found this very difficult to do.

2/4 Image
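For readers who never met it, a minimal scalar sketch - a random-walk state observed in noise (the model and noise levels are illustrative assumptions, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: a scalar random walk observed in noise.
#   state:        x_k = x_{k-1} + w_k,  w_k ~ N(0, q)
#   measurement:  y_k = x_k + v_k,      v_k ~ N(0, r)
q, r = 0.01, 1.0
x_true, x_hat, P = 0.0, 0.0, 1.0        # true state, estimate, estimate variance

for _ in range(100):
    x_true += rng.normal(0, np.sqrt(q))       # simulate the state
    y = x_true + rng.normal(0, np.sqrt(r))    # simulate a noisy measurement

    P += q                                    # predict: uncertainty grows with the model
    K = P / (P + r)                           # Kalman gain: how much to trust the data
    x_hat += K * (y - x_hat)                  # update: move towards the measurement
    P *= (1 - K)                              # uncertainty shrinks after the update

print(f"final estimate {x_hat:.3f} vs true {x_true:.3f}")
```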
Feb 8 5 tweets 2 min read
Michael Jordan gave a short, excellent, and provocative talk recently in Paris - here are a few key ideas

- It's all just machine learning (ML) - the AI moniker is hype

- The late Dave Rumelhart should've received a Nobel prize for his early ideas on making backprop work

1/n Image The "Silicon Valley Fever Dream" is that data will create knowledge, which will lead to super intelligence, and a bunch of people will get very rich.....

2/n Image
Jan 26 11 tweets 4 min read
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?

I wrote a thread about this late last year. I'll repeat it here and include a link to the slides at the end of the thread.

1/n Image In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. The classic KR is based on interpolation kernels that are a function of the position (x) of the samples and not on the values (y) of the samples.

2/n Image
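A minimal Nadaraya-Watson sketch matching that description - the weights are a function of the positions x only, never the values y (the Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np

def kernel_regression(x_query, x, y, h=0.3):
    """Classic kernel regression: weights depend only on positions x, not values y."""
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)  # Gaussian kernel
    w /= w.sum(axis=1, keepdims=True)                              # normalize the weights
    return w @ y                                                   # weighted average of samples

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.normal(size=x.size)      # noisy samples of an implicit function

x_grid = np.linspace(0, 2 * np.pi, 200)
y_hat = kernel_regression(x_grid, x, y)            # point-wise fit on a grid
```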
Jan 25 5 tweets 2 min read
The softmax

σ(z) for z =[z₁,...,zₙ]ᵀ

is defined element-wise as:

σ(zᵢ) = exp(λzᵢ) / Σⱼexp(λzⱼ)

It is a "nice" function w/ very useful smoothness properties. Here are some:

1/4: Softmax is Lipschitz w/ constant λ. Its Hessian & 3rd order derivative are also Lipschitz.

2/4: The softmax function is the gradient of the log-sum-exp function, lse(z):

σ(z) = ∇ lse(z)

where

lse(z) = λ⁻¹ log[ Σⱼexp(λzⱼ) ]
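A quick numerical check of that gradient identity (a sketch; λ is the inverse temperature above):

```python
import numpy as np

lam = 2.0
z = np.array([0.3, -1.2, 0.8, 0.1])

def softmax(z):
    e = np.exp(lam * (z - z.max()))          # shift by max for numerical stability
    return e / e.sum()

def lse(z):
    return np.log(np.exp(lam * z).sum()) / lam

# Central finite-difference gradient of lse should match softmax(z).
eps = 1e-6
E = np.eye(z.size)
grad_fd = np.array([(lse(z + eps * E[i]) - lse(z - eps * E[i])) / (2 * eps)
                    for i in range(z.size)])
print(np.allclose(grad_fd, softmax(z), atol=1e-6))   # True: softmax = ∇ lse
```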
Dec 20, 2024 4 tweets 2 min read
Years ago, when my wife and I were planning to buy a home, my dad stunned me with a quick mental calculation of loan payments.

I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th century Iran.

1/4 Image The origin of the formula my dad knew is a mystery, but I know it has been used in the bazaars of Iran (and elsewhere) for as long as anyone can remember

It has an advantage: it's very easy to compute on an abacus. The exact compounding formula is much more complicated

2/4 Image
Dec 8, 2024 11 tweets 4 min read
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?

My goal is not to argue who should get credit for what, but to show a progression of closely related ideas over time and across neighboring fields.

1/n Image In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. The classic KR is based on interpolation kernels that are a function of the position (x) of the samples and not on the values (y) of the samples.

2/n Image
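As a sketch of where this is heading: a single softmax attention head is exactly a normalized (exponential, data-dependent) kernel weighting of the values - toy dimensions below, no claim about any particular architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 16, 10, 8
Q = rng.normal(size=(m, d))   # queries ~ points where we evaluate the fit
K = rng.normal(size=(n, d))   # keys    ~ sample "positions"
V = rng.normal(size=(n, d))   # values  ~ sample "values"

# Kernel-smoother form: k(q, k_j) = exp(q·k_j / sqrt(d)); output = normalized weighted average.
W = np.exp(Q @ K.T / np.sqrt(d))
W /= W.sum(axis=1, keepdims=True)
out_kernel = W @ V

# The same thing phrased as softmax attention.
def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

out_attn = softmax_rows(Q @ K.T / np.sqrt(d)) @ V
print(np.allclose(out_kernel, out_attn))   # True
```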
Dec 3, 2024 5 tweets 2 min read
“On a log-log plot, my grandmother fits on a straight line.”
-Physicist Fritz Houtermans

There's a lot of truth to this. log-log plots are often abused and can be very misleading

1/5 Image A plot of empirical data can reveal hidden phenomena or scaling. An important and common model is to look for power laws like

p(x) ≃ L(x) xᵃ

where L(x) is slowly varying, so that xᵃ is dominant

Power laws appear all over physics, biology, math, econ, etc. However...

2/5
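A small sketch of the "however": draw samples from a lognormal - which has no power-law tail - and its empirical tail still traces a convincingly straight line on a log-log plot (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)   # lognormal: NOT a power law

# Empirical CCDF P(X > x), evaluated at the sorted samples.
xs = np.sort(x)
ccdf = 1.0 - np.arange(1, xs.size + 1) / xs.size

# Straight-line fit in log-log coordinates over the upper tail.
tail = (xs > np.quantile(xs, 0.9)) & (ccdf > 0)
a, b = np.polyfit(np.log(xs[tail]), np.log(ccdf[tail]), 1)
print(f"apparent power-law exponent ≈ {a:.2f}")
# The tail looks nearly straight over a couple of decades, which is exactly
# why lognormals are routinely mistaken for power laws.
```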
Nov 10, 2024 6 tweets 3 min read
Integral geometry is a beautiful topic bridging geometry, probability & statistics

Say you have a curve with any shape, possibly even self-intersecting. How can you measure its length?

This has many applications - the curve could be a strand of DNA or a twisted length of wire

1/n Image A curve is a collection of tiny segments. Measure each segment & sum. You can go further: make the segments so small they are essentially points, count the red points

A practical way to do this: drop many lines, or a dense grid, intersecting the shape & count intersections

2/n Image
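A sketch of that line-dropping recipe, via a Monte Carlo version of the Cauchy-Crofton formula (the circle test curve and sampling box are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Test curve: a unit circle sampled as a polyline (true length = 2*pi).
t = np.linspace(0, 2 * np.pi, 201)
curve = np.stack([np.cos(t), np.sin(t)], axis=1)
A, B = curve[:-1], curve[1:]                 # polyline segments

R, N = 2.0, 10_000   # the curve fits in a disk of radius R; drop N random lines

# A line is x*cos(theta) + y*sin(theta) = p, with theta ~ U[0, pi), p ~ U[-R, R].
theta = rng.uniform(0, np.pi, N)
p = rng.uniform(-R, R, N)
normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# A segment crosses a line iff its endpoints' signed distances have opposite signs.
fa = A @ normals.T - p
fb = B @ normals.T - p
counts = ((fa * fb) < 0).sum(axis=0)         # intersections per line

# Cauchy-Crofton: length = (1/2) * measure of intersecting lines (with multiplicity),
# which for this sampling gives  length ≈ pi * R * E[#intersections].
print(f"estimated length {np.pi * R * counts.mean():.3f} vs true {2 * np.pi:.3f}")
```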
Oct 16, 2024 5 tweets 2 min read
Matrix nearness problems arise often:

Given matrix A, what’s the closest matrix that is, say, symmetric, positive definite, orthogonal, or bi-stochastic?

Symmetry:

There are many ways to symmetrize A: e.g. √AᵀA. For unitarily invariant norms, the closest is

Aₛ = (Aᵀ + A)/2

1/5
Positive Semi-Definiteness:

In the Frobenius norm (other norms are hard):

Aₚ = (Aₛ + H)/2 is the unique positive approximant of A

where Aₛ = (Aᵀ + A)/2 is the symmetric part and Aₛ = UH is its polar decomposition

Application:
Making an indefinite Hessian positive semi-definite. Useful in optimization

2/5
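A numerical sketch of both projections; for the Frobenius norm, clipping the negative eigenvalues of the symmetric part is the same as the (Aₛ + H)/2 formula:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))            # an arbitrary square matrix

# Nearest symmetric matrix: the symmetric part.
A_s = (A + A.T) / 2

# Nearest symmetric PSD matrix in the Frobenius norm:
# clip the negative eigenvalues of A_s to zero, equivalent to (A_s + H)/2 with A_s = UH.
w, V = np.linalg.eigh(A_s)
A_p = (V * np.clip(w, 0, None)) @ V.T

print(np.linalg.eigvalsh(A_p).min() >= -1e-12)   # True: A_p is positive semi-definite
```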
Sep 24, 2024 4 tweets 2 min read
Smoothing splines fit a function to data as the sol'n of a regularized least-squares optimization problem.

But it’s also possible to do it in one shot with an unusually shaped kernel (see figure)

Is it possible to solve other optimization problems this way? Surprisingly yes

1/n Image This is just one instance of how one can “kernelize” an optimization problem. That is, approximate the solution of an optimization problem in just one step by constructing and applying a kernel once to the input

Given some conditions, you can do it much more generally

2/n Image
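A small sketch of the "solve it once, read off a kernel" idea - a discrete quadratic smoother rather than an actual spline, just to show the equivalent kernel appearing (illustrative):

```python
import numpy as np

n, lam = 200, 50.0
rng = np.random.default_rng(0)
t = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=n)   # noisy samples

# Regularized least squares:  min_f ||y - f||^2 + lam * ||D f||^2,
# with D a second-difference (roughness) operator.
D = np.diff(np.eye(n), n=2, axis=0)
S = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))   # smoother matrix

f_hat = S @ y      # the whole optimization solved in one shot: apply a fixed "kernel"
row = S[n // 2]    # one row of S is the (unusually shaped) equivalent kernel at t = 0.5
```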
Sep 18, 2024 4 tweets 2 min read
Mean-shift iteratively moves points towards regions of higher density. It does so by placing a kernel at each data point, calculating the mean of the data points within that window, and shifting each point towards this mean until convergence. Look familiar?

1/n
(Animation @gabrielpeyre) The first term on the right-hand side of the ODE has the form of a pseudo-linear denoiser f(x) = W(x) x: a weighted average of the points, where the weights depend on the data. The overall mean-shift process is a lot like a residual flow:

d/dt x(t) = f(x(t)) - x(t)

2/n Image
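A minimal mean-shift sketch with a Gaussian kernel, written in the pseudo-linear form f(x) = W(x) x (the bandwidth and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two blobs in 2-D; mean-shift should drift every point towards one of the two modes.
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(+2, 0.5, (100, 2))])

def mean_shift(X, h=0.8, n_iter=50):
    Z = X.copy()
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances to data
        W = np.exp(-0.5 * d2 / h**2)                          # Gaussian kernel weights
        W /= W.sum(axis=1, keepdims=True)
        Z = W @ X        # pseudo-linear step: a data-dependent weighted average
    return Z

modes = mean_shift(X)
print(np.unique(np.round(modes, 1), axis=0))   # roughly the two cluster centers
```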
Sep 5, 2024 8 tweets 4 min read
Random matrices are very important in modern statistics and machine learning, not to mention physics

A model about which much less is known is that of matrices sampled uniformly from the set of doubly stochastic matrices: Uniformly Distributed Stochastic Matrices

A thread -

1/n
First, what are doubly stochastic matrices?
Non-negative matrices whose row & column sums=1.

The set of doubly stochastic matrices is also known as the Birkhoff polytope: an (n−1)² dimensional convex polytope in ℝⁿˣⁿ with extreme points being permutation matrices.

2/n Image
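For hands-on intuition, a sketch that lands inside the Birkhoff polytope via Sinkhorn normalization - note this is NOT a uniform sample from the polytope (uniform sampling is the hard part), just a convenient way to produce doubly stochastic matrices:

```python
import numpy as np

def sinkhorn(M, n_iter=500):
    """Alternately normalize rows and columns of a positive matrix."""
    M = M.copy()
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)   # make row sums 1
        M /= M.sum(axis=0, keepdims=True)   # make column sums 1
    return M

rng = np.random.default_rng(0)
D = sinkhorn(rng.uniform(0.1, 1.0, size=(5, 5)))

print(np.allclose(D.sum(axis=0), 1), np.allclose(D.sum(axis=1), 1))   # True True
print((D >= 0).all())                                                 # non-negative entries
```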
Sep 1, 2024 10 tweets 3 min read
The perpetually undervalued least-squares:

minₓ‖y−Ax‖²

can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.

Let's assume A is n-by-p. So we have n data points and p parameters

1/10 Image If n ≥ p (“under-fitting” or “over-determined" case) the solution is

x̃ = (AᵀA)⁻¹ Aᵀ y

But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error. We pick the minimum-norm (min ‖x‖²) solution:

x̃ = Aᵀ(AAᵀ)⁻¹ y

2/10
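The two regimes in a few lines of numpy (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-determined (n >= p): the unique least-squares solution.
n, p = 50, 10
A, y = rng.normal(size=(n, p)), rng.normal(size=n)
x_over = np.linalg.solve(A.T @ A, A.T @ y)             # (AᵀA)⁻¹ Aᵀ y

# Under-determined (n < p): infinitely many zero-error solutions;
# the minimum-norm one is Aᵀ(AAᵀ)⁻¹ y, which is what the pseudo-inverse returns.
n, p = 10, 50
A, y = rng.normal(size=(n, p)), rng.normal(size=n)
x_min_norm = A.T @ np.linalg.solve(A @ A.T, y)

print(np.allclose(A @ x_min_norm, y))                  # True: zero training error
print(np.allclose(x_min_norm, np.linalg.pinv(A) @ y))  # True: matches the pseudo-inverse
```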
Aug 26, 2024 4 tweets 2 min read
There’s a single formula that makes all of your diffusion models possible: Tweedie's

Say 𝐱 is a noisy version of 𝐮 with 𝐞 ∼ 𝒩(𝟎, σ² 𝐈)

𝐱 = 𝐮 + 𝐞

MMSE estimate of 𝐮 is 𝔼[𝐮 | 𝐱] and would seem to require P(𝐮|𝐱). Yet Tweedie says P(𝐱) is all you need

1/3 Image Tweedie is the lynchpin connecting the score function exactly to MMSE denoising residuals like so:

σ²∇logP(𝐱) = σ²∇P(𝐱)/P(𝐱) = 𝔼[𝐮|𝐱] - 𝐱

But 𝔼[𝐮|𝐱] is a specific, often inaccessible denoiser. So we replace it with a deep denoiser like UNet, etc. Pretty simple.

2/3
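A quick sanity check in the one case where everything is closed-form - a Gaussian prior, where both sides of the identity can be written down exactly (illustrative numbers):

```python
import numpy as np

tau, sigma = 2.0, 0.5    # prior std of u, noise std
x = 1.3                  # an observed noisy value

# With u ~ N(0, tau^2) and x = u + e, e ~ N(0, sigma^2):
#   x ~ N(0, tau^2 + sigma^2), so the score is d/dx log p(x) = -x / (tau^2 + sigma^2)
score = -x / (tau**2 + sigma**2)

# The MMSE denoiser E[u|x] is the usual Gaussian shrinkage.
posterior_mean = tau**2 / (tau**2 + sigma**2) * x

# Tweedie:  sigma^2 * score == E[u|x] - x
print(np.isclose(sigma**2 * score, posterior_mean - x))   # True
```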
Aug 18, 2024 7 tweets 3 min read
Two basic concepts are often conflated:

Sample Standard Deviation (SD) vs Standard Error (SE)

Say you want to estimate m=𝔼(x) from N independent samples xᵢ. A typical choice is the average or "sample" mean m̂

But how stable is this? That's what the Standard Error tells you:

1/6 Image Since m̂ is itself a random variable, we need to quantify the uncertainty around it too: this is what the Standard Error does.

The Standard Error is *not* the same as the spread of the samples - that's the Standard Deviation (SD) - but the two are closely related:

2/6 Image
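A few lines to make the distinction concrete (a sketch with synthetic samples):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
x = rng.normal(loc=3.0, scale=2.0, size=N)   # samples with true SD = 2

sd = x.std(ddof=1)          # spread of the samples themselves
se = sd / np.sqrt(N)        # uncertainty of the sample mean m_hat

print(f"SD ≈ {sd:.2f}  (stays near 2 no matter how large N gets)")
print(f"SE ≈ {se:.2f}  (shrinks like 1/sqrt(N))")

# Cross-check by simulation: the spread of many independent sample means matches SE.
means = rng.normal(3.0, 2.0, size=(10_000, N)).mean(axis=1)
print(f"empirical SD of the sample mean ≈ {means.std():.2f}")
```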
Aug 15, 2024 13 tweets 9 min read
Did you ever take a photo & wish you'd zoomed in more or framed better? When this happens, we just crop.

Now there's a better way: Zoom Enhance - a new feature my team just shipped on Pixel. Available in Google Photos under Tools, it enhances both zoomed & un-zoomed images

1/n Image Zoom Enhance is our first image-to-image diffusion model designed & optimized to run fully on-device. It allows you to crop or frame the shot you wanted, and enhance it - after capture. The input can be from any device, Pixel or not, old or new. Below are some examples & use cases

2/n [example images]