Peyman Milanfar
Distinguished Scientist at Google. Computational Imaging, Machine Learning, and Vision. Tweets = personal opinions. May change or disappear over time.
Aug 22 5 tweets 4 min read
Yesterday at the @madebygoogle event we launched "Pro Res Zoom" on the Pixel 10 Pro series. I wanted to share a little more detail, some examples, and use cases. The feature enables a combined optical + digital zoom up to 100x magnification. It builds on our 5x optical tele camera.

1/n Shooting at magnifications well above 30x requires that the 5x optical capture be adapted and optimized for such conditions, yielding a high-quality crop that's fed to our upscaler. The upscaler is a large enough model to understand some semantic context and try to minimize distortions.

2/n [example images]
Jul 14 5 tweets 2 min read
Receiver Operating Characteristic (ROC) got its name in WWII from Radar, invented to detect enemy aircraft and ships.

I find it much more intuitive than precision/recall. ROC curves show true positive rate vs false positive rate, parametrized by a detection threshold.

1/n ROC curves show the performance tradeoffs in a binary hypothesis test like this:

H₁: signal present
H₀: signal absent

From a data vector x, we can write the ROC directly in terms of x. But typically a test statistic T(x) is computed and compared to a threshold γ

2/n Image
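A minimal numerical sketch of tracing such an ROC curve by sweeping the threshold γ (the Gaussian test statistic below is an illustrative assumption, not from the thread):

```python
import numpy as np

# Illustrative setup: the test statistic T(x) is Gaussian under both hypotheses,
# with a mean shift when the signal is present.
rng = np.random.default_rng(0)
T_H0 = rng.normal(0.0, 1.0, 100_000)   # T under H0 (signal absent)
T_H1 = rng.normal(1.5, 1.0, 100_000)   # T under H1 (signal present)

# Sweep the detection threshold gamma and record (FPR, TPR) pairs.
gammas = np.linspace(-4, 6, 200)
fpr = [(T_H0 > g).mean() for g in gammas]   # false positive rate
tpr = [(T_H1 > g).mean() for g in gammas]   # true positive rate
# Each threshold gives one point; plotting tpr vs fpr traces out the ROC curve.
```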
Apr 30 7 tweets 2 min read
The choice of nonlinear activation functions in neural networks can be tricky and important.

That's because iterating (i.e. repeatedly composing) even simple nonlinear functions can lead to unstable or even chaotic behavior - even with something as simple as a quadratic.

1/n Image Some activations are more well-behaved than others. Take ReLU for example:

r(x) = max{0,x}

its iterates are completely benign: r⁽ⁿ⁾(x) = r(x), so we don't have to worry.

Most other activations like soft-plus are less benign, but still change gently with composition.

2/n
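A quick numerical sketch of the contrast (the logistic-map quadratic below is just an illustrative choice of an unstable nonlinearity):

```python
import numpy as np

def iterate(f, x0, n):
    """Repeatedly compose f with itself, starting from x0."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

relu = lambda x: np.maximum(0.0, x)
quad = lambda x: 3.9 * x * (1.0 - x)   # logistic map: a simple quadratic, chaotic here

x0 = np.array([0.2, 0.2000001])        # two nearly identical inputs
print(iterate(relu, x0, 50))           # identical outputs: relu^(n) = relu, benign
print(iterate(quad, x0, 50))           # wildly different: sensitive dependence (chaos)
```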
Mar 18 8 tweets 3 min read
Tweedie's formula is super important in diffusion models & is also one of the cornerstones of empirical Bayes methods.

Given how easy it is to derive, it's surprising how recently it was discovered ('50s). It was published a while later, after Tweedie wrote to Robbins about it.

1/n Image The MMSE denoiser is known to be the conditional mean f̂(y) = 𝔼(x|y). In this case, we can write the expression for this conditional mean explicitly:

2/n Image
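For reference, a reconstruction of the standard Gaussian-noise statement of Tweedie's formula in the thread's notation (the original detail is in the image; this is the usual form, with y = x + noise of variance σ²):

```latex
% Tweedie's formula: for y = x + e, with e ~ N(0, sigma^2 I),
% the MMSE denoiser (posterior mean) needs only the marginal density p(y):
\[
\hat f(y) \;=\; \mathbb{E}[x \mid y] \;=\; y \;+\; \sigma^{2}\,\nabla_{y}\log p(y).
\]
```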
Feb 16 4 tweets 2 min read
Images aren’t arbitrary collections of pixels - they have complicated structure, even small ones. That’s why it’s hard to generate images well. Let me give you an idea:

3×3 gray images represented as points in ℝ⁹ lie approximately on a 2-D manifold: the Klein bottle!

1/4 Image Images can be thought of as vectors in high-dim. It’s long been hypothesized that images live on low-dim manifolds (hence manifold learning). It’s a reasonable assumption: images of the world are not arbitrary. The low-dim structure arises due to physical constraints & laws

2/4 Image
Feb 12 4 tweets 2 min read
The Kalman Filter was once a core topic in EECS curricula. Given its relevance to ML, RL, Ctrl/Robotics, I'm surprised that most researchers don't know much about it - and end up rediscovering it. The Kalman Filter seems messy & complicated, but the intuition behind it is invaluable

1/4 Image I once had to explain the Kalman Filter in layperson terms in a legal matter with no maths. No problem - I thought. Yet despite being taught the subject by one of the greats (A.S. Willsky) & having taught the subject myself, I found this very difficult to do.

2/4 Image
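For readers who never met it, a minimal scalar sketch - a random-walk state observed in noise (the model and noise levels are illustrative assumptions, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: a scalar random walk observed in noise.
#   state:        x_k = x_{k-1} + w_k,  w_k ~ N(0, q)
#   measurement:  y_k = x_k + v_k,      v_k ~ N(0, r)
q, r = 0.01, 1.0
x_true, x_hat, P = 0.0, 0.0, 1.0        # true state, estimate, estimate variance

for _ in range(100):
    x_true += rng.normal(0, np.sqrt(q))       # simulate the state
    y = x_true + rng.normal(0, np.sqrt(r))    # simulate a noisy measurement

    P += q                                    # predict: uncertainty grows with the model
    K = P / (P + r)                           # Kalman gain: how much to trust the data
    x_hat += K * (y - x_hat)                  # update: move towards the measurement
    P *= (1 - K)                              # uncertainty shrinks after the update

print(f"final estimate {x_hat:.3f} vs true {x_true:.3f}")
```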
Feb 8 5 tweets 2 min read
Michael Jordan gave a short, excellent, and provocative talk recently in Paris - here are a few key ideas

- It's all just machine learning (ML) - the AI moniker is hype

- The late Dave Rumelhart should've received a Nobel prize for his early ideas on making backprop work

1/n Image The "Silicon Valley Fever Dream" is that data will create knowledge, which will lead to super intelligence, and a bunch of people will get very rich.....

2/n Image
Jan 26 11 tweets 4 min read
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?

I wrote a thread about this late last year. I'll repeat it here and include a link to the slides at the end of the thread.

1/n Image In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. The classic KR is based on interpolation kernels that are a function of the position (x) of the samples and not on the values (y) of the samples.

2/n Image
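A minimal Nadaraya-Watson sketch matching that description - the weights are a function of the positions x only, never the values y (the Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np

def kernel_regression(x_query, x, y, h=0.3):
    """Classic kernel regression: weights depend only on positions x, not values y."""
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)  # Gaussian kernel
    w /= w.sum(axis=1, keepdims=True)                              # normalize the weights
    return w @ y                                                   # weighted average of samples

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.normal(size=x.size)      # noisy samples of an implicit function

x_grid = np.linspace(0, 2 * np.pi, 200)
y_hat = kernel_regression(x_grid, x, y)            # point-wise fit on a grid
```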
Jan 25 5 tweets 2 min read
The softmax

σ(z) for z =[z₁,...,zₙ]ᵀ

is defined element-wise as:

σ(zᵢ) = exp(λzᵢ) / Σⱼexp(λzⱼ)

It is a "nice" function w/ very useful smoothness properties. Here are some:

1/4: Softmax is Lipschitz w/ constant λ. Its Hessian & 3rd order derivative are also Lipschitz.

2/4: The softmax function is the gradient of the log-sum-exp function, lse(z):

σ(z) = ∇ lse(z)

where

lse(z) = λ⁻¹ log[ Σⱼexp(λzⱼ) ]
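A quick numerical check of that gradient identity (a sketch; λ is the inverse temperature above):

```python
import numpy as np

lam = 2.0
z = np.array([0.3, -1.2, 0.8, 0.1])

def softmax(z):
    e = np.exp(lam * (z - z.max()))          # shift by max for numerical stability
    return e / e.sum()

def lse(z):
    return np.log(np.exp(lam * z).sum()) / lam

# Central finite-difference gradient of lse should match softmax(z).
eps = 1e-6
E = np.eye(z.size)
grad_fd = np.array([(lse(z + eps * E[i]) - lse(z - eps * E[i])) / (2 * eps)
                    for i in range(z.size)])
print(np.allclose(grad_fd, softmax(z), atol=1e-6))   # True: softmax = ∇ lse
```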
Dec 20, 2024 4 tweets 2 min read
Years ago, when my wife and I were planning to buy a home, my dad stunned me with a quick mental calculation of loan payments.

I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th century Iran.

1/4 Image The origin of the formula my dad knew is a mystery, but I know it has been used in the bazaars of Iran (and elsewhere) for as long as anyone can remember

It has an advantage: it's very easy to compute on an abacus. The exact compounding formula is much more complicated

2/4 Image
Dec 8, 2024 11 tweets 4 min read
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?

My goal is not to argue who should get credit for what, but to show a progression of closely related ideas over time and across neighboring fields.

1/n Image In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. The classic KR is based on interpolation kernels that are a function of the position (x) of the samples and not on the values (y) of the samples.

2/n Image
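As a sketch of where this is heading: a single softmax attention head is exactly a normalized (exponential, data-dependent) kernel weighting of the values - toy dimensions below, no claim about any particular architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 16, 10, 8
Q = rng.normal(size=(m, d))   # queries ~ points where we evaluate the fit
K = rng.normal(size=(n, d))   # keys    ~ sample "positions"
V = rng.normal(size=(n, d))   # values  ~ sample "values"

# Kernel-smoother form: k(q, k_j) = exp(q·k_j / sqrt(d)); output = normalized weighted average.
W = np.exp(Q @ K.T / np.sqrt(d))
W /= W.sum(axis=1, keepdims=True)
out_kernel = W @ V

# The same thing phrased as softmax attention.
def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

out_attn = softmax_rows(Q @ K.T / np.sqrt(d)) @ V
print(np.allclose(out_kernel, out_attn))   # True
```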
Dec 3, 2024 5 tweets 2 min read
“On a log-log plot, my grandmother fits on a straight line.”
-Physicist Fritz Houtermans

There's a lot of truth to this. log-log plots are often abused and can be very misleading

1/5 Image A plot of empirical data can reveal hidden phenomena or scaling. An important and common model is to look for power laws like

p(x) ≃ L(x) xᵃ

where L(x) is slowly varying, so that xᵃ is dominant

Power laws appear all over physics, biology, math, econ, etc. However...

2/5
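A small sketch of the "however": draw samples from a lognormal - which has no power-law tail - and its empirical tail still traces a convincingly straight line on a log-log plot (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)   # lognormal: NOT a power law

# Empirical CCDF P(X > x), evaluated at the sorted samples.
xs = np.sort(x)
ccdf = 1.0 - np.arange(1, xs.size + 1) / xs.size

# Straight-line fit in log-log coordinates over the upper tail.
tail = (xs > np.quantile(xs, 0.9)) & (ccdf > 0)
a, b = np.polyfit(np.log(xs[tail]), np.log(ccdf[tail]), 1)
print(f"apparent power-law exponent ≈ {a:.2f}")
# The tail looks nearly straight over a couple of decades, which is exactly
# why lognormals are routinely mistaken for power laws.
```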
Nov 10, 2024 6 tweets 3 min read
Integral geometry is a beautiful topic bridging geometry, probability & statistics

Say you have a curve with any shape, possibly even self-intersecting. How can you measure its length?

This has many applications - the curve could be a strand of DNA or a twisted length of wire

1/n Image A curve is a collection of tiny segments. Measure each segment & sum. You can go further: make the segments so small they are essentially points, count the red points

A practical way to do this: drop many lines, or a dense grid, intersecting the shape & count intersections

2/n Image
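A sketch of that line-dropping recipe, via a Monte Carlo version of the Cauchy-Crofton formula (the circle test curve and sampling box are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Test curve: a unit circle sampled as a polyline (true length = 2*pi).
t = np.linspace(0, 2 * np.pi, 201)
curve = np.stack([np.cos(t), np.sin(t)], axis=1)
A, B = curve[:-1], curve[1:]                 # polyline segments

R, N = 2.0, 10_000   # the curve fits in a disk of radius R; drop N random lines

# A line is x*cos(theta) + y*sin(theta) = p, with theta ~ U[0, pi), p ~ U[-R, R].
theta = rng.uniform(0, np.pi, N)
p = rng.uniform(-R, R, N)
normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# A segment crosses a line iff its endpoints' signed distances have opposite signs.
fa = A @ normals.T - p
fb = B @ normals.T - p
counts = ((fa * fb) < 0).sum(axis=0)         # intersections per line

# Cauchy-Crofton: length = (1/2) * measure of intersecting lines (with multiplicity),
# which for this sampling gives  length ≈ pi * R * E[#intersections].
print(f"estimated length {np.pi * R * counts.mean():.3f} vs true {2 * np.pi:.3f}")
```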
Oct 16, 2024 5 tweets 2 min read
Matrix nearness problems arise often:

Given matrix A, what’s the closest matrix that is, say, symmetric, positive definite, orthogonal, or bi-stochastic?

Symmetry:

There are many ways to symmetrize A: e.g. √AᵀA. For unitarily invariant norms, the closest is

Aₛ = (Aᵀ + A)/2

1/5
Positive Semi-Definiteness:

In the Frobenius norm (other norms are hard):

Aₚ = (Aₛ + H)/2 is the unique positive approximant of A

where Aₛ = (Aᵀ + A)/2 is the symmetric part and Aₛ = UH is its polar decomposition

Application:
Making an indefinite Hessian positive semi-definite. Useful in optimization

2/5
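A numerical sketch of both projections; for the Frobenius norm, clipping the negative eigenvalues of the symmetric part is the same as the (Aₛ + H)/2 formula:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))            # an arbitrary square matrix

# Nearest symmetric matrix: the symmetric part.
A_s = (A + A.T) / 2

# Nearest symmetric PSD matrix in the Frobenius norm:
# clip the negative eigenvalues of A_s to zero, equivalent to (A_s + H)/2 with A_s = UH.
w, V = np.linalg.eigh(A_s)
A_p = (V * np.clip(w, 0, None)) @ V.T

print(np.linalg.eigvalsh(A_p).min() >= -1e-12)   # True: A_p is positive semi-definite
```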
Sep 24, 2024 4 tweets 2 min read
Smoothing splines fit a function to data as the sol'n of a regularized least-squares optimization problem.

But it’s also possible to do it in one shot with an unusually shaped kernel (see figure)

Is it possible to solve other optimization problems this way? Surprisingly yes

1/n Image This is just one instance of how one can “kernelize” an optimization problem. That is, approximate the solution of an optimization problem in just one step by constructing and applying a kernel once to the input

Given some conditions, you can do it much more generally

2/n Image
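A small sketch of the "solve it once, read off a kernel" idea - a discrete quadratic smoother rather than an actual spline, just to show the equivalent kernel appearing (illustrative):

```python
import numpy as np

n, lam = 200, 50.0
rng = np.random.default_rng(0)
t = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=n)   # noisy samples

# Regularized least squares:  min_f ||y - f||^2 + lam * ||D f||^2,
# with D a second-difference (roughness) operator.
D = np.diff(np.eye(n), n=2, axis=0)
S = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))   # smoother matrix

f_hat = S @ y      # the whole optimization solved in one shot: apply a fixed "kernel"
row = S[n // 2]    # one row of S is the (unusually shaped) equivalent kernel at t = 0.5
```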
Sep 18, 2024 4 tweets 2 min read
Mean-shift iteratively moves points towards regions of higher density. It does so by placing a kernel at each data point, calculating the mean of the data points within that window, and shifting each point towards this mean until convergence. Look familiar?

1/n
(Animation @gabrielpeyre) The first term on the right-hand side of the ODE has the form of a pseudo-linear denoiser f(x) = W(x) x: a weighted average of the points, where the weights depend on the data. The overall mean-shift process is a lot like a residual flow:

d/dt x(t) = f(x(t)) - x(t)

2/n Image
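A minimal mean-shift sketch with a Gaussian kernel, written in the pseudo-linear form f(x) = W(x) x (the bandwidth and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two blobs in 2-D; mean-shift should drift every point towards one of the two modes.
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(+2, 0.5, (100, 2))])

def mean_shift(X, h=0.8, n_iter=50):
    Z = X.copy()
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances to data
        W = np.exp(-0.5 * d2 / h**2)                          # Gaussian kernel weights
        W /= W.sum(axis=1, keepdims=True)
        Z = W @ X        # pseudo-linear step: a data-dependent weighted average
    return Z

modes = mean_shift(X)
print(np.unique(np.round(modes, 1), axis=0))   # roughly the two cluster centers
```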
Sep 5, 2024 8 tweets 4 min read
Random matrices are very important in modern statistics and machine learning, not to mention physics

A model about which much less is known is that of matrices sampled uniformly from the set of doubly stochastic matrices: Uniformly Distributed Stochastic Matrices

A thread -

1/n
First, what are doubly stochastic matrices?
Non-negative matrices whose row & column sums=1.

The set of doubly stochastic matrices is also known as the Birkhoff polytope: an (n−1)² dimensional convex polytope in ℝⁿˣⁿ with extreme points being permutation matrices.

2/n Image
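For hands-on intuition, a sketch that lands inside the Birkhoff polytope via Sinkhorn normalization - note this is NOT a uniform sample from the polytope (uniform sampling is the hard part), just a convenient way to produce doubly stochastic matrices:

```python
import numpy as np

def sinkhorn(M, n_iter=500):
    """Alternately normalize rows and columns of a positive matrix."""
    M = M.copy()
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)   # make row sums 1
        M /= M.sum(axis=0, keepdims=True)   # make column sums 1
    return M

rng = np.random.default_rng(0)
D = sinkhorn(rng.uniform(0.1, 1.0, size=(5, 5)))

print(np.allclose(D.sum(axis=0), 1), np.allclose(D.sum(axis=1), 1))   # True True
print((D >= 0).all())                                                 # non-negative entries
```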
Sep 1, 2024 10 tweets 3 min read
The perpetually undervalued least-squares:

minₓ‖y−Ax‖²

can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.

Let's assume A is n-by-p. So we have n data points and p parameters

1/10 Image If n ≥ p (“under-fitting” or “over-determined" case) the solution is

x̃ = (AᵀA)⁻¹ Aᵀ y

But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error. We pick the minimum-norm (min ‖x‖²) solution:

x̃ = Aᵀ(AAᵀ)⁻¹ y

2/10
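The two regimes in a few lines of numpy (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-determined (n >= p): the unique least-squares solution.
n, p = 50, 10
A, y = rng.normal(size=(n, p)), rng.normal(size=n)
x_over = np.linalg.solve(A.T @ A, A.T @ y)             # (AᵀA)⁻¹ Aᵀ y

# Under-determined (n < p): infinitely many zero-error solutions;
# the minimum-norm one is Aᵀ(AAᵀ)⁻¹ y, which is what the pseudo-inverse returns.
n, p = 10, 50
A, y = rng.normal(size=(n, p)), rng.normal(size=n)
x_min_norm = A.T @ np.linalg.solve(A @ A.T, y)

print(np.allclose(A @ x_min_norm, y))                  # True: zero training error
print(np.allclose(x_min_norm, np.linalg.pinv(A) @ y))  # True: matches the pseudo-inverse
```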
Aug 26, 2024 4 tweets 2 min read
There’s a single formula that makes all of your diffusion models possible: Tweedie's

Say 𝐱 is a noisy version of 𝐮 with 𝐞 ∼ 𝒩(𝟎, σ² 𝐈)

𝐱 = 𝐮 + 𝐞

MMSE estimate of 𝐮 is 𝔼[𝐮 | 𝐱] and would seem to require P(𝐮|𝐱). Yet Tweedie says P(𝐱) is all you need

1/3 Image Tweedie is the lynchpin connecting the score function exactly to MMSE denoising residuals like so:

σ²∇logP(𝐱) = σ²∇P(𝐱)/P(𝐱) = 𝔼[𝐮|𝐱] - 𝐱

But 𝔼[𝐮|𝐱] is a specific, often inaccessible denoiser. So we replace it with a deep denoiser like UNet, etc. Pretty simple.

2/3
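A quick sanity check in the one case where everything is closed-form - a Gaussian prior, where both sides of the identity can be written down exactly (illustrative numbers):

```python
import numpy as np

tau, sigma = 2.0, 0.5    # prior std of u, noise std
x = 1.3                  # an observed noisy value

# With u ~ N(0, tau^2) and x = u + e, e ~ N(0, sigma^2):
#   x ~ N(0, tau^2 + sigma^2), so the score is d/dx log p(x) = -x / (tau^2 + sigma^2)
score = -x / (tau**2 + sigma**2)

# The MMSE denoiser E[u|x] is the usual Gaussian shrinkage.
posterior_mean = tau**2 / (tau**2 + sigma**2) * x

# Tweedie:  sigma^2 * score == E[u|x] - x
print(np.isclose(sigma**2 * score, posterior_mean - x))   # True
```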
Aug 18, 2024 7 tweets 3 min read
Two basic concepts are often conflated:

Sample Standard Deviation (SD) vs Standard Error (SE)

Say you want to estimate m=𝔼(x) from N independent samples xᵢ. A typical choice is the average or "sample" mean m̂

But how stable is this? That's what the Standard Error tells you:

1/6 Image Since m̂ is itself a random variable, we need to quantify the uncertainty around it too: this is what the Standard Error does.

The Standard Error is *not* the same as the spread of the samples - that's the Standard Deviation (SD) - but the two are closely related:

2/6 Image
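A few lines to make the distinction concrete (a sketch with synthetic samples):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
x = rng.normal(loc=3.0, scale=2.0, size=N)   # samples with true SD = 2

sd = x.std(ddof=1)          # spread of the samples themselves
se = sd / np.sqrt(N)        # uncertainty of the sample mean m_hat

print(f"SD ≈ {sd:.2f}  (stays near 2 no matter how large N gets)")
print(f"SE ≈ {se:.2f}  (shrinks like 1/sqrt(N))")

# Cross-check by simulation: the spread of many independent sample means matches SE.
means = rng.normal(3.0, 2.0, size=(10_000, N)).mean(axis=1)
print(f"empirical SD of the sample mean ≈ {means.std():.2f}")
```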
Aug 15, 2024 13 tweets 9 min read
Did you ever take a photo & wish you'd zoomed in more or framed better? When this happens, we just crop.

Now there's a better way: Zoom Enhance - a new feature my team just shipped on Pixel. Available in Google Photos under Tools, it enhances both zoomed & un-zoomed images

1/n Image Zoom Enhance is our first image-to-image diffusion model designed & optimized to run fully on-device. It allows you to crop or frame the shot you wanted, and enhance it - after capture. The input can be from any device, Pixel or not, old or new. Below are some examples & use cases

2/n [example images]