Peyman Milanfar Profile picture
Distinguished Scientist at Google Research. Computational Imaging, Machine Learning, and Vision. Tweets = personal opinions. May change or disappear over time.
Apr 3 5 tweets 2 min read
We often assume bigger generative models are better. But when practical image generation is limited by a compute budget, is this still true? The answer is no.

By looking at latent diffusion models across different scales, our paper sheds light on the quality vs. model-size tradeoffs.

1/5

We trained a range of text-to-image LDMs and observed a notable trend: when constrained by a compute budget, smaller models frequently outperform their larger siblings in image quality. For example, the sampling result of a 223M model can be better than that of a model 4x larger.

2/5
Apr 2 19 tweets 8 min read
It’s been >20 years since I published my first work on multi-frame super-resolution (SR) with Nhat Nguyen and the late, great Gene Golub. Here’s my personal story of SR as I’ve experienced it, from theory, to practical algorithms, to deployment in product. In a way, it’s been my life’s work.

Tsai and Huang (1984) were the first to publish the concept of multi-frame super-resolution. The key idea was that a high-resolution image is related to its shifted, low-resolution versions in the frequency domain through the shift and aliasing properties of the Fourier transform.
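The frequency-domain relation Tsai and Huang exploited can be sketched as follows (a standard statement of the shift and aliasing properties, not the paper's exact notation). If the k-th low-resolution frame observes a shifted copy of the scene f, then

```latex
% Shift property: a spatial shift of f becomes a phase ramp in frequency
f_k(x) = f(x + \delta_k)
\;\Longrightarrow\;
F_k(\omega) = e^{\,i\omega\delta_k}\, F(\omega)

% Aliasing: sampling with period T folds the spectrum into overlapping replicas
G_k(\omega) = \frac{1}{T}\sum_{n} F_k\!\left(\omega + \frac{2\pi n}{T}\right)
```

Each low-res frame thus contributes linear equations in the overlapping spectral replicas of F; with enough frames at distinct shifts δₖ, the system becomes invertible and the high-resolution spectrum F can be recovered.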
Apr 1 4 tweets 2 min read
Motion blur is often misunderstood, because people think of it in terms of a single imperfect image captured at some instant in time.

But motion blur is in fact an inherently temporal phenomenon. It is a temporal convolution of pixels (at the same location) across time.

1/4

Integration across time (e.g. an open shutter) gives motion blur, with strength depending on the speed of objects.

A mix of object speed, shutter speed, and frame rate can together cause aliasing in time (spokes moving backwards) and blur in space (wheel surface), all in the same image.

2/4
Mar 27 4 tweets 2 min read
This is not a scene from Inception. The sorcery is real: the photo was taken with a very long focal-length lens. When the focal length is long, the field of view becomes very small and the resulting image appears flatter.

1/4

Here's another example:

The Empire State building and the Statue of Liberty are about 4.5 miles apart, and the building is 5x taller.

2/4
Mar 24 5 tweets 3 min read
What is resolution in an image? It is not the number of pixels. Here’s the classical Rayleigh criterion taught in basic physics:

1/5

This concept is important in imaging because it guides how densely we should pack pixels together to avoid or allow aliasing. (Yes, sometimes aliasing is useful!)

2/5
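The Rayleigh criterion is just θ ≈ 1.22 λ/D for a circular aperture. A tiny sketch (function name and the example numbers are mine, purely illustrative):

```python
import math

def rayleigh_angle(wavelength_m, aperture_m):
    """Minimum resolvable angular separation (radians) for a circular aperture."""
    return 1.22 * wavelength_m / aperture_m

# Illustrative: green light (550 nm) through a ~5 mm aperture
theta = rayleigh_angle(550e-9, 5e-3)   # ≈ 1.34e-4 rad
```

Multiplying θ by the focal length gives the diffraction spot size on the sensor, which is what dictates a sensible pixel pitch.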
Mar 12 6 tweets 3 min read
One of the lesser known ways to compare estimators is "admissibility".

An estimator θ* = g(y) of θ from data y is called *in*admissible if g is uniformly dominated by another estimator h(y), for all values of θ, say in the MSE sense.

1/6

Being admissible doesn't mean the estimator is good; but it's a very useful idea to weed out the bad ones.

A great example is Stein's:
The maximum likelihood estimate of a Gaussian mean is inadmissible in dimension d ≥ 3. The nonlinear "shrinkage" estimator that pulls y towards the origin beats it.

2/6
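Stein's phenomenon is easy to see numerically. A Monte Carlo sketch comparing the MLE (y itself) against the James–Stein shrinkage estimator (the true mean, dimension, and trial count below are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 10, 20000
theta = np.full(d, 0.5)                        # true mean (illustrative)
y = theta + rng.standard_normal((trials, d))   # y ~ N(theta, I), one row per trial

mle = y                                        # maximum likelihood estimate
norm2 = np.sum(y**2, axis=1, keepdims=True)
js = (1 - (d - 2) / norm2) * y                 # James-Stein: shrink y towards the origin

mse_mle = np.mean(np.sum((mle - theta)**2, axis=1))   # ≈ d
mse_js  = np.mean(np.sum((js  - theta)**2, axis=1))   # strictly smaller for d >= 3
```

The shrinkage estimator wins uniformly in MSE for d ≥ 3, which is exactly what makes the MLE inadmissible there.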
Mar 8 6 tweets 2 min read
The familiar differential expression for the Laplacian doesn’t reveal its true nature: it is really a center-surround operator. This is easy to see in 1D:

1/6

The same is true in ℝᵈ:

The Laplacian measures how different the function’s value is at the center of a ball as compared to its local average over the ball.

2/6
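The 1D discrete case makes the center-surround reading exact: the second difference f[i−1] − 2f[i] + f[i+1] is twice the gap between the neighbor average and the center value. A minimal check (the test signal is arbitrary):

```python
import numpy as np

f = np.sin(np.linspace(0, 3, 50))               # any 1D signal

# Discrete Laplacian (second difference) at interior samples
lap = f[:-2] - 2 * f[1:-1] + f[2:]

# Center-surround form: 2 * (local average of the two neighbors - center value)
center_surround = 2 * ((f[:-2] + f[2:]) / 2 - f[1:-1])
```

The two expressions agree identically, sample by sample; the ℝᵈ statement replaces the two neighbors with an average over a small ball.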
Mar 1 9 tweets 3 min read
The perpetually undervalued least-squares:

minₓ‖y−Ax‖²

can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.

Let's assume A is n-by-p. So we have n data points and p parameters

1/9

If n ≥ p (the “under-fitting” or “over-determined” case), the solution is

x̃ = (AᵀA)⁻¹ Aᵀ y

But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error.

We pick the minimum-norm (min ‖x‖²) solution:

x̃ = Aᵀ(AAᵀ)⁻¹ y

2/9
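Both closed forms are easy to verify numerically (random full-rank Gaussian matrices; shapes are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Over-determined (n > p): unique least-squares solution via the normal equations
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
x_over = np.linalg.solve(A.T @ A, A.T @ y)      # (AᵀA)⁻¹ Aᵀ y

# Under-determined (n < p): minimum-norm solution among the zero-error interpolants
B = rng.standard_normal((5, 20))
z = rng.standard_normal(5)
x_min = B.T @ np.linalg.solve(B @ B.T, z)       # Aᵀ(AAᵀ)⁻¹ y
```

In both regimes the formula coincides with the Moore–Penrose pseudo-inverse applied to the data, which is the bridge to the interpolation-regime analyses behind double descent.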
Feb 25 6 tweets 2 min read
What do polar coordinates, polar matrix factorization, and the Helmholtz decomposition of a vector field have in common? They’re all implied by Brenier’s Theorem, a cornerstone of Optimal Transport theory. It’s a fundamental decomposition result and deserves to be better known.

1/5

Brenier's Thm:
A non-degenerate vector field
u: Ω ⊂ ℝⁿ → ℝⁿ has a unique decomposition

u = ∇ϕ∘s

where ϕ is a convex potential on Ω, and s is measure-preserving (e.g. density → density).

Here s is a multi-dimensional “rearrangement” (a sort in 1D)

2/5
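The "sort in 1D" remark has a concrete discrete analogue (my own toy construction, not notation from the OT literature): any sample vector factors into a monotone part composed with a permutation.

```python
import numpy as np

u = np.array([3.0, 1.0, 4.0, 1.5, 2.0])   # discrete samples of u (no ties)

s = np.argsort(np.argsort(u))   # rank of each entry: a measure-preserving permutation
grad_phi = np.sort(u)           # monotone map = the derivative of a convex ϕ in 1D

recon = grad_phi[s]             # the composition ∇ϕ ∘ s recovers u exactly
```

Here sorting plays the role of the rearrangement s, and the sorted (monotone) values play the role of ∇ϕ; Brenier's theorem says the same split persists in ℝⁿ.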
Nov 21, 2023 4 tweets 1 min read
How do you move an image by a tiny (sub-pixel) amount?

Sometimes it happens accidentally - for example, if you convolve an image with an even-sized kernel (e.g. 2x2), you get a half-pixel shift, whereas convolving with an odd-sized kernel (e.g. 3x3) you won’t.

1/n
There’s a nice way to move an image by an arbitrary sub-pixel amount that’s based on a classic idea from optical flow.

Suppose we start with an image f(x,y) and want to move by a small amount (δx,δy)

f(x,y) → f(x+δx,y+δy)

2/n
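One hedged reading of that "classic idea from optical flow" is the first-order Taylor expansion f(x+δ) ≈ f(x) + δ f′(x); a 1D sketch (the signal and shift are my illustrative choices, and a 2D version would simply add δy·∂f/∂y):

```python
import numpy as np

x = np.arange(200, dtype=float)
f = np.sin(0.2 * x)                      # a smooth test signal
delta = 0.3                              # sub-pixel shift amount

# First-order (optical-flow style) approximation of f(x + delta)
f_shifted = f + delta * np.gradient(f)

f_exact = np.sin(0.2 * (x + delta))      # ground truth for this signal
err = np.max(np.abs(f_shifted - f_exact)[1:-1])   # ignore boundary samples
```

For small shifts and smooth images the approximation error is second-order in δ, which is why this trick works so well for sub-pixel motion.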
Nov 4, 2023 5 tweets 3 min read
The Kalman Filter was once a core topic in EECS curricula. Given its relevance to ML, RL, and control/robotics, I'm surprised that most researchers don't know much about it, and many papers just rediscover it. The KF seems messy and complicated, but the intuition behind it is invaluable.

1/4

I once had to explain the Kalman Filter in layperson terms in a legal matter (no maths!). No problem, I thought. Yet despite being taught the subject by one of the greats (A.S. Willsky) & having taught the subject myself, I found this startlingly difficult to do.

2/4
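The intuition survives even in the smallest possible case: a scalar filter tracking a constant from noisy measurements (a toy sketch with illustrative parameters, not the general matrix form):

```python
import numpy as np

rng = np.random.default_rng(2)
x_true, meas_var = 5.0, 1.0
z = x_true + rng.standard_normal(200)    # measurements z_k = x + noise

x_est, P = 0.0, 1e6                      # diffuse prior: we know nothing yet
for zk in z:
    K = P / (P + meas_var)               # Kalman gain: trust in the new measurement
    x_est = x_est + K * (zk - x_est)     # correct by the gain times the innovation
    P = (1 - K) * P                      # posterior uncertainty shrinks
```

Each step blends the prediction and the measurement in proportion to their uncertainties; in this constant-state case the filter gracefully reduces to a running average.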
Nov 3, 2023 6 tweets 3 min read
Take the pixels gᵢ = g(xᵢ,yᵢ) of an image as nodes in a weighted, undirected graph. The weights on each edge are the similarity between pixels, measured with a symmetric positive-definite kernel

k(i,j) = exp[−d(gᵢ,gⱼ)]

g is encoded in K. What can we learn about g from K? Can we get g back from K?

2/ Now normalize K to get a doubly-stochastic matrix W, whose rows and cols both sum to one. We do this to get an affinity matrix W that has eigenvalues in [0,1].

The scaling vector c is given by Sinkhorn's method; at the fixed point, c .* (K*c) = 1, so W = diag(c)·K·diag(c) is doubly stochastic:

c = ones(n,1);                   % initialize scalings
for k = 1:n_iters
    c = c ./ sqrt(c .* (K*c));   % symmetric Sinkhorn update
end
W = diag(c) * K * diag(c);       % doubly-stochastic affinity
Nov 1, 2023 6 tweets 3 min read
The Gaussian is a nice bumpy shape, but sometimes we hope for a smooth (i.e. C∞) function like the Gaussian that is 𝒂𝒍𝒔𝒐 compactly supported.

One such class of functions is called "Bump functions"

1/6

But why should you care about bumps? Because


* We can make bumps to your specs
* They’re useful as smooth cutoff functions
* Bumps are closed under sum, product & convolution
* Derivative of a bump is another bump
* They build smooth partitions of unity

2/6
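The standard example is easy to write down (this is the textbook bump, not a construction specific to the thread): ψ(x) = exp(−1/(1−x²)) on (−1,1) and zero elsewhere. It is C∞ everywhere, including at x = ±1, because every derivative of the exponential vanishes there.

```python
import math

def bump(x):
    """The standard bump: smooth everywhere, compactly supported on (-1, 1)."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0
```

Rescaling and translating this one function already gives you "bumps to your specs", and ratios of shifted bumps give the smooth partitions of unity mentioned above.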
Aug 28, 2023 6 tweets 2 min read
Six mathematical objects you don't need

1/6. Klein Wine Bottle
Which you have to pour from the bottom

2/6. The Möbius Shower Curtain
You're always on the wrong side
Aug 14, 2023 5 tweets 2 min read
All of computing is enabled by integrated circuits.

A key step in making ICs is optical lithography - a process similar to photo printing, used to transfer circuit patterns onto silicon wafers.

Light is shone through a mask designed to etch a desired pattern on the surface.

1/5

Sounds simple, but it isn't -

When you shine light through any stencil, the image on the other side isn't the pattern you want - it's "blurry" because of diffraction.

So we "pre-warp" the mask: modify it away from the ideal pattern, given knowledge of the blurring to come. But how?

2/5
Aug 12, 2023 6 tweets 3 min read
Even technical people get this wrong:

Sample Standard Deviation (SD) vs Standard Err (SE)

You want an estimate m̂ of m = 𝔼(x) from N independent samples xᵢ. The typical choice is the average, or "sample" mean.

How stable is this estimate?

1/6

But since the estimate m̂ is itself a random variable, we ought to quantify the uncertainty around it too: this is where the SE comes in.

The Standard Error is *not* the same as the spread of the samples - the Standard Deviation (SD); but the two are closely related.

2/6
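The relationship SE = SD/√N is easy to see by simulation: repeat the experiment many times and measure the spread of the sample means (the sample size and trial count are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 25, 10000
samples = rng.standard_normal((trials, N))   # each row: one experiment of N samples

sd_of_samples = samples.std()                # spread of individual samples, ≈ 1
se_empirical = samples.mean(axis=1).std()    # spread of the sample means across experiments
se_formula = sd_of_samples / np.sqrt(N)      # SD / sqrt(N), ≈ 0.2 here
```

The SD describes the data; the SE describes the *estimator*, and it shrinks like 1/√N as you collect more samples.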
Aug 11, 2023 5 tweets 3 min read
A typical way to solve image-to-image optimization problems is to use iterative solutions. Is there another way?

Sometimes, you can "kernelize" the problem - that is, approximate the solution in one-step by constructing and applying a kernel filter just once to the input.

1/5

The converse works too: given a one-shot kernel filter, you can interpret its action as the solution of an optimization problem.

Examples are maximum a-posteriori (MAP) or minimum mean-squared error (MMSE) estimates that can be computationally intractable. Consider denoising:

2/5
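One tractable special case can be sketched: with a Gaussian likelihood and a quadratic smoothness prior, the MAP denoiser collapses to a single linear filter applied once. The first-difference operator D, the weight λ, and the test signal below are my illustrative choices, not the thread's:

```python
import numpy as np

n, lam = 100, 5.0
rng = np.random.default_rng(4)
clean = np.sin(np.linspace(0, 2 * np.pi, n))
y = clean + 0.3 * rng.standard_normal(n)       # noisy input

# First-difference operator D, and the one-shot "kernel" filter (I + λ DᵀD)⁻¹
D = np.diff(np.eye(n), axis=0)
F = np.linalg.inv(np.eye(n) + lam * D.T @ D)

x_hat = F @ y    # minimizes ||y - x||² + λ||Dx||² in a single application
```

No iterations: the whole optimization is baked into the filter F, which is exactly the "kernelize once, apply once" idea. (For non-quadratic priors the filter becomes data-dependent, which is where the approximation comes in.)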
Jul 27, 2023 7 tweets 3 min read
1/7 The choice of nonlinear activation functions in neural networks can make a big difference. Why?

Because iterating (i.e. repeatedly composing) even simple nonlinear functions can be tricky. Wild, or chaotic, behavior can emerge even with something as simple as a quadratic.

2/7 Some activations are more well-behaved than others. Take ReLU, for example:

r(x) = max{0,x}

its iterates are completely benign: r⁽ⁿ⁾(x) = r(x), so we don't have to worry.

Most other activations, like softplus, are less benign, but still change gently with composition.
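Both behaviors are one line each to demonstrate: ReLU is idempotent under composition, while iterating a quadratic (here the logistic map, my illustrative example of the "simple quadratic") makes nearby inputs drift apart:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
x = np.linspace(-3, 3, 101)

# ReLU is idempotent: composing it with itself changes nothing
twice = relu(relu(x))        # identical to relu(x)

# Iterating the quadratic logistic map f(x) = 3.9 x (1 - x) is chaotic:
# two inputs differing by 1e-9 quickly become uncorrelated
f = lambda t: 3.9 * t * (1 - t)
a, b = 0.2, 0.2 + 1e-9
for _ in range(40):
    a, b = f(a), f(b)        # both trajectories stay bounded in [0, 1]
```

Idempotence is why deep stacks of ReLUs don't blow up under repeated composition, while the quadratic shows how easily iteration alone can produce wild behavior.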
Jun 25, 2023 4 tweets 2 min read
Years ago, when my wife and I were planning to buy our home, my dad stunned me with a quick mental calculation of loan payments. I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th-century Iran.

🧵 1/4

The origin of the formula my dad knew is a mystery, but I do know it has been used in the bazaars of Iran (and elsewhere) for as long as anyone can remember.

It has an advantage: it's very easy to compute on an abacus. The exact compounding formula is much more complicated

2/4
Jun 19, 2023 5 tweets 3 min read
Receiver Operating Characteristic (ROC) got its name in WWII from radar, which was invented to detect enemy aircraft and ships.

ROC curves show true positive rate vs false positive rate, parametrized by a detection threshold.

a small thread

1/n

animation by @dariyasydykova

ROC curves show the performance tradeoffs in a binary hypothesis test like this one:

H₁: object present
H₀: object absent

From a data vector x, we could in principle write the ROC directly in terms of x. But typically a test statistic T(x) is computed and compared to a threshold γ.

2/n
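The mechanics of tracing an ROC curve are just a threshold sweep over T(x). A minimal sketch with synthetic Gaussian scores under the two hypotheses (the distributions and their separation are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
scores_h0 = rng.standard_normal(1000)         # T(x) under H0: object absent
scores_h1 = rng.standard_normal(1000) + 2.0   # T(x) under H1: object present

# Sweep the threshold γ; each γ gives one (FPR, TPR) point on the curve
thresholds = np.linspace(-6, 8, 201)
tpr = np.array([(scores_h1 > g).mean() for g in thresholds])  # true positive rate
fpr = np.array([(scores_h0 > g).mean() for g in thresholds])  # false positive rate
```

Lowering γ moves along the curve from (0,0) towards (1,1); how far the curve bows above the diagonal measures how well T(x) separates the two hypotheses.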
Apr 20, 2023 9 tweets 3 min read
The perpetually undervalued least-squares:

minₓ‖y−Ax‖²

can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.

Let's assume A is n-by-p. So we have n data points and p parameters

1/n

If n ≥ p (the “under-fitting” or “over-determined” case), the solution is

x̃ = (AᵀA)⁻¹ Aᵀ y

But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error.

We pick min‖x‖² norm solution:

x̃ = Aᵀ(AAᵀ)⁻¹ y

2/n