Distinguished Scientist at Google. Computational Imaging, Machine Learning, and Vision. Tweets = personal opinions. May change or disappear over time.
Dec 20, 2024 • 4 tweets • 2 min read
Years ago, when my wife and I were planning to buy a home, my dad stunned me with a quick mental calculation of loan payments.
I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th century Iran.
1/4
The origin of the formula my dad knew is a mystery, but I know it has been used in the bazaars of Iran (and elsewhere) for as long as anyone can remember
It has an advantage: it's very easy to compute on an abacus. The exact compounding formula is much more complicated
2/4
Dec 8, 2024 • 11 tweets • 4 min read
How are Kernel Smoothing in statistics, Data-Adaptive Filters in image processing, and Attention in Machine Learning related?
My goal is not to argue who should get credit for what, but to show a progression of closely related ideas over time and across neighboring fields.
1/n
In the beginning there was Kernel Regression - a powerful and flexible way to fit an implicit function point-wise to samples. The classic KR is based on interpolation kernels that are a function of the position (x) of the samples and not on the values (y) of the samples.
2/n
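A minimal sketch of classic kernel regression in its Nadaraya-Watson form. The Gaussian kernel, the bandwidth `h`, and the test signal are illustrative assumptions, not taken from the thread. Note that the weights are built from the sample positions x only, never from the values y.

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, h=0.3):
    """Nadaraya-Watson estimate with a Gaussian kernel of bandwidth h (assumed)."""
    # Pairwise differences between query and sample positions: shape (Q, N)
    diff = x_query[:, None] - x_train[None, :]
    w = np.exp(-0.5 * (diff / h) ** 2)      # weights depend on positions only
    w /= w.sum(axis=1, keepdims=True)       # each row sums to 1
    return w @ y_train                      # weighted average of the sample values

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2 * np.pi, 200)        # sample positions
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)   # noisy sample values
xq = np.linspace(1.0, 5.0, 50)              # query points away from the boundary
y_hat = kernel_regression(x, y, xq)
```

Data-adaptive filters and attention differ mainly in letting the weights depend on the values as well.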
Dec 3, 2024 • 5 tweets • 2 min read
“On a log-log plot, my grandmother fits on a straight line.”
-Physicist Fritz Houtermans
There's a lot of truth to this. Log-log plots are often abused and can be very misleading
1/5
A plot of empirical data can reveal hidden phenomena or scaling. An important and common model is to look for power laws like
p(x) ≃ L(x) xᵃ
where L(x) is slowly varying, so that xᵃ is dominant
Power laws appear all over physics, biology, math, econ. etc., however...
2/5
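A small sketch of why a straight line on a log-log plot is weak evidence on its own. The helper `loglog_fit`, the exponent -1.5, and the range [1, 3] are assumptions chosen for illustration. It fits a line to (log x, log p) for a true power law and for an exponential, which is not a power law at all.

```python
import numpy as np

def loglog_fit(x, p):
    """Least-squares line through (log x, log p); returns slope and R^2."""
    u, v = np.log(x), np.log(p)
    slope, intercept = np.polyfit(u, v, 1)
    resid = v - (slope * u + intercept)
    r2 = 1.0 - resid.var() / v.var()
    return slope, r2

x = np.geomspace(1.0, 3.0, 50)                        # a narrow range, under one decade
slope_pow, r2_pow = loglog_fit(x, 2.0 * x ** -1.5)    # a true power law with a = -1.5
slope_exp, r2_exp = loglog_fit(x, np.exp(-x))         # an exponential, no power law here
```

Over this short range the exponential also gives a high R², so the fit quality alone does not distinguish the two.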
Nov 10, 2024 • 6 tweets • 3 min read
Integral geometry is a beautiful topic bridging geometry, probability & statistics
Say you have a curve with any shape, possibly even self-intersecting. How can you measure its length?
This has many applications - curve could be a strand of DNA or a twisted length of wire
1/n
A curve is a collection of tiny segments. Measure each segment & sum. You can go further: make the segments so small they are essentially points, count the red points
A practical way to do this: drop many lines, or a dense grid, intersecting the shape & count intersections
2/n
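The line-dropping idea can be sketched with the Cauchy-Crofton formula: the length equals half the integral of the crossing count over all lines. This is a Monte Carlo sketch under assumptions of my own: the curve is a polyline, lines are parameterized by a normal angle and a signed offset, and `crofton_length` is a hypothetical helper name.

```python
import numpy as np

def crofton_length(points, n_lines=20000, seed=0):
    """Estimate the length of a polyline (K+1, 2) by counting random-line crossings."""
    rng = np.random.default_rng(seed)
    R = np.linalg.norm(points, axis=1).max() * 1.5    # disc of radius R covers the curve
    theta = rng.uniform(0.0, np.pi, n_lines)          # direction of each line's normal
    p = rng.uniform(-R, R, n_lines)                   # signed offset from the origin
    normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Signed distance of every vertex to every line: shape (K+1, n_lines)
    d = points @ normals.T - p
    # A segment crosses a line when its endpoints lie on opposite sides
    crossings = (d[:-1] * d[1:] < 0).sum(axis=0)
    # Cauchy-Crofton: length = (1/2) * integral of crossing counts over (theta, p)
    return 0.5 * (np.pi * 2 * R) * crossings.mean()

t = np.linspace(0.0, 2 * np.pi, 201)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)     # unit circle, true length 2*pi
est = crofton_length(circle)
```

Self-intersections need no special handling, since every crossing is counted wherever it occurs.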
Sep 24, 2024 • 4 tweets • 2 min read
Smoothing splines fit a function to data as the sol'n of a regularized least-squares optimization problem.
But it’s also possible to do it in one shot with an unusually shaped kernel (see figure)
Is it possible to solve other optimization problems this way? Surprisingly yes
1/n
This is just one instance of how one can “kernelize” an optimization problem. That is, approximate the solution of an optimization problem in just one-step by constructing and applying a kernel once to the input
Given some conditions you can do it much more generally
2/n
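A discrete analogue, as a sketch rather than the smoothing spline itself. It assumes a second-difference penalty on a uniform grid, and the values of `n` and `lam` are arbitrary. The regularized least-squares solution is linear in the data, so it can be applied as one fixed matrix whose rows act as the kernel.

```python
import numpy as np

n, lam = 101, 10.0                       # signal length and penalty weight (assumed)
D = np.diff(np.eye(n), n=2, axis=0)      # second-difference operator, shape (n-2, n)
# Minimizer of ||y - z||^2 + lam * ||D z||^2 is z = S y, with S fixed once lam is chosen
S = np.linalg.inv(np.eye(n) + lam * D.T @ D)

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 3, n)) + 0.2 * rng.standard_normal(n)
z = S @ y                                # "one shot": a single matrix-vector product
kernel = S[n // 2]                       # middle row = equivalent kernel at the center
```

The middle row sums to one and has small negative side lobes, which is what makes the kernel look unusual.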
Sep 18, 2024 • 4 tweets • 2 min read
Mean-shift iteratively moves points towards regions of higher density. It does so by placing a kernel at each data point, calculating the mean of the data points within that window, and shifting points towards this mean until convergence: Look familiar?
1/n (Animation @gabrielpeyre)
The first term on the right hand side of the ODE has the form of a pseudo-linear denoiser f(x) = W(x) x. A weighted average of the points where the weights depend on the data. The overall mean-shift process is a lot like a residual flow:
d/dt x(t) = f(x(t)) - x(t)
2/n
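A minimal sketch of the residual-flow view, discretizing the ODE with Euler steps. It assumes Gaussian weights with bandwidth `h` and recomputes the weights from the current points at every step (the variant sometimes called blurring mean-shift). The two-blob data are made up for illustration.

```python
import numpy as np

def mean_shift_step(x, h=1.0, dt=1.0):
    """One Euler step of dx/dt = W(x) x - x with Gaussian weights of bandwidth h."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    W = np.exp(-0.5 * sq / h**2)
    W /= W.sum(axis=1, keepdims=True)     # row-stochastic, data-dependent weights
    return x + dt * (W @ x - x)           # dt = 1 gives the plain update x <- W(x) x

rng = np.random.default_rng(0)
blob_a = rng.normal(-5.0, 0.5, size=(50, 2))
blob_b = rng.normal(+5.0, 0.5, size=(50, 2))
x = np.vstack([blob_a, blob_b])
for _ in range(30):
    x = mean_shift_step(x)
```

After a few steps each blob collapses onto a single point near its own center.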
Sep 5, 2024 • 8 tweets • 4 min read
Random matrices are very important in modern statistics and machine learning, not to mention physics
A model about which much less is known is uniformly sampled matrices from the set of doubly stochastic matrices: Uniformly Distributed Stochastic Matrices
A thread -
1/n
First, what are doubly stochastic matrices?
Non-negative matrices whose row & column sums=1.
The set of doubly stochastic matrices is also known as the Birkhoff polytope: an (n−1)² dimensional convex polytope in ℝⁿˣⁿ with extreme points being permutation matrices.
2/n
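A quick numerical check of the polytope picture: any convex combination of permutation matrices is doubly stochastic. This is only a sketch with arbitrary `n` and `k`, and drawing Dirichlet weights this way does not sample uniformly from the Birkhoff polytope.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 6                                   # matrix size, number of permutations (assumed)
coeffs = rng.dirichlet(np.ones(k))            # convex weights: nonnegative, sum to 1
# Each permutation matrix is a row-shuffled identity, an extreme point of the polytope
perms = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
M = sum(c * P for c, P in zip(coeffs, perms))

row_sums = M.sum(axis=1)
col_sums = M.sum(axis=0)
```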
Sep 1, 2024 • 10 tweets • 3 min read
The perpetually undervalued least-squares:
minₓ‖y−Ax‖²
can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.
Let's assume A is n-by-p. So we have n data points and p parameters
1/10
If n ≥ p (“under-fitting” or “over-determined” case) the solution is
x̃ = (AᵀA)⁻¹ Aᵀ y
But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error. We pick min‖x‖² norm solution:
x̃ = Aᵀ(AAᵀ)⁻¹ y
2/10
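Both closed forms can be checked numerically. The sketch below assumes random Gaussian matrices, which have full column rank (first case) or full row rank (second case) with probability one; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-determined case: n >= p, assuming A has full column rank
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
x_over = np.linalg.solve(A.T @ A, A.T @ y)        # (A^T A)^{-1} A^T y

# Under-determined case: n < p, assuming B has full row rank
B = rng.standard_normal((5, 20))
b = rng.standard_normal(5)
x_under = B.T @ np.linalg.solve(B @ B.T, b)       # B^T (B B^T)^{-1} b

# Both closed forms coincide with the Moore-Penrose pseudoinverse solution
x_over_pinv = np.linalg.pinv(A) @ y
x_under_pinv = np.linalg.pinv(B) @ b
```

In the under-determined case the residual `B @ x_under - b` is zero up to rounding, which is the zero training error mentioned above.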
Aug 18, 2024 • 7 tweets • 3 min read
Two basic concepts are often conflated:
Sample Standard Deviation (SD) vs Standard Error (SE)
Say you want to estimate m=𝔼(x) from N independent samples xᵢ. A typical choice is the average or "sample" mean m̂
But how stable is this? That's what Standard Error tells you:
1/6
Since m̂ is itself a random variable, we need to quantify the uncertainty around it too: this is what the Standard Error does.
The Standard Error is *not* the same as the spread of the samples - that's the Standard Deviation (SD) - but the two are closely related:
2/6
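A simulation sketch of the relation SE = SD/√N for independent samples. The Gaussian data, the true SD of 2, and the sample sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 20000
samples = rng.normal(loc=3.0, scale=2.0, size=(trials, N))   # true SD is 2 (assumed)

sd = samples[0].std(ddof=1)            # spread of one set of N samples
se_formula = sd / np.sqrt(N)           # standard error estimated from that one set
se_empirical = samples.mean(axis=1).std(ddof=1)   # actual spread of the sample mean
```

Here the SD stays near 2 however large N gets, while the SE shrinks like 1/√N.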
Aug 15, 2024 • 13 tweets • 9 min read
Did you ever take a photo & wish you'd zoomed in more or framed better? When this happens, we just crop.
Now there's a better way: Zoom Enhance -a new feature my team just shipped on Pixel. Available in Google Photos under Tools, it enhances both zoomed & un-zoomed images
1/n
Zoom Enhance is our first im-to-im diffusion model designed & optimized to run fully on-device. It allows you to crop or frame the shot you wanted, and enhance it -after capture. The input can be from any device, Pixel or not, old or new. Below are some examples & use cases
2/n
Aug 10, 2024 • 7 tweets • 3 min read
Image-to-image models have been called 'filters' since the early days of comp vision/imaging. But what does it mean to filter an image?
If we choose some set of weights and apply them to the input image, what loss/objective function does this process optimize (if any)?
1/7
Such filters can often be written as matrix-vector operations. Think of z, y, and the corresponding weights as vectors and you have a tidy expression relating (all) output pixels to (all) input pixels. If the filter is local (has a small footprint), most weight will be zero.
2/7
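A toy 1D version of the matrix-vector form z = W y. The 3-tap weights and the signal are made up, and taps that fall off the edge are simply dropped (zero padding).

```python
import numpy as np

n = 8
taps = np.array([0.25, 0.5, 0.25])       # a small local filter (illustrative weights)

# Row i of W holds the weights that produce output pixel i from all input pixels
W = np.zeros((n, n))
for i in range(n):
    for k, w in zip((-1, 0, 1), taps):
        j = i + k
        if 0 <= j < n:                   # taps falling off the edge are dropped
            W[i, j] = w

y = np.arange(n, dtype=float) ** 2       # input "image" flattened to a vector
z = W @ y                                # all output pixels in one product
sparsity = np.mean(W == 0)               # fraction of weights that are exactly zero
```

For a data-adaptive filter the rows of W would be recomputed from y itself instead of being fixed.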
Jul 21, 2024 • 4 tweets • 2 min read
Images aren’t arbitrary collections of pixels -they have complicated structure, even small ones. That’s why it’s hard to generate images well. Let me give you an idea:
3×3 gray images represented as points in ℝ⁹ lie approximately on a 2-D manifold: the Klein bottle!
1/3
Images can be thought of as vectors in high-dim. It’s been long hypothesized that images live on low-dim manifolds (hence manifold learning). It’s a reasonable assumption: images of the world are not arbitrary. The low-dim structure arises due to physical constraints & laws
2/3
Apr 3, 2024 • 5 tweets • 2 min read
We often assume bigger generative models are better. But when practical image generation is limited by compute budget is this still true? Answer is no
By looking at latent diffusion models across different scales our paper sheds light on the quality vs model size tradeoffs
1/5
We trained a range of txt-2-image LDMs & observed a notable trend: when constrained by compute budget smaller models frequently outperform their larger siblings in image quality. For example the sampling result of a 223M model can be better than results of a model 4x larger
2/5
Apr 2, 2024 • 19 tweets • 8 min read
It’s been >20 years since I published my first work on multi-frame super-res (SR) w/ Nhat Nguyen and the late great Gene Golub. Here’s my personal story of SR as I’ve experienced it from theory, to practical algorithms, to deployment in product. In a way it’s been my life’s work
Tsai and Huang (1984) were the first to publish the concept of multi-frame super-resolution. Key idea was that a high resolution image is related to its shifted and low-resolution versions in the frequency domain through the shift and aliasing properties of the Fourier transform
Apr 1, 2024 • 4 tweets • 2 min read
Motion blur is often misunderstood, because people think of it in terms of a single imperfect image captured at some instant in time.
But motion blur is in fact an inherently temporal phenomenon. It is a temporal convolution of pixels (at the same location) across time.
1/4
Integration across time (eg open shutter) gives motion blur w/ strength depending on the speed of objects
A mix of object speed, shutter speed and frame rate together can cause aliasing in time (spokes moving backwards) & blur in space (wheel surface) all in the same image
2/4
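A toy sketch of blur as integration over time: a single bright pixel moves across a 1D sensor while the shutter is open, and each pixel is averaged over time at its own location. The sensor width, the number of sub-frames, and the speed are all assumed values.

```python
import numpy as np

width, n_sub = 64, 16                    # sensor width, sub-frames per exposure (assumed)
speed = 1                                # pixels the object moves per sub-frame (assumed)

# Sharp instantaneous frames: a single bright pixel moving to the right
frames = np.zeros((n_sub, width))
for t in range(n_sub):
    frames[t, 10 + speed * t] = 1.0

# Open shutter = average each pixel over time, at the same location
blurred = frames.mean(axis=0)
streak_len = np.count_nonzero(blurred)   # blur extent grows with speed * exposure
```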
Mar 27, 2024 • 4 tweets • 2 min read
This is not a scene from Inception. The sorcery is a real photo taken with a very long focal length lens. When the focal length is long, the field of view becomes very small and the resulting image appears more flat.
1/4
Here's another example:
The Empire State Building and the Statue of Liberty are about 4.5 miles apart, and the building is 5x taller.
2/4
Mar 24, 2024 • 5 tweets • 3 min read
What is resolution in an image? It is not the number of pixels. Here’s the classical Rayleigh criterion taught in basic physics:
1/5
This concept is important in imaging because it guides how densely we should pack pixels together to avoid or allow aliasing. (Yes, sometimes aliasing is useful!)
2/5
Mar 12, 2024 • 6 tweets • 3 min read
One of the lesser known ways to compare estimators is "admissibility".
An estimator θ* = g(y) of θ from data y is called *in*admissible if g is uniformly dominated by another estimator h(y) for all values of θ, say in the MSE sense.
1/6
Being admissible doesn't mean the estimator is good; but it's a very useful idea to weed out the bad ones.
A great example is Stein's:
The maximum likelihood estimate of Gaussian mean is inadmissible in d≥3. The nonlinear "shrinkage" that pulls y towards origin beats it
2/6
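A Monte Carlo sketch of Stein's example, assuming one observation y ~ N(θ, I) in d = 10 dimensions and the standard James-Stein shrinkage factor 1 − (d−2)/‖y‖². The particular θ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 10, 20000
theta = np.full(d, 1.0)                           # an arbitrary true mean (assumed)
y = theta + rng.standard_normal((trials, d))      # one observation y ~ N(theta, I) per trial

mle = y                                           # maximum likelihood estimate is y itself
shrink = 1.0 - (d - 2) / (y ** 2).sum(axis=1, keepdims=True)
js = shrink * y                                   # James-Stein: pull y towards the origin

mse_mle = ((mle - theta) ** 2).sum(axis=1).mean() # close to d
mse_js = ((js - theta) ** 2).sum(axis=1).mean()   # smaller, for any theta when d >= 3
```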
Mar 8, 2024 • 6 tweets • 2 min read
The familiar differential expression for the Laplacian doesn’t reveal its true nature: It is really a center-surround operator. This is easy to see in 1D :
1/6
The same is true in ℝᵈ :
The Laplacian measures how different the function’s value is at the center of a ball as compared to its local average over the ball.
2/6
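A 1D numerical check of the center-surround reading. As a simplifying assumption, the "surround" here is the average of the two endpoints x ± h, for which the scale factor is 2/h²; averaging over the whole interval would change only that constant.

```python
import numpy as np

def laplacian_center_surround(f, x, h=1e-3):
    """1D Laplacian as (scaled) local average minus center value."""
    surround = 0.5 * (f(x - h) + f(x + h))     # average of the two neighbors at distance h
    return 2.0 * (surround - f(x)) / h**2

x0 = 0.7
approx = laplacian_center_surround(np.sin, x0)
exact = -np.sin(x0)                            # second derivative of sin
```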
Mar 1, 2024 • 9 tweets • 3 min read
The perpetually undervalued least-squares:
minₓ‖y−Ax‖²
can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.
Let's assume A is n-by-p. So we have n data points and p parameters
1/9
If n ≥ p (“under-fitting” or “over-determined” case) the solution is
x̃ = (AᵀA)⁻¹ Aᵀ y
But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error.
We pick min‖x‖² norm solution:
x̃ = Aᵀ(AAᵀ)⁻¹ y
2/9
Feb 25, 2024 • 6 tweets • 2 min read
What do polar coordinates, polar matrix factorization, & Helmholtz decomposition of a vector field have in common? They’re all implied by Brenier’s Theorem: a cornerstone of Optimal Transport theory. It’s a fundamental decomposition result & deserves to be better known.
1/5
Brenier's Thm:
A non-degenerate vector field
u: Ω ⊂ ℝⁿ → ℝⁿ has a unique decomposition
u = ∇ϕ∘s
where ϕ is a convex potential on Ω, and s is measure-preserving (e.g. density → density).
Here s is a multi-dimensional “rearrangement” (a sort in 1D)
2/5
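The matrix case makes the decomposition concrete. For a linear field u(x) = A x, the polar factorization A = P Q gives ϕ(x) = ½ xᵀP x (convex, with ∇ϕ(x) = P x) and s(x) = Q x (orthogonal, hence volume-preserving). This is a sketch via the SVD with an arbitrary random A, assumed nonsingular.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))          # linear vector field u(x) = A x (assumed example)

U, sing, Vt = np.linalg.svd(A)
P = U @ np.diag(sing) @ U.T              # symmetric PSD: gradient of phi(x) = x^T P x / 2
Q = U @ Vt                               # orthogonal: s(x) = Q x preserves volume

x = rng.standard_normal(3)
u_direct = A @ x                         # u(x)
u_composed = P @ (Q @ x)                 # grad(phi) evaluated at s(x)
```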