Peyman Milanfar
Nov 4, 2023
The Kalman Filter was once a core topic in EECS curricula. Given its relevance to ML, RL, Ctrl/Robotics, I'm surprised that most researchers don't know much about it, and many papers just rediscover it. KF seems messy & complicated, but the intuition behind it is invaluable

1/4 Image
I once had to explain the Kalman Filter in layperson terms in a legal matter (no maths!). No problem, I thought. Yet despite being taught the subject by one of the greats (A.S. Willsky) & having taught the subject myself, I found this startlingly difficult to do.

2/4 Image
I was glad to find this little gem. It’s a 24-page writeup that is a great teaching tool, especially in introductory classes, and particularly at the undergraduate level.

The writeup seems to be out of print, but still available (albeit at a rather outrageous price)


3/4
Image
One of the very first applications of the Kalman filter was in aerospace, namely NASA’s early space missions. There’s a wonderful historical account of how the Kalman Filter went from theory to practical tool for both NASA and the aerospace industry.



4/4 ntrs.nasa.gov/api/citations/…
Image
I wrote a thread on sequential estimation (of a constant A, in this toy example) to illustrate the idea. Of course the KF is far more general - it tracks *dynamic* systems where the internal state is itself evolving & subject to uncertainties of its own
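
To make the link between sequential estimation of a constant and the Kalman filter concrete, here is a minimal sketch (not code from the thread). It assumes a scalar random-walk state x_k = x_{k-1} + w_k observed as z_k = x_k + v_k; the function name `kalman_1d` and the parameters `q`, `r`, `x0`, `p0` are hypothetical choices made for illustration. With process noise q = 0 and a very uncertain prior, the filter should reduce to the running average of the measurements.

```python
import numpy as np

def kalman_1d(measurements, q, r, x0=0.0, p0=1e6):
    """Scalar Kalman filter for an assumed random-walk state model.

    q: process-noise variance (q = 0 means the state is a constant)
    r: measurement-noise variance
    x0, p0: prior mean and variance (large p0 = nearly uninformative prior)
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state may drift, so uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Toy example: noisy measurements of a constant A (values are illustrative).
rng = np.random.default_rng(0)
A = 3.0
z = A + rng.normal(0.0, 1.0, 500)

static_est = kalman_1d(z, q=0.0, r=1.0)
running_mean = np.cumsum(z) / np.arange(1, len(z) + 1)
```

Setting q > 0 keeps the gain from shrinking to zero, so the estimate keeps responding to new measurements; that is the sense in which the filter tracks an evolving state rather than a constant.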

More from @docmilanfar

Nov 10
Integral geometry is a beautiful topic bridging geometry, probability & statistics

Say you have a curve with any shape, possibly even self-intersecting. How can you measure its length?

This has many applications - curve could be a strand of DNA or a twisted length of wire

1/n Image
A curve is a collection of tiny segments. Measure each segment & sum. You can go further: make the segments so small they are essentially points, count the red points

A practical way to do this: drop many lines, or a dense grid, intersecting the shape & count intersections

2/n Image
The curve's length is the integral of the number of intersections n(ρ,θ) of all lines (in polar coords) with the curve (counting multiplicities). This is the beautiful Crofton formula:

Length = 1/2 ∫∫ n(ρ,θ) dρ dθ

The 1/2 is there because oriented lines are a double cover of un-oriented lines

3/n Image
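
The line-counting recipe can be checked numerically. The sketch below is an illustration under stated assumptions, not the thread's own code: each un-oriented line is written as x cos θ + y sin θ = p with θ in [0, π) and offset p in [-R, R], lines are drawn uniformly at random, and the curve is a polyline approximating the unit circle. The function name `crofton_length` and its parameters are hypothetical.

```python
import numpy as np

def crofton_length(vertices, n_lines=20000, seed=0):
    """Monte Carlo estimate of a polyline's length via line-crossing counts."""
    rng = np.random.default_rng(seed)
    # Offsets beyond the farthest vertex never hit the curve; pad by 1.5x.
    R = 1.5 * np.linalg.norm(vertices, axis=1).max()
    theta = rng.uniform(0.0, np.pi, n_lines)   # direction of the line's normal
    p = rng.uniform(-R, R, n_lines)            # signed offset from the origin
    normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Signed distance of every vertex to every line: shape (n_lines, n_vertices).
    f = normals @ vertices.T - p[:, None]
    # A segment crosses a line when its endpoints lie on opposite sides.
    crossings = (f[:, :-1] * f[:, 1:] < 0).sum(axis=1)
    # Integral over (theta, p) = area of the parameter domain * mean count.
    domain_area = np.pi * 2.0 * R
    return 0.5 * domain_area * crossings.mean()

# Closed polyline approximating the unit circle (true length 2*pi).
t = np.linspace(0.0, 2.0 * np.pi, 201)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
estimate = crofton_length(circle)
```

The estimate is noisy because the lines are random; a dense regular grid in (θ, p) would be a reasonable alternative.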
Read 6 tweets
Sep 24
Smoothing splines fit a function to data as the solution of a regularized least-squares optimization problem.

But it’s also possible to do it in one shot with an unusually shaped kernel (see figure)

Is it possible to solve other optimization problems this way? Surprisingly yes

1/n Image
This is just one instance of how one can “kernelize” an optimization problem. That is, approximate the solution of an optimization problem in just one step by constructing and applying a kernel once to the input

Given some conditions you can do it much more generally

2/n Image
If you specialize the regularization to be of the form
φ(x) = ρ( ||Ax|| ) where A = R(|i-j|) is stationary & isotropic, this gives tidy conversions between φ(x) and the kernel K(x).

3/n Image
Read 4 tweets
Sep 18
Mean-shift iteratively moves points towards regions of higher density. It does so by placing a kernel at each data point, calculating the mean of the data points within that window, and shifting points towards this mean until convergence. Look familiar?

1/n
(Animation @gabrielpeyre)
The first term on the right hand side of the ODE has the form of a pseudo-linear denoiser f(x) = W(x) x. A weighted average of the points where the weights depend on the data. The overall mean-shift process is a lot like a residual flow:

d/dt x(t) = f(x(t)) - x(t)

2/n Image
The residual on the RHS is an approximation of the “score” - the gradient of the log of the empirical density of x - making it a gradient flow

d/dt x(t) ≈ ∇ log p̂(x(t))

So mean-shift a) estimates the empirical density & b) flows points to nearby peaks. Similarly to flow-matching & InDI

3/n
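
A minimal sketch of the flow d/dt x(t) = f(x(t)) - x(t), discretized with forward Euler, may help here. It assumes a Gaussian kernel for the weights W(x); the names `mean_shift_step` and `mean_shift` and the parameters `bandwidth` and `step` are hypothetical, and the two-cluster data is made up for illustration.

```python
import numpy as np

def mean_shift_step(x, data, bandwidth):
    """Pseudo-linear denoiser f(x) = W(x) @ data with Gaussian weights."""
    # Squared distances between current points and the data: (n_x, n_data).
    d2 = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    w /= w.sum(axis=1, keepdims=True)  # each row is a weighted average
    return w @ data

def mean_shift(data, bandwidth=0.5, step=1.0, n_iter=100):
    """Euler discretization of dx/dt = f(x) - x; step=1 is classic mean-shift."""
    x = data.copy()
    for _ in range(n_iter):
        x = x + step * (mean_shift_step(x, data, bandwidth) - x)
    return x

# Two well-separated clusters; each point should flow to its cluster's peak.
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(-2.0, 0.3, size=(50, 2)),
    rng.normal(2.0, 0.3, size=(50, 2)),
])
modes = mean_shift(data)
```

With step = 1 each update is x ← f(x), the usual mean-shift iteration; smaller steps follow the same flow more slowly.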
Read 4 tweets
Sep 5
Random matrices are very important in modern statistics and machine learning, not to mention physics

A model about which much less is known is uniformly sampled matrices from the set of doubly stochastic matrices: Uniformly Distributed Stochastic Matrices

A thread -

1/n
First, what are doubly stochastic matrices?
Non-negative matrices whose row & column sums=1.

The set of doubly stochastic matrices is also known as the Birkhoff polytope: an (n−1)² dimensional convex polytope in ℝⁿˣⁿ with extreme points being permutation matrices.

2/n Image
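
As a small illustration of the polytope description, the sketch below builds a doubly stochastic matrix as a random convex combination of permutation matrices (the extreme points) and lets you check its row and column sums. Note the assumption: this is only a convenient way to land inside the Birkhoff polytope and is *not* a uniform sample from it. The function name and parameters are hypothetical.

```python
import numpy as np

def random_convex_combo_of_permutations(n, k, seed=0):
    """Convex combination of k random n-by-n permutation matrices."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(k))   # non-negative, sums to 1
    M = np.zeros((n, n))
    for w in weights:
        P = np.eye(n)[rng.permutation(n)]  # a random permutation matrix
        M += w * P
    return M

M = random_convex_combo_of_permutations(n=5, k=20)
row_sums = M.sum(axis=1)
col_sums = M.sum(axis=0)
```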
The extreme points of the Birkhoff polytope (permutations) are sparse matrices, but a typical matrix sampled from inside the polytope is, by contrast, very dense

Since rows and columns are exchangeable, the entries of a sampled matrix have the same marginal distribution.

3/n Image
Read 8 tweets
Sep 1
The perpetually undervalued least-squares:

minₓ‖y−Ax‖²

can teach a lot about some complex ideas in modern machine learning including overfitting & double-descent.

Let's assume A is n-by-p. So we have n data points and p parameters

1/10 Image
If n ≥ p (“under-fitting” or “over-determined” case) and A has full column rank, the solution is

x̃ = (AᵀA)⁻¹ Aᵀ y

But if n < p (“over-fitting” or “under-determined” case), there are infinitely many solutions that give *zero* training error. We pick the minimum-norm (min ‖x‖²) solution:

x̃ = Aᵀ(AAᵀ)⁻¹ y

2/10
In either case, the solution can be compactly written in terms of the SVD of A:

A = USVᵀ

where U & V are orthogonal matrices of size n×n & p×p, and S is n×p & contains k nonzero diagonal elements σᵢ, i = 1 to k

x̃ = ∑ᵢ σᵢ⁻¹ vᵢ uᵢᵀ y

where σᵢ are the nonzero singular values of A, and uᵢ & vᵢ are the corresponding columns of U & V

3/10 Image
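
A quick numerical check, as a sketch on made-up random data, that the SVD expression reproduces both closed-form solutions. The helper name `svd_solution` and the tolerance `tol` are hypothetical; the random matrices are assumed to have full rank, which holds with probability one for Gaussian entries.

```python
import numpy as np

def svd_solution(A, y, tol=1e-10):
    """x = sum_i (1/sigma_i) v_i (u_i^T y) over the nonzero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol
    return Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])

rng = np.random.default_rng(0)

# Over-determined case (n >= p): normal-equations solution.
A_tall = rng.normal(size=(8, 3))
y_tall = rng.normal(size=8)
x_tall = np.linalg.solve(A_tall.T @ A_tall, A_tall.T @ y_tall)

# Under-determined case (n < p): minimum-norm solution.
A_wide = rng.normal(size=(3, 8))
y_wide = rng.normal(size=3)
x_wide = A_wide.T @ np.linalg.solve(A_wide @ A_wide.T, y_wide)

x_tall_svd = svd_solution(A_tall, y_tall)
x_wide_svd = svd_solution(A_wide, y_wide)
```

In the under-determined case the solution also interpolates the data, so `A_wide @ x_wide` should match `y_wide` up to round-off.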
Read 10 tweets
Aug 18
Two basic concepts are often conflated:

Sample Standard Deviation (SD) vs Standard Error (SE)

Say you want to estimate m=𝔼(x) from N independent samples xᵢ. A typical choice is the average or "sample" mean m̂

But how stable is this? That's what Standard Error tells you:

1/6 Image
Since m̂ is itself a random variable, we need to quantify the uncertainty around it too: this is what the Standard Error does.

The Standard Error is *not* the same as the spread of the samples - that's the Standard Deviation (SD) - but the two are closely related:

2/6 Image
But this expression isn't practical because we don't know √var(xᵢ) either

Not knowing √var(xᵢ), we are forced to estimate that too. Here, we typically just plug in the (sample) Standard Deviation for it. Therefore:

Standard Error ≈ Sample Standard Deviation/√N

3/6 Image
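
The relation can be checked by simulation. This is a sketch with made-up Gaussian data; `N`, `trials`, and `sigma` are illustrative values. It compares the spread of many independent sample means against the plug-in estimate computed from a single sample of size N.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma = 100, 20000, 2.0

# Each row is one experiment of N independent samples.
samples = rng.normal(5.0, sigma, size=(trials, N))

# Spread of the sample mean across experiments: the "true" standard error.
sample_means = samples.mean(axis=1)
empirical_se = sample_means.std(ddof=1)

# What we can compute from a single experiment: sample SD / sqrt(N).
plug_in_se = samples[0].std(ddof=1) / np.sqrt(N)

# Reference value when the true variance is known.
theoretical_se = sigma / np.sqrt(N)
```

The plug-in value fluctuates from experiment to experiment because the sample SD is itself an estimate, but it should land close to the theoretical value of 0.2 used here.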
Read 7 tweets
