Tom Goldstein
Professor at UMD. AI security & privacy, algorithmic bias, foundations of ML. Follow me for commentary on state-of-the-art AI.

Jul 13, 2022, 10 tweets

SSIM has become a common loss function in computer vision. It is used to train monocular depth models for self-driving cars, invert GANs, and fit NeRF models to training images. The explosion of SSIM-based models raises a fundamental question: what the hell is SSIM? 🧵

SSIM measures the similarity between two images. Humans are insensitive to the absolute brightness/color of pixels, but very sensitive to the location of edges and textures. SSIM mimics human perception by focusing primarily on edge and textural similarities.

Here’s an example. The contrast adjustment between these two images of #IngridDaubechies makes them 20% different when measured using the 2-norm. But in the SSIM metric they are 98.5% similar (1.5% different).

We compute SSIM by breaking two images into small patches and comparing them patch-wise. Given a patch “x” from one image, and the corresponding patch “y” from the other, we compute summary statistics for the pixel intensities in each patch: the means, the standard deviations, and the covariance between the two patches.
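These per-patch statistics can be sketched in a few lines of numpy (`patch_stats` is an illustrative name, not any library's API):

```python
import numpy as np

def patch_stats(x, y):
    """Summary statistics for two corresponding patches x and y:
    the means, standard deviations, and the covariance between
    the pixel intensities of the two patches."""
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    # Cross-covariance between the two patches.
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()
    return mu_x, mu_y, sigma_x, sigma_y, sigma_xy

# Example: two 8x8 patches, where y is a brightened,
# lower-contrast version of x.
rng = np.random.default_rng(0)
x = rng.random((8, 8))
y = 0.5 * x + 0.1
print(patch_stats(x, y))
```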

We now compute the luminance similarity between the two patches. The score is near 0 if the brightness of the patches differs greatly, and near 1 if it is similar. The formula is scale invariant: multiplying both images by the same constant leaves the score unchanged.
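The luminance score is l(x, y) = (2·μx·μy + C1) / (μx² + μy² + C1), where μx and μy are the patch means. A minimal sketch:

```python
def luminance(mu_x, mu_y, c1=1e-8):
    """Luminance similarity of two patches with means mu_x, mu_y.

    Near 1 when the mean brightnesses agree, near 0 when one patch is
    much brighter; c1 is a small constant to avoid division by zero.
    """
    return (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)

print(luminance(0.5, 0.5))    # identical means -> 1.0
print(luminance(0.05, 5.0))   # very different brightness -> near 0
# Scale invariance: multiplying both means by 10 changes nothing.
print(luminance(0.2, 0.3), luminance(2.0, 3.0))
```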

Next we compute the contrast similarity score. This score is near 0 if one patch is much “flatter” than the other, and 1 if both have identical levels of contrast. The contrast score compares the amount of “texture” in the image patches. This formula is also scale invariant.
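The contrast score has the same shape as the luminance score, but uses the patch standard deviations: c(x, y) = (2·σx·σy + C2) / (σx² + σy² + C2). A sketch:

```python
def contrast(sigma_x, sigma_y, c2=1e-8):
    """Contrast similarity of two patches with standard deviations
    sigma_x, sigma_y: 1 for identical contrast, near 0 when one patch
    is much flatter than the other."""
    return (2 * sigma_x * sigma_y + c2) / (sigma_x**2 + sigma_y**2 + c2)

print(contrast(0.2, 0.2))   # identical contrast -> 1.0
print(contrast(0.01, 1.0))  # one patch nearly flat -> near 0
# Scale invariance again: scaling both images scales both sigmas.
print(contrast(0.1, 0.2), contrast(1.0, 2.0))
```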

Finally we compute the structure score, which is the correlation between pixel values in both patches. The score is high when both patches contain an edge with the same location and orientation, but low if the patches disagree on the location of an edge.
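The structure score is the normalized correlation s(x, y) = (σxy + C3) / (σx·σy + C3), which lies in [-1, 1]. A sketch with a synthetic edge patch:

```python
import numpy as np

def structure(x, y, c3=1e-8):
    """Structure score: the (regularized) correlation coefficient
    between the pixel values of patches x and y."""
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return (cov + c3) / (x.std() * y.std() + c3)

# An "edge" patch: left half dark, right half bright.
edge = np.zeros((8, 8))
edge[:, 4:] = 1.0
print(structure(edge, 0.5 * edge + 0.2))  # same edge location -> ~1
print(structure(edge, edge[:, ::-1]))     # edge location flipped -> ~ -1
```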

The overall SSIM score is the product of these three scores. A practical implementation has some small constants to prevent division by zero. The score is also typically averaged over all image patches. For deep learning, SSIM is also averaged over the R, G, and B channels.
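Putting the pieces together: with the standard constant choice C3 = C2/2, the product of the three scores collapses to the familiar two-factor SSIM formula. A simplified patch-averaged sketch (non-overlapping patches for brevity; the original paper uses an 11×11 Gaussian-weighted sliding window):

```python
import numpy as np

def ssim(img1, img2, patch=8, data_range=1.0):
    """Mean SSIM over non-overlapping patches of two grayscale images."""
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes contrast/structure terms
    h, w = img1.shape
    scores = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            x = img1[i:i + patch, j:j + patch]
            y = img2[i:i + patch, j:j + patch]
            mu_x, mu_y = x.mean(), y.mean()
            var_x, var_y = x.var(), y.var()
            cov = ((x - mu_x) * (y - mu_y)).mean()
            # With C3 = C2/2, luminance * contrast * structure
            # collapses to this standard two-factor form.
            num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
            den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
            scores.append(num / den)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
a = rng.random((32, 32))
print(ssim(a, a))        # identical images -> 1.0
print(ssim(a, 1.0 - a))  # inverted image -> much lower
```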

SSIM is used for self-supervised tasks that involve image matching. An example is stereo: one wants to learn a displacement (disparity) field that maps one image onto the other. When one image is warped through this field, it should "look like" the other, as measured using SSIM.
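A common way to turn SSIM into a training loss in such pipelines is (1 − SSIM)/2, which is 0 for a perfect match. A minimal single-window sketch (`ssim_global` and `photometric_loss` are illustrative names, not a specific library's API; real pipelines compute SSIM over local windows and warp with a learned field rather than a fixed shift):

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    # Single-window SSIM (whole image as one patch) for brevity.
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (x.var() + y.var() + c2)
    return num / den

def photometric_loss(target, warped):
    # 0 for a perfect match, larger when the warped image
    # fails to "look like" the target.
    return (1.0 - ssim_global(target, warped)) / 2.0

rng = np.random.default_rng(1)
img = rng.random((16, 16))
shifted = np.roll(img, 2, axis=1)  # stand-in for warping through a field
print(photometric_loss(img, img))      # perfect match -> 0.0
print(photometric_loss(img, shifted))  # misaligned -> positive
```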

SSIM, which is short for “structural similarity,” was introduced in a 2004 paper by Wang, Bovik, Sheikh, and Simoncelli with about 38K citations. The paper saw an upward trend in citation rate starting in 2018, probably because of the popularity of self-supervised tasks.
