SSIM has become a common loss function in computer vision. It is used to train monocular depth models for self-driving cars, invert GANs, and fit NeRF models to training images. The explosion of SSIM-based models raises a fundamental question: what the hell is SSIM? 🧵
SSIM measures the similarity between two images. Humans are insensitive to the absolute brightness/color of pixels, but very sensitive to the location of edges and textures. SSIM mimics human perception by focusing primarily on edge and textural similarities.
Here’s an example. The contrast adjustment between these two images of #IngridDaubechies makes them 20% different when measured using the 2-norm. But in the SSIM metric they are 98.5% similar (1.5% different).
We compute SSIM by breaking two images into small patches. We then compare the images patch-wise. Given a patch “x” from one image, and the corresponding patch “y” from another, we compute the following summary statistics for the pixel intensities in each patch.
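The statistics themselves appeared as an image in the original tweet; written out, they are the standard per-patch mean, standard deviation, and covariance from the SSIM paper:

    \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
    \sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu_x)^2, \qquad
    \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y),

with \mu_y and \sigma_y defined the same way for patch y, and N the number of pixels in a patch.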
We now compute the luminance similarity between the two patches using the formula below. We get a luminance score near 0 if the mean brightnesses of the patches differ greatly, and near 1 if they are similar. The formula is scale invariant: multiplying both patches by the same constant leaves the score unchanged.
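The luminance formula, shown as an image in the original tweet, is (in the paper's notation, with a small stabilizing constant C_1):

    l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}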
Next we compute the contrast similarity score. This score is 0 if one patch is much “flatter” than the other, and 1 if both have identical levels of contrast. The contrast score compares the amount of “texture” in the image patches. This formula is also scale invariant.
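The contrast term compares the patch standard deviations, again with a small stabilizing constant:

    c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}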
Finally we compute the structure score, which is the correlation between pixel values in both patches. The score is high when both patches contain an edge with the same location and orientation, but low if the patches disagree on the location of an edge.
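The structure term is a stabilized correlation coefficient between the patches, where C_3 is conventionally set to C_2 / 2:

    s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}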
The overall SSIM score is the product of these three scores. A practical implementation has some small constants to prevent division by zero. The score is also typically averaged over all image patches. For deep learning, SSIM is also averaged over the R, G, and B channels.
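With the conventional choice C_3 = C_2 / 2, the product of the three terms simplifies to the familiar per-patch formula

    \mathrm{SSIM}(x, y) = l(x, y)\, c(x, y)\, s(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}

A minimal NumPy sketch of this per-patch computation is below. It assumes 8-bit grayscale patches and the constants from the original paper, and it skips the Gaussian-weighted sliding window that standard implementations use, so treat it as illustrative rather than a drop-in replacement for a library routine.

    import numpy as np

    def ssim_patch(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
        """SSIM between two same-sized grayscale patches with pixel values in [0, 255]."""
        x = x.astype(np.float64)
        y = y.astype(np.float64)
        mu_x, mu_y = x.mean(), y.mean()
        var_x, var_y = x.var(ddof=1), y.var(ddof=1)          # unbiased variances, as in the paper
        cov_xy = ((x - mu_x) * (y - mu_y)).sum() / (x.size - 1)
        num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)     # luminance term * (contrast * structure)
        den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
        return num / den

Averaging ssim_patch over all patches (and, for RGB inputs, over the three channels) gives the image-level score described above.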
SSIM is used for self-supervised tasks that involve image matching. An example is stereo matching: one wants to learn a displacement field that maps one image onto the other. When one image is warped through this field, it should "look like" the other, as measured using SSIM.
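As a rough illustration (not something spelled out in this thread), a common way to turn SSIM into a matching loss is to blend a (1 - SSIM)/2 term with an L1 term. The helper below builds on the ssim_patch sketch above, and the 0.85 weighting is just one conventional choice, not a value from this thread.

    def photometric_loss(warped, target, alpha=0.85):
        """Self-supervised matching loss: small when the warped image 'looks like' the target."""
        ssim = ssim_patch(warped, target)                    # SSIM sketch from above, in [-1, 1]
        l1 = np.abs(warped.astype(np.float64) - target.astype(np.float64)).mean()
        return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1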
SSIM, which is short for “structural similarity,” was introduced in a 2004 paper by Wang, Bovik, Sheikh, and Simoncelli with about 38K citations. The paper saw an upward trend in citation rate starting in 2018, probably because of the popularity of self-supervised tasks.

More from @tomgoldsteincs

Jul 5
Just how much have language models grown in the last 4 years? Let's have a look. In 2018, the puny BERT “large” model premiered with a measly 340M parameters. It can be trained on a single 8xA100 node in 5 days. That costs $2K on AWS - almost free by LLM standards! 🧵
Then came Facebook’s equally tiny RoBERTa model. Built on BERT-large, but with mods for faster mixed-precision training, it completed 40 epochs on its beefed up training set using 1000 GPUs for a week. You could train this on the cloud for $350K. NBD.
GPT-3 has a modest but respectable 175B parameters. It was trained with roughly 1500 GPUs for 2 months. On AWS, you could train this for a cool $3M.
Read 9 tweets
Feb 4
"Plug-In" inversion directly produces images from ViTs and CNNs at the pixel level, with no GAN prior. We then see what networks really care about, not just what the GANs want us to see. Here's a few examples. First, I'll pull you in with these tugboats...
My student @aminjuun has been working like a dog on this project. This dog, specifically.
Lately, it feels like there's been a volcano of research on vision transformers. Here's what the ViTs think of that...
Read 4 tweets
Jan 21
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
Frank Rosenblatt's hardware implementation of perceptrons solved very simple OCR problems. After it was proved that shallow perceptrons could not solve certain logic problems, the community soured on this approach, causing the winter of '69.
This caused a turning away from vision problems and towards text systems (e.g. ELIZA) and planning (e.g. A* search). In 1973, James Lighthill wrote a report for the British government claiming that progress on language systems and robotics had stalled, causing the second winter.
Read 20 tweets
Jan 7
If you want to understand why TensorFlow is the way it is, you have to go back to the ancient times. In 2012, Google created a system called DistBelief that laid out their vision for how large-scale training would work. It was the basis for TF. 🧵
research.google/pubs/pub40565/
In DistBelief, both models and datasets were split across nodes. Worker nodes update only a subset of parameters at a time, and communicate parameters asynchronously to a "parameter server". A "coordinator" orchestrates the independent model, data, and parameter nodes.
Here's a description of the *simplest* operation mode of this system, taken directly from Jeff Dean's paper.
Read 8 tweets
Jul 9, 2021
My lab is building neural nets that emulate human “thinking” processes. They increase their intelligence after training by increasing their compute budget. By doing so, a net trained only on “easy” chess puzzles can solve “hard” chess puzzles without having ever seen one…
Deep thinking networks are inspired by the human mind: they represent problems in their working memory, and then iteratively apply a recurrent unit to manipulate the representation until the problem is solved. The recurrent unit encodes a scalable problem-solving algorithm.
After training on small/easy problems, the power of the network can be increased at test time just by turning up the number of iterations of the recurrent unit. By “thinking for longer” the network assembles its knowledge into more complex strategies.
Read 6 tweets
