SSIM has become a common loss function in computer vision. It is used to train monocular depth models for self-driving cars, invert GANs, and fit NeRF models to training images. The explosion of SSIM-based models raises a fundamental question: what the hell is SSIM? 🧵
SSIM measures the similarity between two images. Humans are insensitive to the absolute brightness/color of pixels, but very sensitive to the location of edges and textures. SSIM mimics human perception by focusing primarily on edge and textural similarities.
Here’s an example. The contrast adjustment between these two images of #IngridDaubechies makes them 20% different when measured using the 2-norm. But in the SSIM metric they are 98.5% similar (1.5% different).
We compute SSIM by breaking two images into small patches. We then compare the images patch-wise. Given a patch “x” from one image, and the corresponding patch “y” from another, we compute the following summary statistics for the pixel intensities in each patch.
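For reference, in the simplest (unweighted) form these statistics are the patch means, standard deviations, and cross-covariance; the 2004 paper actually uses a Gaussian-weighted window, but the idea is the same:

\[ \mu_x = \frac{1}{N}\sum_i x_i, \qquad \sigma_x^2 = \frac{1}{N-1}\sum_i (x_i - \mu_x)^2, \qquad \sigma_{xy} = \frac{1}{N-1}\sum_i (x_i - \mu_x)(y_i - \mu_y), \]

and likewise \( \mu_y, \sigma_y \) for patch y.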
We now compute the luminance similarity between the two patches using the formula below. The luminance score is near 0 if the brightnesses of the patches differ greatly, and near 1 if they are similar. The formula is scale invariant: multiplying both images by a constant leaves the score unchanged.
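The standard luminance term from the 2004 paper is

\[ l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \]

where \( C_1 \) is one of the small stabilizing constants mentioned later in the thread.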
Next we compute the contrast similarity score. This score is near 0 if one patch is much “flatter” than the other, and near 1 if both have the same level of contrast. The contrast score compares the amount of “texture” in the two image patches. Like the luminance term, it is unchanged when both images are scaled by a constant.
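The standard contrast term swaps the means for standard deviations:

\[ c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}. \]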
Finally we compute the structure score, which is the correlation between pixel values in both patches. The score is high when both patches contain an edge with the same location and orientation, but low if the patches disagree on the location of an edge.
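The structure term is a stabilized correlation coefficient between the two patches, with \( C_3 \) conventionally set to \( C_2 / 2 \):

\[ s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}. \]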
The overall SSIM score is the product of these three scores. A practical implementation has some small constants to prevent division by zero. The score is also typically averaged over all image patches. For deep learning, SSIM is also averaged over the R, G, and B channels.
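Here is a minimal numpy sketch of everything above, using non-overlapping patches and the conventional constants. Treat it as an illustration of the definition rather than a drop-in replacement for library implementations (skimage, kornia, etc.), which use a sliding Gaussian window:

```python
# Minimal SSIM sketch: non-overlapping patches, uniform (not Gaussian) weighting,
# standard constants C1, C2, C3 = C2 / 2. Library implementations will give
# slightly different numbers.
import numpy as np

def ssim(img1, img2, patch=8, data_range=1.0):
    """img1, img2: float arrays of shape (H, W) or (H, W, C) with values in [0, data_range]."""
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    C3 = C2 / 2

    if img1.ndim == 2:                      # promote grayscale to a single channel
        img1, img2 = img1[..., None], img2[..., None]

    scores = []
    H, W, Cn = img1.shape
    for c in range(Cn):                     # average over R, G, B channels
        for i in range(0, H - patch + 1, patch):
            for j in range(0, W - patch + 1, patch):
                x = img1[i:i+patch, j:j+patch, c].ravel()
                y = img2[i:i+patch, j:j+patch, c].ravel()
                mu_x, mu_y = x.mean(), y.mean()
                sig_x, sig_y = x.std(ddof=1), y.std(ddof=1)
                sig_xy = np.cov(x, y, ddof=1)[0, 1]

                lum = (2 * mu_x * mu_y + C1) / (mu_x**2 + mu_y**2 + C1)
                con = (2 * sig_x * sig_y + C2) / (sig_x**2 + sig_y**2 + C2)
                struct = (sig_xy + C3) / (sig_x * sig_y + C3)
                scores.append(lum * con * struct)   # product of the three terms
    return float(np.mean(scores))                   # average over patches and channels
```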
SSIM is used for self-supervised tasks that involve image matching. An example is self-supervised stereo: one wants to learn a correspondence field (a disparity or velocity field) that maps one image onto the other. When one image is warped through this field, it should "look like" the other, as measured using SSIM.
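As a rough sketch of that kind of loss (reusing the ssim() function from the snippet above; the 0.85 SSIM-to-L1 mixing weight is a common convention in the depth-estimation literature, not something stated in this thread, and real training code would use a differentiable PyTorch/JAX version rather than numpy):

```python
# Hedged sketch of a self-supervised matching loss. Reuses np and ssim() from
# the previous snippet; alpha = 0.85 is a common convention, not a rule.
def matching_loss(warped, target, alpha=0.85):
    """warped: one image resampled through the predicted field; target: the image it should match."""
    ssim_term = (1.0 - ssim(warped, target)) / 2.0   # 0 when the images agree
    l1_term = float(np.abs(warped - target).mean())
    return alpha * ssim_term + (1.0 - alpha) * l1_term
```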
SSIM, which is short for “structural similarity,” was introduced in a 2004 paper by Wang, Bovik, Sheikh, and Simoncelli with about 38K citations. The paper saw an upward trend in citation rate starting in 2018, probably because of the popularity of self-supervised tasks.
Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time.
We trained on 800B tokens 👇
Huginn was built for reasoning from the ground up, not just fine-tuned on CoT.
We built our reasoning system by putting a recurrent block inside the LLM. On a forward pass, we loop this block a random number of times. By looping it more times, we dial up compute.
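A cartoon of that idea in PyTorch (purely illustrative: the module choice, sizes, and loop-count distribution here are assumptions, not the actual Huginn architecture):

```python
# Cartoon of a looped recurrent block; all names and sizes are hypothetical.
import random
import torch
import torch.nn as nn

class RecurrentCore(nn.Module):
    def __init__(self, d_model=512, nhead=8, min_loops=1, max_loops=16):
        super().__init__()
        # One transformer block that gets re-applied in a loop.
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.min_loops, self.max_loops = min_loops, max_loops

    def forward(self, h, num_loops=None):
        # During training, the number of loops is sampled at random;
        # at test time you can dial it up to spend more compute.
        if num_loops is None:
            num_loops = random.randint(self.min_loops, self.max_loops)
        for _ in range(num_loops):
            h = self.block(h)
        return h

core = RecurrentCore()
h = torch.randn(2, 10, 512)      # (batch, sequence, hidden) activations
print(core(h).shape)             # spend more compute at inference: core(h, num_loops=64)
```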
Recurrence improves reasoning a lot. To show this, we did a comparison with a standard architecture.
We train a standard 3.5B LLM from scratch on 180B tokens. Then we train a recurrent 3.5B model on the same tokens.
LLMs have low randomness: if you ask the same thing twice you get similar responses. Generator prompts are a way to boost the randomness of LLMs.
Using a few generator prompts, I had Gemini write an entire instruction tuning dataset from scratch. It outperforms popular datasets.
Let’s start with a toy example of why we need generator prompts. Suppose I want a list of different colors. So I feed this prompt to Gemini 1000 times. This does poorly - I only get 33 unique outputs from 1000 runs. I need more randomness.
A generator prompt asks the model to enumerate a long list of execution paths, and then randomizes which paths get chosen.
Here's an example. The numbers 23 and 76 are randomized each time the prompt is called.
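A rough sketch of the pattern (the prompt wording and the 100-color list here are illustrative assumptions, not the exact prompt from the thread):

```python
# Hedged sketch of a generator prompt: the wording and the 100-item list
# are illustrative assumptions, not the exact prompt from the thread.
import random

TEMPLATE = (
    "Brainstorm a numbered list of 100 distinct colors. "
    "Then output ONLY color number {a} and color number {b} from your list, nothing else."
)

def generator_prompt():
    a, b = random.sample(range(1, 101), 2)   # re-randomized on every call
    return TEMPLATE.format(a=a, b=b)

# Each call produces a different prompt, forcing the model down a different
# "execution path" and boosting the diversity of its outputs.
print(generator_prompt())
```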
This prompt gives me 782 unique outputs from 1000 runs.
The Llama2 model is pretty impressive. Human evaluators rank it slightly *better* than ChatGPT on a range of things (excluding code and reasoning).
Here's a short TL;DR on what Meta did to improve the state of the art 🧵
Llama1: Small models (7B & 13B) were trained on 1 trillion tokens. Large models saw 1.4T tokens.
Llama2: All models trained on 2T tokens. This means the small models are "over trained" beyond what the scaling laws recommend, resulting in great performance for small models!
As a result of the long training runs, Llama2 beats other major open-source models at most academic benchmarks. Their 7B model is WAY better than other 7B options on all tasks except code.
Nvidia’s AI products follow a weird reverse Moore’s law: every two years, you get half as many FLOPS for your money. This is the opposite of the rest of the chip market 📈
With the H100 release, Nvidia had to reverse course.
A 🧵 on Nvidia losing its grip on the GPU market.
Let’s focus in on the machine learning GPUs. You can see the value drop over time, until the H100 created an uptick. Note: I’m using today’s price for each card, but a similar downward trend also holds for the release prices.
The drop is because of monopoly power and clever market segmentation.
Example: The “server-grade” V100 is a minor variant of the 2080ti gaming card. Nvidia sells it to institutions instead of gamers, charging 5X more for the V100. This means huge profits. lambdalabs.com/blog/best-gpu-…
Training an LLM takes about 1 trillion words. That’s about 30,000 years of typing.
But where does this data come from?
And what does this have to do with the Reddit protests?
Here’s how OpenAI trains models on “the entire internet.” 🧵📜
Much of what we know about OpenAI is from urban legends. But the GPT3 paper does have a table showing their data sources. The cliché that LLMs are trained on “the whole internet” comes from the use of CommonCrawl.
CommonCrawl (CC) is a non-profit that has been scraping the internet with bots since 2008, trying to record everything. About 90% of CC is HTML, CSS, and scripts. The remaining 10% is text, but much of it is junk that has to be filtered out to get a clean dataset.