Micah Goldblum
Apr 20 · 17 tweets · 4 min read
🚨Here’s an intuitive explanation for why training on lots and lots of data creates emergent properties, for instance math and reasoning, in large language models like #GPT-4 and #ChatGPT 🚨 1/17
Let’s start with the basics. Real-world data is full of patterns and structure. This structure allows us to describe things with simple rules. We exploit this fact all the time, for example to derive laws of physics or differential equations. 2/17
Complex physical phenomena can be described by surprisingly short mathematical equations. Reasoning and human language also follow rules. All kinds of datasets, from images to text, contain patterns which allow us to compress them. 3/17
So real-world stuff, from datasets to math, can be described succinctly. Now, imagine I give you a few data points, say with labels generated by some chaotic function f. If I ask you to write a short program that outputs their values, you’d write up a small lookup table. 4/17
However, if I gave you enough data points, eventually it’s much much shorter for you to just write down the function f itself. Short and efficient descriptions represent the core underlying structure rather than rote memorization. 5/17
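Here's a toy sketch of that crossover (my own illustration, not from the thread): data generated by a chaotic map, compared against a program that simply memorizes the outputs. The map, constants, and helper names are all made up for illustration.

```python
# Toy sketch: compare the length of a "memorize everything" program with a
# "write down the rule" program as the number of data points n grows.
# The data are iterates of the chaotic logistic map x_{t+1} = 3.9 * x * (1 - x).

def logistic_iterates(n, x0=0.5, r=3.9):
    """Generate n iterates of the chaotic logistic map."""
    xs, x = [], x0
    for _ in range(n):
        xs.append(x)
        x = r * x * (1 - x)
    return xs

def lookup_table_program(values):
    """A program that memorizes the data; its length grows linearly with n."""
    return "print(" + repr(values) + ")"

def rule_program(n):
    """A program that encodes the generating rule; its length is ~constant in n."""
    return (
        "x = 0.5\n"
        f"for _ in range({n}):\n"
        "    print(x)\n"
        "    x = 3.9 * x * (1 - x)\n"
    )

for n in [3, 10, 100, 1000]:
    data = logistic_iterates(n)
    print(f"n={n:5d}  lookup-table program: {len(lookup_table_program(data)):7d} chars"
          f"   rule program: {len(rule_program(n))} chars")
# For tiny n the lookup table is shorter; as n grows, writing down the rule
# wins by a huge margin, and its length barely changes.
```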
ML problems, like classification or generation, can be performed by surprisingly short programs. When your training dataset is small, it’s easy to memorize the training data outright, and a lookup table is exactly that kind of memorization. 6/17
But when the dataset gets huge, it’s actually less complex to just learn the real rules than it is to memorize! Processes like reasoning or writing text are highly structured and follow patterns. So if we fit our training data efficiently, we’ll learn those structures too. 7/17
Here are two ingredients our models need in order to learn these structures: (1) They need to have the power to represent the structure. (2) Importantly, they need to have a simplicity bias that prefers efficient (low-complexity) solutions. 8/17
If they fail (1), they have no hope of learning math or reasoning, so we need flexible models. For example, current architectures can’t perform multiplication well. If they fail (2), they will just memorize their training data without learning anything of substance. 9/17
People often think that while bigger LLMs may better satisfy (1), they must also have an easier time memorizing or overfitting and hence fail (2). In reality, it’s entirely possible for a larger model to have an even stronger low-complexity bias than a smaller one! 10/17
It turns out that neural networks of all sorts are strongly biased towards simple, or compressible, functions. There is a whole field of PAC-Bayes generalization theory built around this idea which can mathematically explain why neural networks perform so well. 11/17
In fact, neural networks (or any other model for that matter) which are sufficiently compressible are formally guaranteed to generalize well to new and unseen test samples. 12/17
arxiv.org/abs/2211.13609
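For flavor, here is a generic Occam-style compression bound of the kind this line of work builds on. This is a textbook-style statement, not the exact theorem from the paper above:

```latex
% Generic Occam-style bound (textbook flavor, not the paper's exact theorem):
% if hypothesis h can be written down with a prefix-free code of |c(h)| bits,
% then with probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all such h,
\[
  R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{|c(h)|\,\ln 2 \;+\; \ln\tfrac{1}{\delta}}{2n}},
\]
% so the fewer bits it takes to describe the model, the smaller the gap between
% its training error \hat{R}(h) and its test error R(h).
```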
In the future, we need to design models that can efficiently represent patterns like mathematics, reasoning, and language; those models will then learn these patterns *automatically* from large-scale datasets. Our current models represent lots of things inefficiently. 13/17
Some problems, like multiplication, can’t be solved efficiently by transformers. And our current LLMs spend an amount of compute that depends only on input/output lengths, which prevents them from solving problems like chess, where even a small board takes enormous compute to solve. 14/17
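A rough back-of-the-envelope sketch of that fixed-compute point (my own numbers and helper function, not from the thread): a decoder-only transformer’s forward pass costs roughly 2 FLOPs per parameter per token, so its total compute is set by sequence lengths alone, not by how hard the question is.

```python
# Back-of-the-envelope sketch (rule-of-thumb estimate, illustrative numbers):
# a transformer's forward-pass compute depends only on how many tokens it
# processes, roughly ~2 FLOPs per parameter per token (ignoring attention terms).

def forward_flops(n_params: float, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate forward-pass FLOPs: ~2 * parameters * total tokens processed."""
    return 2.0 * n_params * (prompt_tokens + output_tokens)

n_params = 7e9  # hypothetical 7B-parameter model (made-up size for illustration)

easy = forward_flops(n_params, prompt_tokens=10, output_tokens=5)  # "What is 7 * 8?"
hard = forward_flops(n_params, prompt_tokens=80, output_tokens=5)  # "Find the winning move in this chess position: ..."

print(f"arithmetic query: ~{easy:.2e} FLOPs")
print(f"chess query:      ~{hard:.2e} FLOPs")
# The budgets differ only because the prompts differ in length: the model cannot
# "think longer" about the chess position, even if solving it actually requires
# far more computation than one forward pass provides.
```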
Nonetheless, even previous generations of LLMs learned complex structures beyond memorizing their training data, such as in-context learning and induction heads. 15/17
transformer-circuits.pub/2022/in-contex…
Check out our paper, with @m_finzi, Keefer Rowan, @andrewgwils, where we show just how important simplicity bias, formalized using Kolmogorov complexity, is for machine learning. The paper is easy to approach for all audiences! 16/17
arxiv.org/abs/2304.05366
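To give the flavor of that formalization (my paraphrase of the general idea, not a quote from the paper): a simplicity bias can be written as a prior that puts exponentially more mass on hypotheses with low Kolmogorov complexity.

```latex
% A simplicity bias written as a prior (paraphrase of the general idea):
% hypotheses get prior mass that decays exponentially in their Kolmogorov
% complexity K(h), i.e. in the length of their shortest description,
\[
  P(h) \;\propto\; 2^{-K(h)},
\]
% so that among hypotheses fitting the training data equally well, the
% shortest description (the rule, not the lookup table) dominates the posterior.
```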
With enough text, the shortest program that reproduces it is one that first builds an AGI and then uses it to write the text! 🖐️🎤 17/17

