Micah Goldblum
Apr 20 · 17 tweets · 4 min read
🚨Here’s an intuitive explanation for why training on lots and lots of data creates emergent properties, for instance math and reasoning, in large language models like #GPT-4 and #ChatGPT 🚨 1/17
Let’s start with the basics. Real-world data is full of patterns and structure. This structure allows us to describe things with simple rules. We exploit this fact all the time, for example to derive laws of physics or differential equations. 2/17
Complex physical phenomena can be described by surprisingly short mathematical equations. Reasoning and human language also follow rules. All kinds of datasets, from images to text, contain patterns which allow us to compress them. 3/17
So real-world stuff, from datasets to math, can be described succinctly. Now, imagine I give you a few data points, say with labels generated by some chaotic function f. If I ask you to write a short program that outputs their values, you’d write up a small lookup table. 4/17
However, if I gave you enough data points, eventually it’s much much shorter for you to just write down the function f itself. Short and efficient descriptions represent the core underlying structure rather than rote memorization. 5/17
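Here's a toy sketch of that crossover (my own illustration, not from the thread): data generated by a chaotic map, compared against a program that simply memorizes the outputs. The map, constants, and helper names are all made up for illustration.

```python
# Toy sketch: compare the length of a "memorize everything" program with a
# "write down the rule" program as the number of data points n grows.
# The data are iterates of the chaotic logistic map x_{t+1} = 3.9 * x * (1 - x).

def logistic_iterates(n, x0=0.5, r=3.9):
    """Generate n iterates of the chaotic logistic map."""
    xs, x = [], x0
    for _ in range(n):
        xs.append(x)
        x = r * x * (1 - x)
    return xs

def lookup_table_program(values):
    """A program that memorizes the data; its length grows linearly with n."""
    return "print(" + repr(values) + ")"

def rule_program(n):
    """A program that encodes the generating rule; its length is ~constant in n."""
    return (
        "x = 0.5\n"
        f"for _ in range({n}):\n"
        "    print(x)\n"
        "    x = 3.9 * x * (1 - x)\n"
    )

for n in [3, 10, 100, 1000]:
    data = logistic_iterates(n)
    print(f"n={n:5d}  lookup-table program: {len(lookup_table_program(data)):7d} chars"
          f"   rule program: {len(rule_program(n))} chars")
# For tiny n the lookup table is shorter; as n grows, writing down the rule
# wins by a huge margin, and its length barely changes.
```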
ML problems, like classification or generation, can be performed by surprisingly short programs. When your training dataset is small, it’s easy to memorize the training data outright, and a lookup table is exactly that kind of memorization. 6/17
But when the dataset gets huge, it’s actually less complex to just learn the real rules than it is to memorize! Processes like reasoning or writing text are highly structured and follow patterns. So if we fit our training data efficiently, we’ll learn those structures too. 7/17
Here are two ingredients our models need in order to learn these structures: (1) They need to have the power to represent the structure. (2) Importantly, they need to have a simplicity bias that prefers efficient (low-complexity) solutions. 8/17
If they fail (1), they have no hope of learning math or reasoning, so we need flexible models. For example, current architectures can’t perform multiplication well. If they fail (2), they will just memorize their training data without learning anything of substance. 9/17
People often think that while bigger LLMs may better satisfy (1), they must also have an easier time memorizing or overfitting and hence fail (2). In reality, it’s entirely possible for a larger model to have an even stronger low-complexity bias than a smaller one! 10/17
It turns out that neural networks of all sorts are strongly biased towards simple, or compressible, functions. There is a whole field of PAC-Bayes generalization theory built around this idea which can mathematically explain why neural networks perform so well. 11/17
In fact, neural networks (or any other model for that matter) which are sufficiently compressible are formally guaranteed to generalize well to new and unseen test samples. 12/17
arxiv.org/abs/2211.13609
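For flavor, here is a generic Occam-style compression bound of the kind this line of work builds on. This is a textbook-style statement, not the exact theorem from the paper above:

```latex
% Generic Occam-style bound (textbook flavor, not the paper's exact theorem):
% if hypothesis h can be written down with a prefix-free code of |c(h)| bits,
% then with probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all such h,
\[
  R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{|c(h)|\,\ln 2 \;+\; \ln\tfrac{1}{\delta}}{2n}},
\]
% so the fewer bits it takes to describe the model, the smaller the gap between
% its training error \hat{R}(h) and its test error R(h).
```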
In the future, we need to design models that can efficiently represent patterns like mathematics, reasoning, and language; those models will then learn these patterns *automatically* from large-scale datasets. Our current models represent lots of things inefficiently. 13/17
Some problems, like multiplication, can’t be solved efficiently by transformers. And our current LLMs spend an amount of compute that depends only on input/output lengths, which prevents them from solving problems like chess, where even a small board takes enormous compute to solve. 14/17
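A rough back-of-the-envelope sketch of that fixed-compute point (my own numbers and helper function, not from the thread): a decoder-only transformer’s forward pass costs roughly 2 FLOPs per parameter per token, so its total compute is set by sequence lengths alone, not by how hard the question is.

```python
# Back-of-the-envelope sketch (rule-of-thumb estimate, illustrative numbers):
# a transformer's forward-pass compute depends only on how many tokens it
# processes, roughly ~2 FLOPs per parameter per token (ignoring attention terms).

def forward_flops(n_params: float, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate forward-pass FLOPs: ~2 * parameters * total tokens processed."""
    return 2.0 * n_params * (prompt_tokens + output_tokens)

n_params = 7e9  # hypothetical 7B-parameter model (made-up size for illustration)

easy = forward_flops(n_params, prompt_tokens=10, output_tokens=5)  # "What is 7 * 8?"
hard = forward_flops(n_params, prompt_tokens=80, output_tokens=5)  # "Find the winning move in this chess position: ..."

print(f"arithmetic query: ~{easy:.2e} FLOPs")
print(f"chess query:      ~{hard:.2e} FLOPs")
# The budgets differ only because the prompts differ in length: the model cannot
# "think longer" about the chess position, even if solving it actually requires
# far more computation than one forward pass provides.
```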
Nonetheless, even previous generations of LLMs learned complex structures beyond memorizing their training data, such as in-context learning and induction heads. 15/17
transformer-circuits.pub/2022/in-contex…
Check out our paper, with @m_finzi, Keefer Rowan, @andrewgwils, where we show just how important simplicity bias, formalized using Kolmogorov complexity, is for machine learning. The paper is easy to approach for all audiences! 16/17
arxiv.org/abs/2304.05366
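To give the flavor of that formalization (my paraphrase of the general idea, not a quote from the paper): a simplicity bias can be written as a prior that puts exponentially more mass on hypotheses with low Kolmogorov complexity.

```latex
% A simplicity bias written as a prior (paraphrase of the general idea):
% hypotheses get prior mass that decays exponentially in their Kolmogorov
% complexity K(h), i.e. in the length of their shortest description,
\[
  P(h) \;\propto\; 2^{-K(h)},
\]
% so that among hypotheses fitting the training data equally well, the
% shortest description (the rule, not the lookup table) dominates the posterior.
```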
With enough text, the shortest program that reproduces it is one that first builds an AGI and then uses it to write the text! 🖐️🎤 17/17

