Micah Goldblum
🤖Prof at Columbia University 🏙️. All things machine learning.🤖
Feb 29, 2024
Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs.

w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils

1/9 arxiv.org/abs/2312.17173

LLMs are big, so they can memorize tons of training data, even randomly generated text where generalization is impossible. How do we tell generalization from memorization, especially in models trained on so much data that we can't even be sure which test samples they haven't already seen? 2/9
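As a hedged aside (my addition, not part of the thread): the flavor of guarantee involved is an Occam-style compression bound. If the trained model h can be described in C(h) bits by a prefix-free code, then for a loss bounded in [0, 1] and m i.i.d. samples, with probability at least 1 − δ,

```latex
% Standard Occam / compression generalization bound (a sketch of the style of
% bound, not necessarily the exact statement in arxiv.org/abs/2312.17173).
% R(h): population risk, \hat{R}(h): empirical risk on m samples,
% C(h): compressed description length of h in bits, \delta: failure probability.
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{C(h)\,\ln 2 + \ln(1/\delta)}{2m}}
```

A bound of this form is non-vacuous only when the model compresses well relative to the amount of training data, which is exactly why memorizing random text cannot certify generalization.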
Apr 20, 2023
🚨Here’s an intuitive explanation for why training on lots and lots of data creates emergent properties, for instance math and reasoning, in large language models like #GPT-4 and #ChatGPT 🚨 1/17

Let’s start with the basics. Real-world data is full of patterns and structure. This structure allows us to describe things with simple rules. We exploit this fact all the time, for example to derive laws of physics or differential equations. 2/17
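As an illustrative aside (my sketch, not part of the thread): one concrete face of "structure means simple rules" is compressibility. Rule-governed text compresses to a tiny fraction of its size, while i.i.d. random characters barely compress at all.

```python
import random
import string
import zlib

random.seed(0)

# Structured text: one simple rule ("the cat sat on the mat.") repeated many times.
structured = ("the cat sat on the mat. " * 400).encode()

# Unstructured text: i.i.d. random characters of the same total length.
alphabet = string.ascii_lowercase + " ."
unstructured = "".join(random.choice(alphabet) for _ in range(len(structured))).encode()

print("structured:  ", len(structured), "bytes ->", len(zlib.compress(structured, 9)), "compressed")
print("unstructured:", len(unstructured), "bytes ->", len(zlib.compress(unstructured, 9)), "compressed")
```

The structured string collapses to a few dozen bytes while the random one stays near its original size; it is this kind of regularity that lets simple rules, and models that prefer them, capture real-world data.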
Nov 12, 2022
The following statement, while a commonly held view, is actually false! “Learning theory says that the more functions your model can represent, the more samples it needs to learn anything.” 1/8

While it is true that a model that can express only a few functions needs few samples to learn, the converse is not true! This underscores the failure of ideas like VC dimension and Rademacher complexity to explain neural network generalization. 2/8
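One standard way to make this precise (added here as a hedged aside, not from the thread) is a PAC-Bayes bound: the complexity term is the KL divergence between the learned posterior Q and a prior P over functions, not the number of functions the class can represent. In the McAllester/Maurer form, for a loss bounded in [0, 1] and m i.i.d. samples, with probability at least 1 − δ, simultaneously for all posteriors Q,

```latex
% PAC-Bayes generalization bound (McAllester/Maurer form).
% The complexity term is KL(Q || P), independent of |H|, VC dimension,
% or Rademacher complexity of the hypothesis class.
R(Q) \;\le\; \hat{R}(Q) + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
```

A class that can represent a huge number of functions, paired with a prior concentrated on simple ones, can therefore still generalize from few samples.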
Oct 13, 2022
How much data are augmentations worth? We show that augmentations can actually be worth more than extra data and invariance! They increase variance across batches, and this extra stochasticity finds flatter minima. arxiv.org/abs/2210.06441 1/8

As we gather more and more data, if we train without augmentations, we expect the performance of our model to saturate. This is not true under data augmentations! But if the augmentations are inconsistent with the data distribution, we never overcome that inconsistency. 2/8
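To make the "extra stochasticity" point concrete, here is a minimal toy sketch (my addition, using a hypothetical additive-noise augmentation rather than the augmentations studied in the paper): randomly augmenting each batch inflates the variance of the stochastic gradient across batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 1-D linear regression with squared error.
X = rng.normal(size=(1024, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1024)
w = np.array([1.0])  # current parameter estimate

def batch_grad(xb, yb, w):
    # Gradient of the mean squared error with respect to w.
    resid = xb @ w - yb
    return 2.0 * xb.T @ resid / len(yb)

def augment(xb, strength=0.3):
    # Hypothetical augmentation: additive Gaussian jitter on the inputs.
    return xb + rng.normal(scale=strength, size=xb.shape)

def grad_variance(use_aug, n_batches=500, batch_size=64):
    # Variance of the minibatch gradient across many random batches.
    grads = []
    for _ in range(n_batches):
        idx = rng.integers(0, len(X), size=batch_size)
        xb = augment(X[idx]) if use_aug else X[idx]
        grads.append(batch_grad(xb, y[idx], w))
    return np.var(np.stack(grads), axis=0)

print("gradient variance, no augmentation:  ", grad_variance(False))
print("gradient variance, with augmentation:", grad_variance(True))
```

The augmented batches produce noisier gradients, which is the extra SGD stochasticity the thread credits with steering optimization toward flatter minima.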