Surya Ganguli
Associate Prof of Applied Physics @Stanford, and departments of Computer Science, Electrical Engineering and Neurobiology. Venture Partner @a16z

Jun 30, 2022, 8 tweets

1/ Is scale all you need for AGI? (Unlikely.) But our new paper "Beyond neural scaling laws: beating power law scaling via data pruning" shows how to achieve far superior exponential decay of error with dataset size, rather than slow power-law neural scaling: arxiv.org/abs/2206.14486
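
To make the contrast concrete, here is a schematic of the two scaling regimes; the symbols (exponent ν and constants c, c', N_0, E_∞) are generic placeholders rather than values from the paper:

```latex
% Schematic scaling of test error E with dataset size N (generic symbols)
\text{power-law (random data):} \quad E(N) \approx E_\infty + c\,N^{-\nu}
\qquad
\text{exponential (pruned data):} \quad E(N) \approx E_\infty + c'\,e^{-N/N_0}
```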

2/ In joint work @MetaAI w/ Ben Sorscher, Robert Geirhos, Shashank Shekhar & @arimorcos, we show both in theory (via statistical mechanics) and in practice how to achieve exponential scaling by training only on selected data subsets of difficult, nonredundant examples (suitably defined)
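
As a rough illustration of the core idea (not the paper's exact procedure): once you have some per-example difficulty score, pruning reduces to keeping only the hardest fraction of the data before training. The score values and fraction below are hypothetical:

```python
import numpy as np

def prune_by_difficulty(scores: np.ndarray, keep_frac: float) -> np.ndarray:
    """Return indices of the hardest `keep_frac` fraction of examples.

    `scores` is any per-example difficulty metric (higher = harder);
    computing a good metric is the metric-specific, interesting part.
    """
    n_keep = int(round(keep_frac * len(scores)))
    # argsort is ascending, so the tail holds the highest-difficulty examples
    return np.argsort(scores)[-n_keep:]

# Hypothetical usage: placeholder scores, keep the hardest 75% of examples
scores = np.random.rand(50_000)
keep_idx = prune_by_difficulty(scores, keep_frac=0.75)
# train_subset = torch.utils.data.Subset(full_dataset, keep_idx.tolist())
```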

3/ Our statistical mechanics theory of data pruning makes several predictions, including the ability to beat power-law scaling, which we confirm with ResNets on various tasks (SVHN, CIFAR10, ImageNet) and Vision Transformers fine-tuned on CIFAR10

4/ Then, focusing on ImageNet, we performed a large-scale benchmarking study of 10 different data-pruning metrics that rank examples from easiest to hardest, and tested their efficacy at pruning data to create small data subsets containing only the hardest examples to train on
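
A hedged sketch of what such a benchmark loop could look like; the `score_fns` dictionary and `train_and_eval` helper are hypothetical stand-ins, not the paper's code:

```python
import numpy as np

def benchmark_pruning_metrics(score_fns, dataset, keep_fracs, train_and_eval):
    """For each metric and keep-fraction, train on only the hardest examples
    and record test accuracy. All helpers are assumed, illustrative interfaces."""
    results = {}
    for name, score_fn in score_fns.items():
        scores = score_fn(dataset)            # per-example difficulty scores
        order = np.argsort(scores)            # ascending: easiest -> hardest
        for f in keep_fracs:
            n_keep = int(round(f * len(scores)))
            hardest = order[-n_keep:]         # retain only the hardest subset
            results[(name, f)] = train_and_eval(dataset, hardest)
    return results
```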

5/ We additionally developed a new unsupervised data-pruning metric that does not even require labels, is easy to compute given a pre-trained foundation model, and outperforms all previous metrics on ImageNet, allowing us to train on ~75% of ImageNet without accuracy loss
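
Roughly, a label-free metric of this kind clusters embeddings from a pre-trained self-supervised model and treats examples far from their nearest cluster centroid as harder. A minimal sketch, assuming embeddings are already computed; the choice of k and of Euclidean distance here is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_distance_scores(embeddings: np.ndarray, k: int = 100) -> np.ndarray:
    """Label-free difficulty proxy: distance of each embedding to its nearest
    k-means centroid (larger = less prototypical = treated as harder).

    `embeddings` would come from a pre-trained foundation model; k is illustrative.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    # transform() returns distances to all centroids; keep the nearest one
    return km.transform(embeddings).min(axis=1)
```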

6/ Overall, this work suggests that our current ML practice of collecting large amounts of random data is highly inefficient: it creates huge redundancy in the data, which we show mathematically is the origin of very slow, unsustainable power-law scaling of error with dataset size

7/ A better way forward might be the creation of foundation datasets: carefully curated small data subsets capable of training highly accurate models with far less data than our current large, randomly selected datasets (see discussion in the paper)

8/ Indeed, the initial computational cost of creating a foundation dataset through data pruning can be amortized across efficiency gains in training many downstream models, just as the initial cost of training foundation models is amortized across faster fine-tuning on many tasks
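
The amortization argument is simple arithmetic (the symbols are generic, not from the paper): if computing the pruning metric once costs C_prune and each downstream training run on the pruned set saves ΔC in compute, then pruning pays for itself after M downstream runs as soon as

```latex
M\,\Delta C - C_{\text{prune}} > 0
\quad\Longleftrightarrow\quad
M > \frac{C_{\text{prune}}}{\Delta C}.
```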
