Mitchell Wortsman
grad student at @uwcse
Sep 28, 2023 5 tweets 2 min read
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities:

With fantastic collaborators @peterjliu, @Locchiu, @_katieeverett, @hoonkp, @jmgilmer, @skornblith, and many others (see final tweet!)

(1/15) arxiv.org/abs/2309.14322
Researchers have reported training instabilities at large scale that did not appear with the same hyperparameters at smaller scales. However, the resources required at that scale made investigation difficult.

We seek ways to reproduce, study, and predict instability with smaller models.

(2/15)
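
To make "reproduce, study, and predict" concrete, here is a minimal sketch of one way to probe instability at small scale: sweep the learning rate over a few orders of magnitude and summarize how far each run's final loss drifts from the best run. This is my own illustration, not the paper's code or exact protocol; `train_small_transformer` is a hypothetical stand-in for an existing small-model training loop.

```python
import numpy as np

def lr_sensitivity(final_losses: dict[float, float], loss_cap: float = 10.0) -> float:
    """Mean gap between each (capped) final loss and the best final loss in the sweep."""
    best = min(final_losses.values())
    return float(np.mean([min(loss, loss_cap) - best for loss in final_losses.values()]))

# Usage sketch (hypothetical helper, not from the paper):
# lrs = [1e-4 * 2**k for k in range(8)]
# final_losses = {lr: train_small_transformer(lr=lr) for lr in lrs}
# print("LR sensitivity:", lr_sensitivity(final_losses))
```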
Sep 9, 2021 10 tweets 5 min read
Can zero-shot models such as CLIP be fine-tuned without reducing out-of-distribution accuracy?

Yes! Our new method for robust fine-tuning improves average OOD accuracy by 9% on multiple ImageNet distribution shifts, without any loss in in-distribution accuracy

arxiv.org/abs/2109.01903

(1/9) Zero-shot models such as CLIP and ALIGN, pre-trained on large heterogeneous datasets, have demonstrated unprecedented robustness to challenging distribution shifts.

However, with current techniques, fine-tuning often decreases OOD accuracy compared to the zero-shot model.

(2/9)
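
A minimal sketch of one way to combine a zero-shot checkpoint with a fine-tuned one: interpolate their weights and pick the mixing coefficient on held-out data. This is my own PyTorch illustration, not the paper's released code; `model`, `zeroshot_sd`, `finetuned_sd`, and `evaluate` are assumed to already exist.

```python
import torch

def interpolate_weights(zeroshot_sd: dict[str, torch.Tensor],
                        finetuned_sd: dict[str, torch.Tensor],
                        alpha: float) -> dict[str, torch.Tensor]:
    """Return (1 - alpha) * zero-shot weights + alpha * fine-tuned weights."""
    assert zeroshot_sd.keys() == finetuned_sd.keys()
    return {k: (1.0 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
            for k in zeroshot_sd}

# Usage sketch:
# for alpha in (0.25, 0.5, 0.75):
#     model.load_state_dict(interpolate_weights(zeroshot_sd, finetuned_sd, alpha))
#     evaluate(model)  # hypothetical held-out evaluation
```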
Dec 2, 2019 5 tweets 3 min read
What's hidden in an overparameterized neural network with random weights? If the distribution is properly scaled (e.g. Kaiming Normal), then it contains a subnetwork which achieves high accuracy without ever modifying the values of the weights...

arxiv.org/abs/1911.13299

(1/n) A randomly weighted Wide ResNet-50 contains a subnetwork that is smaller than, but matches the performance of, a ResNet-34 on ImageNet :o

(2/n)
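
A rough sketch of the general recipe: keep the random (Kaiming-normal) weights fixed and learn only which ones to keep, here via a per-weight score and a top-k mask with a straight-through gradient. This is a simplified illustration, not the exact algorithm from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose random weights stay fixed; only a top-k mask is learned."""

    def __init__(self, in_features: int, out_features: int, keep_frac: float = 0.5):
        super().__init__()
        weight = torch.empty(out_features, in_features)
        nn.init.kaiming_normal_(weight)                       # fixed random weights
        self.weight = nn.Parameter(weight, requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn_like(weight))  # learned per-weight scores
        self.keep_frac = keep_frac

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.scores.numel()
        k = max(1, int(n * self.keep_frac))
        threshold = self.scores.flatten().kthvalue(n - k + 1).values
        mask = (self.scores >= threshold).float()
        # Straight-through estimator so gradients reach the scores.
        mask = mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)
```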
Sep 23, 2019 6 tweets 5 min read
Excited to share our blog on ~Discovering Neural Wirings~, expanding on our recent work with Ali Farhadi & @morastegari that will appear at #NeurIPS2019 (see thread below for more info)!

Blog: mitchellnw.github.io/blog/2019/dnw/
Preprint: arxiv.org/abs/1906.00586
Code: github.com/allenai/dnw

(1/4) Two cool takeaways:
1) In _some_ ways, NAS and sparse neural network learning are really two sides of the same coin. As NAS becomes more fine-grained, finding a good architecture is akin to finding a sparse subnetwork of the complete graph (see the toy sketch after this excerpt).

(2/4)
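
A toy sketch of that framing (my own illustration, not the training procedure from the paper or blog): put a weight on every possible edge of a complete graph over the nodes and keep only the k largest-magnitude edges as the discovered wiring.

```python
import torch

num_nodes, k = 8, 12
edge_weights = torch.randn(num_nodes, num_nodes)   # complete graph: a weight for every directed edge
edge_weights.fill_diagonal_(0.0)                   # no self-loops

flat = edge_weights.abs().flatten()
kept = torch.topk(flat, k).indices                 # the k strongest edges
mask = torch.zeros_like(flat)
mask[kept] = 1.0
wiring = mask.view(num_nodes, num_nodes)           # 1 where an edge is kept, 0 otherwise

print("kept edges:", wiring.nonzero().tolist())
```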