Mitchell Wortsman
grad student at @uwcse
Sep 28, 2023 5 tweets 2 min read
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities:

With fantastic collaborators @peterjliu, @Locchiu, @_katieeverett, @hoonkp, @jmgilmer, @skornblith, and many others (see final tweet!)

(1/15) arxiv.org/abs/2309.14322
Researchers have reported training instabilities at large scale that did not appear with the same hyperparameters at smaller scales. However, the resources required at that scale made investigation difficult.

We seek ways to reproduce, study, and predict instability with smaller models.

(2/15)
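
To make "reproduce, study, and predict" concrete, here is a minimal sketch of one way to probe instability at small scale: sweep the learning rate over a few orders of magnitude and summarize how far each run's final loss drifts from the best run. This is my own illustration, not the paper's code or exact protocol; `train_small_transformer` is a hypothetical stand-in for an existing small-model training loop.

```python
import numpy as np

def lr_sensitivity(final_losses: dict[float, float], loss_cap: float = 10.0) -> float:
    """Mean gap between each (capped) final loss and the best final loss in the sweep."""
    best = min(final_losses.values())
    return float(np.mean([min(loss, loss_cap) - best for loss in final_losses.values()]))

# Usage sketch (hypothetical helper, not from the paper):
# lrs = [1e-4 * 2**k for k in range(8)]
# final_losses = {lr: train_small_transformer(lr=lr) for lr in lrs}
# print("LR sensitivity:", lr_sensitivity(final_losses))
```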
Sep 9, 2021 10 tweets 5 min read
Can zero-shot models such as CLIP be fine-tuned without reducing out-of-distribution accuracy?

Yes! Our new method for robust fine-tuning improves average OOD accuracy by 9% on multiple ImageNet distribution shifts, without any loss in in-distribution accuracy

arxiv.org/abs/2109.01903

(1/9) Zero-shot models such as CLIP and ALIGN, pre-trained on large heterogeneous datasets, have demonstrated unprecedented robustness to challenging distribution shifts.

However, with current techniques, fine-tuning often decreases OOD accuracy compared to the zero-shot model.

(2/9)
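
A minimal sketch of one way to combine a zero-shot checkpoint with a fine-tuned one: interpolate their weights and pick the mixing coefficient on held-out data. This is my own PyTorch illustration, not the paper's released code; `model`, `zeroshot_sd`, `finetuned_sd`, and `evaluate` are assumed to already exist.

```python
import torch

def interpolate_weights(zeroshot_sd: dict[str, torch.Tensor],
                        finetuned_sd: dict[str, torch.Tensor],
                        alpha: float) -> dict[str, torch.Tensor]:
    """Return (1 - alpha) * zero-shot weights + alpha * fine-tuned weights."""
    assert zeroshot_sd.keys() == finetuned_sd.keys()
    return {k: (1.0 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
            for k in zeroshot_sd}

# Usage sketch:
# for alpha in (0.25, 0.5, 0.75):
#     model.load_state_dict(interpolate_weights(zeroshot_sd, finetuned_sd, alpha))
#     evaluate(model)  # hypothetical held-out evaluation
```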
Dec 2, 2019 5 tweets 3 min read
What's hidden in an overparameterized neural network with random weights? If the distribution is properly scaled (e.g. Kaiming Normal), then it contains a subnetwork which achieves high accuracy without ever modifying the values of the weights...

arxiv.org/abs/1911.13299

(1/n) A randomly weighted Wide ResNet-50 contains a subnetwork that is smaller than, but matches the performance of, a ResNet-34 on ImageNet :o

(2/n)
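
A rough sketch of the general recipe: keep the random (Kaiming-normal) weights fixed and learn only which ones to keep, here via a per-weight score and a top-k mask with a straight-through gradient. This is a simplified illustration, not the exact algorithm from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose random weights stay fixed; only a top-k mask is learned."""

    def __init__(self, in_features: int, out_features: int, keep_frac: float = 0.5):
        super().__init__()
        weight = torch.empty(out_features, in_features)
        nn.init.kaiming_normal_(weight)                       # fixed random weights
        self.weight = nn.Parameter(weight, requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn_like(weight))  # learned per-weight scores
        self.keep_frac = keep_frac

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.scores.numel()
        k = max(1, int(n * self.keep_frac))
        threshold = self.scores.flatten().kthvalue(n - k + 1).values
        mask = (self.scores >= threshold).float()
        # Straight-through estimator so gradients reach the scores.
        mask = mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)
```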
Sep 23, 2019 6 tweets 5 min read
Excited to share our blog on ~Discovering Neural Wirings~, expanding on our recent work with Ali Farhadi & @morastegari that will appear at #NeurIPS2019 (see thread below for more info)!

Blog: mitchellnw.github.io/blog/2019/dnw/
Preprint: arxiv.org/abs/1906.00586
Code: github.com/allenai/dnw

(1/4) Two cool takeaways:
1) In _some_ ways, NAS and sparse neural network learning are really two sides of the same coin. As NAS becomes more fine-grained, finding a good architecture is akin to finding a sparse subnetwork of the complete graph (see the toy sketch after this excerpt).

(2/4)
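
A toy sketch of that framing (my own illustration, not the training procedure from the paper or blog): put a weight on every possible edge of a complete graph over the nodes and keep only the k largest-magnitude edges as the discovered wiring.

```python
import torch

num_nodes, k = 8, 12
edge_weights = torch.randn(num_nodes, num_nodes)   # complete graph: a weight for every directed edge
edge_weights.fill_diagonal_(0.0)                   # no self-loops

flat = edge_weights.abs().flatten()
kept = torch.topk(flat, k).indices                 # the k strongest edges
mask = torch.zeros_like(flat)
mask[kept] = 1.0
wiring = mask.view(num_nodes, num_nodes)           # 1 where an edge is kept, 0 otherwise

print("kept edges:", wiring.nonzero().tolist())
```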