Dec 13, 2022 • 21 tweets • 5 min read

[1/n]

https://twitter.com/goodside/status/1598253337400717313

Dec 10, 2022 • 16 tweets • 6 min read

We wrote a big ol’ survey paper on efficient neural net training, with an emphasis on practical recommendations. There are four main elements: [1/16] First is just collecting and categorizing a lot of papers on efficient training. We didn’t capture all of them, but there’s a wide selection that should give you a feel for what’s out there. [2/16]

Oct 29, 2022 • 14 tweets • 9 min read

CNNs—not transformers—now dominate the hardest sequence modeling benchmark.

Here's how this happened: [1/14] First, let's be precise: by "the hardest sequence modeling benchmark," I mean the Long Range Arena (arxiv.org/abs/2011.04006). This consists of tasks like Pathfinder (shown above) that are designed to require modeling of long-range dependencies. [2/14]
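
To make "CNN" concrete here: the models in question rely on convolution kernels that can span the entire input sequence. A minimal PyTorch sketch of that ingredient (the `LongConvBlock` name is mine, and this is not any particular LRA submission's architecture):

```python
import torch
import torch.nn as nn

class LongConvBlock(nn.Module):
    """Depthwise 1-D convolution whose kernel is as long as the input."""
    def __init__(self, channels: int, seq_len: int):
        super().__init__()
        # One kernel per channel, spanning the whole sequence.
        self.conv = nn.Conv1d(channels, channels, kernel_size=seq_len,
                              padding=seq_len - 1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len); crop the padding overhang
        # so the convolution stays causal.
        return self.conv(x)[..., : x.shape[-1]]

x = torch.randn(2, 64, 1024)                 # batch of 1024-step sequences
print(LongConvBlock(64, 1024)(x).shape)      # torch.Size([2, 64, 1024])
```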

Oct 22, 2022 • 12 tweets • 6 min read

A ton of cool experiments on the effects of data augmentations and when you should use different ones. [1/12] One observation is that different augmentations help in different data regimes. With not much data, aggressive augmentations are better. With more data, conservative augmentations like horizontal flipping… [2/12]
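
As a rough illustration of the two regimes in torchvision (the "aggressive" and "conservative" pipelines below are my own examples, not the paper's exact splits):

```python
from torchvision import transforms

# Conservative: mild, label-preserving changes. Better with lots of data.
conservative = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Aggressive: heavy cropping and color distortion. Better with scarce data.
aggressive = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.2, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```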

Oct 21, 2022 • 14 tweets • 6 min read

Transformer activations tend to be really sparse. What's up with this? And can we exploit it? [1/14] First, it’s not that most neurons are dead, but that nearly all neurons fire rarely. It’s only a handful that fire more than half the time. [2/14]
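
The measurement is easy to reproduce: record how often each neuron's post-ReLU activation is nonzero across a batch of tokens. A sketch with stand-in activations (real transformer MLP activations would show the heavy skew described above; random Gaussians won't):

```python
import torch

acts = torch.relu(torch.randn(10_000, 3072))   # stand-in for real MLP activations
firing_rate = (acts > 0).float().mean(dim=0)   # per-neuron fraction of tokens

print("median firing rate:", firing_rate.median().item())
print("neurons firing >50% of the time:", (firing_rate > 0.5).sum().item())
```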

Oct 20, 2022 • 6 tweets • 3 min read

Do log-log scaling laws show up for reinforcement learning like they do for language modeling? [1/6] At least for the Connect Four and Pentago agents they trained, the answer is yes. And interestingly, the exponents for the two games are nearly identical. [2/6]
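
A log-log scaling law just means performance behaves like a·compute^b, which is a straight line in log space, so the exponent is a slope you can read off with a linear fit. A sketch with made-up numbers (not the paper's data):

```python
import numpy as np

compute = np.array([1e12, 1e13, 1e14, 1e15])   # hypothetical training budgets
score   = np.array([0.02, 0.05, 0.12, 0.30])   # hypothetical performance metric

b, log_a = np.polyfit(np.log(compute), np.log(score), 1)
print(f"fitted scaling exponent: {b:.2f}")     # slope on the log-log plot
```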

Oct 11, 2022 • 12 tweets • 4 min read

They show why, under reasonable assumptions, gradient descent tends to hover at the edge of stability. Basically, it's a 3-step negative feedback loop. [1/12] To understand this feedback loop, first recall that the edge of stability means that the operator norm of the local Hessian (the "curvature") is 2/η, where η is the learning rate. 2/η is the most curvature one can have without diverging. [2/12]
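
You can see where 2/η comes from on a 1-D quadratic: gradient descent multiplies the iterate by (1 − ηc) each step, which diverges exactly when the curvature c exceeds 2/η. A tiny demo:

```python
eta = 0.1                              # learning rate, so 2/eta = 20
for c in (19.0, 21.0):                 # curvature just below / above 2/eta
    x = 1.0
    for _ in range(100):
        x -= eta * c * x               # gradient of (c/2) * x^2 is c * x
    print(f"curvature {c}: |x| after 100 steps = {abs(x):.2e}")
# c = 19 -> shrinks toward 0; c = 21 -> blows up
```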

Sep 24, 2022 • 11 tweets • 3 min read

So you probably know that neural nets can generate videos of people saying stuff they never said. But Microsoft’s chief science officer articulates two threats beyond this that could be way worse: [1/11] The first is the "interactive deepfake." This is not just static content, but the illusion of talking to a real person. Imagine a scammer who looks and sounds exactly like you calling your grandmom. Or thinking you're meeting someone online when it's actually a bot. [2/11]

Sep 8, 2022 • 8 tweets • 3 min read

Ops like BatchNorm and LayerNorm keep the activation variances constant in the forward pass. But what about gradient variances in the backward pass? Can we add ops that solve vanishing/exploding gradients? [1/8] Surprisingly...yes. The idea is simple: if your activation function is λ times as steep, your gradient with respect to the input gets scaled by λ. So if you do… [2/8]
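
The λ claim is a one-liner to verify with autograd (a toy check, not the paper's proposed op):

```python
import torch

lam = 3.0
x = torch.randn(5, requires_grad=True)
y = (lam * torch.relu(x)).sum()   # a ReLU made lam times as steep
y.backward()
print(x.grad)                     # equals lam wherever x > 0, else 0
```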

Sep 7, 2022 • 9 tweets • 4 min read

How should you pretrain your model if your goal is maximizing downstream segmentation quality? [1/9] In particular, should you 1) do supervised or self-supervised pretraining on a general-purpose corpus like ImageNet? And 2) bother adding a second self-supervised learning step on domain-specific data? [2/9]

Sep 6, 2022 • 9 tweets • 10 min read

So let’s say you’re pruning a neural net and want the best model you can get at the end. More precisely,… [1/9] …suppose you're iteratively training the network, picking a subset of weights to keep, rewinding those weights back to a checkpoint early in training, and then fine-tuning from there (see @alex_renda_, @jefrankle, & @mcarbin 2020: arxiv.org/abs/2003.02389) [2/9]
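
A compressed sketch of that loop (the `train_fn` interface, helper names, and magnitude criterion are placeholders; the cited paper is the real reference):

```python
import copy
import torch

def magnitude_mask(weights, keep_frac, old_mask):
    # Keep the largest-magnitude weights among the current survivors.
    surviving = weights[old_mask.bool()].abs()
    k = max(1, int(keep_frac * surviving.numel()))
    thresh = surviving.topk(k).values.min()
    return (weights.abs() >= thresh).float() * old_mask

def prune_with_rewinding(model, train_fn, rounds=5, keep_per_round=0.8,
                         rewind_step=1_000, total_steps=10_000):
    train_fn(model, steps=rewind_step)
    ckpt = copy.deepcopy(model.state_dict())   # early checkpoint to rewind to
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, steps=total_steps - rewind_step, masks=masks)
        for n, p in model.named_parameters():  # pick a subset of weights to keep
            masks[n] = magnitude_mask(p.data, keep_per_round, masks[n])
        model.load_state_dict(ckpt)            # rewind survivors to the checkpoint
    train_fn(model, steps=total_steps - rewind_step, masks=masks)  # final fine-tune
    return model, masks
```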

Sep 3, 2022 • 14 tweets • 5 min read

What if we asked language models to complete text describing what a human would do in a situation? Would they produce realistic answers? How close to human behavior would they get? [1/14] The authors of this paper answer these questions by simulating classic psych studies, with participant responses given by GPT-3 variants. [2/14]

Aug 27, 2022 • 15 tweets • 6 min read

For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.

That period is ending. Here's what happened: [1/14] In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]

Aug 23, 2022 • 11 tweets • 3 min read

Another optimizer paper attempting to descend through a crowded valley to beat Adam. But...maybe this one actually does? [1/11] Their update equation is fairly straightforward, and complements the gradient momentum term with a difference-of-gradients momentum term. [2/11]
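
My reading of that update, as a toy sketch with two momentum buffers, one over gradients and one over gradient differences (this is not the paper's exact algorithm or hyperparameters):

```python
import torch

def step(param, grad, prev_grad, m, d, lr=1e-3, beta1=0.9, beta2=0.9):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)               # gradient momentum
    d.mul_(beta2).add_(grad - prev_grad, alpha=1 - beta2)   # difference-of-gradients momentum
    param.add_(m + d, alpha=-lr)                            # combined update
    prev_grad.copy_(grad)                                   # remember for next step
```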

Aug 21, 2022 • 11 tweets • 7 min read

Can models learn new, non-trivial functions...with no parameter changes? Turns out the answer is yes, with in-context learning: [1/11] In-context learning is when you include some examples as text in the prompt at test time. Here's a great illustration from @sewon__min et al. (arxiv.org/abs/2202.12837). [2/11]
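
Concretely, "examples as text in the prompt" looks like this (a made-up sentiment task; no gradient step ever happens):

```python
prompt = """Review: The food was cold and the service slow.
Sentiment: negative

Review: Best concert I've been to in years!
Sentiment: positive

Review: That museum exhibit was absolutely stunning.
Sentiment:"""
# A capable LM completes this with "positive", picking up the task
# format purely from the two in-context examples.
```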

Aug 17, 2022 • 13 tweets • 5 min read

How well do image-text models trained on a given dataset generalize to other datasets? [1/12] The answer is: it’s complicated. Different pretraining datasets work better for different downstream datasets. [2/12]

Aug 13, 2022 • 11 tweets • 4 min read

This paper changed my thinking about what future language models will be good at, mostly in a really concerning way. Let's start with some context: [1/11] To teach models to program, you used to give them a natural language prompt. But recent work has shown that you can instead just show them a unit test and tell them to… [2/11]
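
Concretely, the shift is from English descriptions to something like this made-up prompt, where the unit test itself specifies the task:

```python
prompt = '''def test_rle():
    assert rle("aaabcc") == [("a", 3), ("b", 1), ("c", 2)]

# Write the function `rle` so the test above passes.
def rle(s):'''
# The model is asked to continue from "def rle(s):" with an implementation
# that satisfies the assertion.
```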

Aug 6, 2022 • 10 tweets • 4 min read

Are vision transformers really better than CNNs? This paper strongly suggests an answer, based on a robustness throwdown between {ViT, Swin} and {BiT, ConvNeXt}. [1/10] First, they measure the learning of spurious features using datasets designed to assess simplicity bias, background bias, and texture bias. The transformers and the CNNs behave similarly. [2/10]

Aug 4, 2022 • 8 tweets • 5 min read

An intuitive method for making models robust to distribution shift. They replace vectors in the latent space with their nearest centroids, with the clustering… [1/8] …and quantization applied separately to different slices of the feature space. The centroids are learned using a moving average process similar to minibatch k-means. [2/8]
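
A minimal sketch of those two pieces, snapping latents to their nearest centroid and nudging centroids with a moving average (the split into separate feature-space slices is omitted, and none of this is the paper's exact code):

```python
import torch

def quantize_with_ema(z, centroids, decay=0.99):
    # z: (n, d) latents; centroids: (k, d) learned cluster centers.
    assign = torch.cdist(z, centroids).argmin(dim=1)  # nearest centroid per latent
    for i in assign.unique():                         # EMA toward each batch mean,
        batch_mean = z[assign == i].mean(dim=0)       # similar to minibatch k-means
        centroids[i] = decay * centroids[i] + (1 - decay) * batch_mean
    return centroids[assign]                          # latents replaced by centroids

z_q = quantize_with_ema(torch.randn(256, 32), torch.randn(16, 32))
print(z_q.shape)  # torch.Size([256, 32])
```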