Davis Blalock
Research scientist @MosaicML. PhD @MIT. I go through the hundreds of new machine learning papers each week and share my favorites. Newsletter: https://t.co/xX7NIpazHR
Dec 13, 2022 21 tweets 5 min read
Here are all the ways to get around ChatGPT's safeguards:

[1/n]
Dec 10, 2022 16 tweets 6 min read
"Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities"

We wrote a big ol’ survey paper on efficient neural net training, with an emphasis on practical recommendations. There are four main elements: [1/16] The first is simply collecting and categorizing a lot of papers on efficient training. We didn’t capture all of them, but there’s a wide selection that should give you a feel for what’s out there. [2/16]
Oct 29, 2022 14 tweets 9 min read
"What Makes Convolutional Models Great on Long Sequence Modeling?"

CNNs—not transformers—now dominate the hardest sequence modeling benchmark.

Here's how this happened: [1/14] First, let's be precise: by "the hardest sequence modeling benchmark," I mean the Long Range Arena (arxiv.org/abs/2011.04006). This consists of tasks like Pathfinder (shown above) that are designed to require modeling of long-range dependencies. [2/14]
Oct 22, 2022 12 tweets 6 min read
"How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization"

A ton of cool experiments on the effects of data augmentations and when you should use different ones. [1/12] One observation is that different augmentations help in different data regimes. With not much data, aggressive augmentations are better. With more data, conservative augmentations like horizontal flipping… [2/12]
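To make "aggressive" vs. "conservative" concrete, here's a minimal sketch of what the two regimes might look like in torchvision; the specific transforms are illustrative choices, not the paper's exact setup.

```python
# Hedged sketch: two augmentation pipelines of different strengths (torchvision).
# The exact transforms and magnitudes are illustrative, not taken from the paper.
import torchvision.transforms as T

conservative = T.Compose([
    T.RandomHorizontalFlip(),               # mild, label-preserving
    T.ToTensor(),
])

aggressive = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),  # heavier, image-distorting policy
    T.ToTensor(),
])
```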
Oct 21, 2022 14 tweets 6 min read
"Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers"

Transformer activations tend to be really sparse. What's up with this? And can we exploit it? [1/14] First, it’s not that most neurons are dead, but that nearly all neurons fire rarely. It’s only a handful that fire more than half the time. [2/14]
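A minimal sketch of how you might measure this yourself, assuming a ReLU-style MLP activation; the shapes and activations below are placeholders, not the paper's setup:

```python
# Hedged sketch: estimate per-neuron firing rates in a transformer MLP layer.
# "Fires" = post-ReLU activation is nonzero; the activations here are dummies.
import torch

def firing_rates(activations: torch.Tensor) -> torch.Tensor:
    # activations: (batch, seq_len, hidden_dim) output of the MLP's ReLU
    fired = (activations > 0).float()
    return fired.mean(dim=(0, 1))        # fraction of tokens on which each neuron fires

rates = firing_rates(torch.relu(torch.randn(8, 128, 3072)))  # dummy activations
print((rates > 0.5).float().mean())      # fraction of neurons firing on >50% of tokens
```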
Oct 20, 2022 6 tweets 3 min read
"Scaling Laws for a Multi-Agent Reinforcement Learning Model"

Do log-log scaling laws show up for reinforcement learning like they do for language modeling? [1/6] At least for the Connect Four and Pentago agents they trained, the answer is yes. And interestingly, the exponents for the two games are nearly identical. [2/6]
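For reference, fitting a scaling law like this is just a linear fit in log-log space. A minimal sketch with made-up numbers:

```python
# Hedged sketch: fit a power law loss ~ a * compute^b via linear regression in log-log space.
# The data points are made up; this only illustrates the fitting procedure.
import numpy as np

compute = np.array([1e12, 1e13, 1e14, 1e15, 1e16])
loss    = np.array([0.90, 0.62, 0.43, 0.30, 0.21])

b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)   # slope, intercept
print(f"exponent ~ {b:.3f}, prefactor ~ {np.exp(log_a):.3f}")
```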
Oct 11, 2022 12 tweets 4 min read
"Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability"

They show why, under reasonable assumptions, gradient descent tends to hover at the edge of stability. Basically, it's a 3-step negative feedback loop. [1/12] To understand this feedback loop, first recall that the edge of stability means that the operator norm of the local Hessian (the "curvature") is 2/η, where η is the learning rate. 2/η is the most curvature one can have without diverging. [2/12]
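A quick numerical check of that threshold on a 1-D quadratic, where the Hessian is just the curvature:

```python
# Hedged sketch: on f(x) = 0.5 * curvature * x^2, gradient descent with step size lr
# converges iff curvature < 2 / lr. This checks the divergence threshold numerically.
def run_gd(curvature, lr, steps=100, x=1.0):
    for _ in range(steps):
        x = x - lr * curvature * x       # gradient of 0.5 * c * x^2 is c * x
    return abs(x)

lr = 0.1
print(run_gd(curvature=19.0, lr=lr))     # 19 < 2/0.1 = 20  -> shrinks toward 0
print(run_gd(curvature=21.0, lr=lr))     # 21 > 20          -> blows up
```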
Sep 24, 2022 11 tweets 3 min read
"On the Horizon: Interactive and Compositional Deepfakes"

So you probably know that neural nets can generate videos of people saying stuff they never said. But Microsoft’s chief science officer articulates two threats beyond this that could be way worse: [1/11] The first is the "interactive deepfake." This is not just static content, but the illusion of talking to a real person. Imagine a scammer calling your grandmother who looks and sounds exactly like you. Or thinking you're meeting someone online when it's actually a bot. [2/11]
Sep 8, 2022 8 tweets 3 min read
"Normalized Activation Function: Toward Better Convergence"

Ops like BatchNorm and LayerNorm keep the activation variances constant in the forward pass. But what about gradient variances in the backward pass? Can we add ops that solve vanishing/exploding gradients? [1/8] Surprisingly...yes. The idea is simple: if your activation function is λ times as steep, your gradient with respect to the input gets scaled by λ. So if you do… [2/8]
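A minimal PyTorch check of that chain-rule fact, with arbitrary inputs:

```python
# Hedged sketch: chain-rule check that making an activation lam times as steep
# scales the gradient with respect to the input by lam.
import torch

lam = 3.0
x1 = torch.randn(16, requires_grad=True)
x2 = x1.detach().clone().requires_grad_(True)

torch.relu(x1).sum().backward()           # baseline activation
(lam * torch.relu(x2)).sum().backward()   # same activation, lam times as steep

print(torch.allclose(x2.grad, lam * x1.grad))  # True
```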
Sep 7, 2022 9 tweets 4 min read
"Self-Supervised Pretraining for 2D Medical Image Segmentation"

How should you pretrain your model if your goal is maximizing downstream segmentation quality? [1/9] In particular, should you 1) do supervised or self-supervised pretraining on a general-purpose corpus like ImageNet? And 2) bother adding a second self-supervised learning step on domain-specific data? [2/9]
Sep 6, 2022 9 tweets 10 min read
"Lottery Pools: Winning More by Interpolating Tickets without Increasing Training or Inference Cost"

So let’s say you’re pruning a neural net and want the best model you can get at the end. More precisely,… [1/9] …suppose you're iteratively training the network, picking a subset of weights to keep, rewinding those weights back to a checkpoint early in training, and then fine-tuning from there (see @alex_renda_, @jefrankle, & @mcarbin 2020: arxiv.org/abs/2003.02389) [2/9]
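For readers who haven't seen it, here's a rough sketch of that prune/rewind/retrain loop (PyTorch-flavored pseudocode with an assumed train() helper; not code from either paper):

```python
# Hedged sketch of the prune -> rewind -> retrain loop described above.
# train() and the rewind checkpoint are assumed helpers, not from the paper.
import copy
import torch

def iterative_magnitude_pruning(model, train, rounds=5, prune_frac=0.2):
    rewind_state = copy.deepcopy(model.state_dict())     # checkpoint from early in training
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)                               # train with the current mask applied
        for name, p in model.named_parameters():          # prune smallest surviving weights
            alive = p[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            thresh = alive.quantile(prune_frac)
            masks[name] *= (p.abs() > thresh).float()
        model.load_state_dict(rewind_state)                # rewind weights, keep the mask
    return masks
```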
Sep 5, 2022 8 tweets 4 min read
"Bugs in the Data: How ImageNet Misrepresents Biodiversity"

ImageNet-1k labels for animals are especially bad. [1/8] While ImageNet as a whole is about 10% mislabeled, it seems to be 12.3% mislabeled for animals. [2/8]
Sep 3, 2022 14 tweets 5 min read
"Using Large Language Models to Simulate Multiple Humans"

What if we asked language models to complete text describing what a human would do in a situation? Would they produce realistic answers? How close to human behavior would they get? [1/14] The authors of this paper answer these questions by simulating classic psych studies, with participant responses given by GPT-3 variants. [2/14]
Aug 27, 2022 15 tweets 6 min read
"Understanding Scaling Laws for Recommendation Models"

For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.

That period is ending. Here's what happened: [1/14] In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]
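For reference, the power-law forms from that paper look roughly like this (quoted from memory, so treat the exponents as approximate):

```latex
% Kaplan et al. (2020) scaling-law forms, reproduced from memory; exponents approximate.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
\alpha_N \approx 0.076, \quad \alpha_D \approx 0.095
```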
Aug 25, 2022 8 tweets 3 min read
"No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects"

Instead of using a pooling layer or having a stride for your conv, just use a space-to-depth op followed by a non-strided conv. [1/8] This substitution seems to be an improvement. [2/8]
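A minimal PyTorch sketch of the substitution, with illustrative channel counts:

```python
# Hedged sketch: a stride-2 conv vs. space-to-depth (PixelUnshuffle) followed by a
# stride-1 conv. Channel counts and kernel sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

strided = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

space_to_depth = nn.Sequential(
    nn.PixelUnshuffle(downscale_factor=2),                 # (B, 64, H, W) -> (B, 256, H/2, W/2)
    nn.Conv2d(64 * 4, 128, kernel_size=3, stride=1, padding=1),
)

x = torch.randn(1, 64, 32, 32)
print(strided(x).shape, space_to_depth(x).shape)           # both (1, 128, 16, 16)
```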
Aug 23, 2022 11 tweets 3 min read
"Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models"

Another optimizer paper attempting to descend through a crowded valley to beat Adam. But...maybe this one actually does? [1/11] Their update equation is fairly straightforward, and complements the gradient momentum term with a difference-of-gradients momentum term. [2/11]
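A rough sketch of the flavor of that idea, heavily simplified and NOT the paper's exact update rule:

```python
# Hedged, simplified sketch (not Adan's actual equations): keep one momentum over
# gradients and a second momentum over gradient *differences*, then step along both.
def adan_like_step(theta, grad, prev_grad, m, v, lr=1e-3, b1=0.9, b2=0.9):
    m = b1 * m + (1 - b1) * grad                 # momentum over gradients
    v = b2 * v + (1 - b2) * (grad - prev_grad)   # momentum over gradient differences
    return theta - lr * (m + v), m, v            # combined update direction

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adan_like_step(theta, grad=0.5, prev_grad=0.4, m=m, v=v)
```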
Aug 21, 2022 11 tweets 7 min read
"What Can Transformers Learn In-Context? A Case Study of Simple Function Classes"

Can models learn new, non-trivial functions...with no parameter changes? Turns out the answer is yes, with in-context learning: [1/11] In-context learning is when you include some examples as text in the prompt at test time. Here's a great illustration from @sewon__min et al. (arxiv.org/abs/2202.12837). [2/11]
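A minimal sketch of what an in-context prompt for a simple function class might look like; the values and formatting are made up for illustration:

```python
# Hedged sketch: build an in-context prompt for a simple (roughly linear) function.
# The model would be asked to continue the prompt with y for the final x.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)]
query_x = 5.0

prompt = "".join(f"x = {x}, y = {y}\n" for x, y in examples) + f"x = {query_x}, y ="
print(prompt)
```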
Aug 17, 2022 13 tweets 5 min read
"Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP"

How well do image-text models trained on a given dataset generalize to other datasets? [1/12] The answer is: it’s complicated. Different pretraining datasets work better for different downstream datasets. [2/12]
Aug 13, 2022 11 tweets 4 min read
"Language Models Can Teach Themselves to Program Better"

This paper changed my thinking about what future language models will be good at, mostly in a really concerning way. Let's start with some context: [1/11] To teach models to program, you used to give them a natural language prompt. But recent work has shown that you can instead just show them a unit test and tell them to… [2/11]
Aug 6, 2022 10 tweets 4 min read
"An Impartial Take to the CNN vs Transformer Robustness Contest"

Are vision transformers really better than CNNs? This paper strongly suggests an answer, based on a robustness throwdown between {ViT, Swin} vs {BiT, ConvNeXt}. [1/10] First, they measure the learning of spurious features using datasets designed to assess simplicity bias, background bias, and texture bias. The transformers and the CNNs behave similarly. [2/10]
Aug 4, 2022 8 tweets 5 min read
"Discrete Key-Value Bottleneck"

An intuitive method for making models robust to distribution shift. They replace vectors in the latent space with their nearest centroids, with the clustering… [1/8] …and quantization applied separately to different slices of the feature space. The centroids are learned using a moving average process similar to minibatch k-means. [2/8]
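A rough numpy sketch of that core operation (nearest-centroid quantization per slice plus an EMA centroid update), not the paper's implementation:

```python
# Hedged sketch of the operation described above: split features into slices, snap each
# slice to its nearest centroid, and update centroids with an EMA, minibatch-k-means style.
import numpy as np

def quantize_slices(x, codebooks, decay=0.99):
    # x: (batch, num_slices, slice_dim); codebooks: (num_slices, num_codes, slice_dim)
    out = np.empty_like(x)
    for s in range(x.shape[1]):
        dists = ((x[:, s, None, :] - codebooks[s][None]) ** 2).sum(-1)  # (batch, num_codes)
        nearest = dists.argmin(-1)                                      # chosen code per item
        out[:, s] = codebooks[s][nearest]
        for code in np.unique(nearest):                                 # EMA centroid update
            mean = x[nearest == code, s].mean(0)
            codebooks[s][code] = decay * codebooks[s][code] + (1 - decay) * mean
    return out

x = np.random.randn(32, 4, 8)          # 4 slices of an 8-dim latent each
codebooks = np.random.randn(4, 16, 8)  # 16 centroids per slice
quantized = quantize_slices(x, codebooks)
```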