Davis Blalock
Research scientist + first hire @MosaicML. @MIT PhD. I write + retweet threads about machine learning papers. Paper summaries newsletter: https://t.co/xX7NIpsIVZ
Mar 1 10 tweets 3 min read
I've never seen claims of full bf16 parity with <2bit weights before, so there's reason to be cautiously optimistic here.

But since people seem to have the "optimistic" down, let me add some caution:

1) Despite the title, this paper does not use 1-bit weights. Instead, it uses ternary quantization, requiring each weight to be one of {-𝛼, 0, 𝛼} for some tensor-specific 𝛼. [1/n]

This takes *2* bits if stored naively, ~1.58 bits with perfect entropy coding, and 1.6 bits in the likely case that you pack 5 values into 8 bits (3^5 = 243 ≤ 2^8 = 256). [2/n]
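To make the 1.6 bits figure concrete, here's a minimal sketch of that 5-trits-per-byte packing (my own illustration, not code from the paper):

```python
import numpy as np

# Powers of 3 for interpreting 5 trits as one base-3 number: 3^4 .. 3^0.
POWERS = np.array([81, 27, 9, 3, 1], dtype=np.uint16)

def pack_ternary(trits: np.ndarray) -> np.ndarray:
    """Pack values in {0, 1, 2} into bytes, 5 trits per byte (1.6 bits/trit).

    Works because the largest packed value is 3^5 - 1 = 242 < 256.
    """
    assert trits.size % 5 == 0
    groups = trits.reshape(-1, 5).astype(np.uint16)
    return (groups * POWERS).sum(axis=1).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Invert pack_ternary: decode each byte back into 5 base-3 digits."""
    vals = packed.astype(np.uint16)
    digits = []
    for p in POWERS:
        digits.append(vals // p)
        vals = vals % p
    return np.stack(digits, axis=1).reshape(-1).astype(np.uint8)

# Round trip: ternary weights {-a, 0, a} map to trits {0, 1, 2} via sign + 1.
trits = np.random.randint(0, 3, size=20)
assert np.array_equal(unpack_ternary(pack_ternary(trits)), trits)
```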
Apr 29, 2023 9 tweets 2 min read
I've written about 500+ machine learning papers in the past year. Here are some of my most popular threads: [1/n]
Apr 23, 2023 11 tweets 5 min read
"FP8 versus INT8 for efficient deep learning inference"

Is fp8 just plain better than int8?

No. There are tradeoffs between the two at various levels of the stack, and this paper digs into their strengths and weaknesses. [1/11]

First, for a fixed number of bits, floating point addition takes more transistors. [2/11]
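As a quick illustration of why the two formats behave differently (my own sketch, not from the paper): int8 spaces its values uniformly, while an fp8 format like E4M3 clusters most of its values near zero. Assuming a simple IEEE-style E4M3 layout (4 exponent bits, 3 mantissa bits, bias 7), you can enumerate both grids:

```python
def e4m3_values():
    """Enumerate non-negative values of an IEEE-style E4M3 float8 (bias 7).

    Simplified: ignores the few bit patterns real fp8 formats reserve for NaN.
    """
    vals = []
    for e in range(16):          # 4 exponent bits
        for m in range(8):       # 3 mantissa bits
            if e == 0:           # subnormal: no implicit leading 1
                vals.append((m / 8) * 2 ** (1 - 7))
            else:                # normal: implicit leading 1
                vals.append((1 + m / 8) * 2 ** (e - 7))
    return sorted(set(vals))

fp8 = e4m3_values()
int8 = list(range(1, 128))  # positive int8 values: uniform spacing of 1

# fp8 packs many values near zero and few near its max; int8 is uniform.
print(f"fp8 values below 1: {sum(v < 1 for v in fp8)} of {len(fp8)}")
print(f"int8 values below half of max: {sum(v < 64 for v in int8)} of {len(int8)}")
```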
Apr 22, 2023 13 tweets 6 min read
"UniverSeg: Universal Medical Image Segmentation"

What if we could train a single neural net to highlight important structures in any medical image given just a few examples? [1/13]

They make this happen by assembling a huge dataset, designing an appropriate model, and using a particular training setup.

First, they aggregate a ton of medical imaging datasets into a large corpus called MegaMedical. [2/13]
Apr 2, 2023 9 tweets 5 min read
"The effectiveness of MAE pre-pretraining for billion-scale pretraining"

Before you pretrain your vision model on image-caption pairs, you should pre-pretrain it with a masked autoencoding objective. This improves downstream accuracy across a variety of tasks. [1/9]

The first result here is that, as you’d hope, this approach works better when you use larger models and datasets. In particular, using a 3 billion sample Instagram {image, hashtag} dataset works better than just ImageNet-1k. [2/9]
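For readers unfamiliar with the masked autoencoding objective: the model sees only a small fraction of image patches and must reconstruct the rest. A minimal sketch of the masking step (my own illustration; the patch size and mask ratio are assumptions, not the paper's exact settings):

```python
import torch

def random_patch_mask(images: torch.Tensor, patch: int = 16, mask_ratio: float = 0.75):
    """Split images into patches and keep a random subset; the model must
    reconstruct the masked-out patches from the visible ones."""
    B, C, H, W = images.shape
    # Cut each image into non-overlapping patch*patch tiles, flattened per patch.
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.reshape(B, C, -1, patch, patch).permute(0, 2, 1, 3, 4)
    patches = patches.flatten(2)                       # (B, num_patches, patch*patch*C)
    n = patches.shape[1]
    keep = int(n * (1 - mask_ratio))
    idx = torch.rand(B, n).argsort(dim=1)[:, :keep]    # random subset per image
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return visible, idx  # encoder sees only `visible`; decoder reconstructs the rest

imgs = torch.randn(4, 3, 224, 224)
visible, idx = random_patch_mask(imgs)
print(visible.shape)  # (4, 49, 768): 25% of 196 patches, each 16*16*3 values
```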
Apr 1, 2023 11 tweets 5 min read
Imagine a world where keyboards only let you type sentences that the keyboard manufacturer agrees with.

Or where spellcheck and autocorrect work if you're arguing for one side of a debate, but not the other.

That's the world we're building with AI services like ChatGPT. [1/11]

The examples above aren't cherrypicked. Choose any controversy and there's a good chance ChatGPT will only help you support one side.

But is this inevitable? What's really going on here? [2/11]
Mar 29, 2023 16 tweets 6 min read
"SemDeDup: Data-efficient learning at web-scale through semantic deduplication"

They propose to deduplicate training sets by embedding each sample, clustering all the embeddings with k-means, and then… [1/16]

…eliminating samples within a cluster that have too high a cosine similarity to one another. The idea is that eliminating redundancy in the training set should improve training. [2/16]
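A minimal sketch of that pipeline (my own illustration of the idea, not the paper's code; the cluster count and similarity threshold are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings: np.ndarray, n_clusters: int = 100, threshold: float = 0.95):
    """Return indices of samples to keep: within each k-means cluster, drop
    samples whose cosine similarity to an already-kept sample exceeds threshold."""
    # Normalize so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = []
        for i in idx:
            # Keep i only if it isn't a near-duplicate of anything already kept.
            if not kept or np.max(emb[kept] @ emb[i]) < threshold:
                kept.append(i)
        keep.extend(kept)
    return np.array(sorted(keep))

emb = np.random.randn(1000, 64).astype(np.float32)
print(len(semdedup(emb)), "of 1000 samples kept")
```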
Mar 16, 2023 21 tweets 6 min read
Here are the highlights from #OpenAI's 98-page technical report on #gpt4: [1/n]

First, scaling is still going strong. We haven’t saturated the log-log-linear trend yet. [2/n]
Dec 13, 2022 21 tweets 5 min read
Here are all the ways to get around ChatGPT's safeguards:

[1/n]
Dec 10, 2022 16 tweets 6 min read
"Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities"

We wrote a big ol’ survey paper on efficient neural net training, with an emphasis on practical recommendations. There are four main elements: [1/16]

First is just collecting and categorizing a lot of papers on efficient training. We didn’t capture all of them, but there’s a wide selection that should give you a feel for what’s out there. [2/16]
Oct 29, 2022 14 tweets 9 min read
"What Makes Convolutional Models Great on Long Sequence Modeling?"

CNNs—not transformers—now dominate the hardest sequence modeling benchmark.

Here's how this happened: [1/14]

First, let's be precise: by "the hardest sequence modeling benchmark," I mean the Long Range Arena (arxiv.org/abs/2011.04006). This consists of tasks like Pathfinder that are designed to require modeling of long-range dependencies. [2/14]
Oct 22, 2022 12 tweets 6 min read
"How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization"

A ton of cool experiments on the effects of data augmentations and when you should use different ones. [1/12]

One observation is that different augmentations help in different data regimes. With not much data, aggressive augmentations are better. With more data, conservative augmentations like horizontal flipping… [2/12]
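To make the aggressive-vs-conservative distinction concrete, here's what the two regimes might look like in torchvision (my own example policies, not the paper's exact configurations):

```python
from torchvision import transforms

# Conservative: cheap, label-preserving transforms. Favored when data is plentiful.
conservative = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Aggressive: strong distortions that effectively multiply the dataset.
# Favored when data is scarce.
aggressive = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.3, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandAugment(),
    transforms.ToTensor(),
])
```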
Oct 21, 2022 14 tweets 6 min read
"Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers"

Transformer activations tend to be really sparse. What's up with this? And can we exploit it? [1/14]

First, it’s not that most neurons are dead, but that nearly all neurons fire rarely. It’s only a handful that fire more than half the time. [2/14]
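"Firing" here means a post-ReLU activation being nonzero. A minimal sketch of how you'd measure per-neuron firing rates in an FFN block (my own illustration, not the paper's code):

```python
import torch

torch.manual_seed(0)
d_model, d_ff, n_tokens = 64, 256, 10_000

# Stand-in for the first linear layer + ReLU of a transformer FFN block.
w1 = torch.randn(d_model, d_ff) / d_model ** 0.5
acts = torch.relu(torch.randn(n_tokens, d_model) @ w1)  # (n_tokens, d_ff)

# Firing rate of each hidden unit: fraction of tokens where it's nonzero.
rates = (acts > 0).float().mean(dim=0)
print(f"mean firing rate: {rates.mean():.2f}")
print(f"units firing >50% of the time: {(rates > 0.5).sum().item()} / {d_ff}")
# Note: random weights give ~50% firing rates; the paper's finding is that
# *trained* transformers end up with far lower rates for nearly all units.
```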
Oct 20, 2022 6 tweets 3 min read
"Scaling Laws for a Multi-Agent Reinforcement Learning Model"

Do log-log scaling laws show up for reinforcement learning like they do for language modeling? [1/6]

At least for the Connect Four and Pentago agents they trained, the answer is yes. And interestingly, the exponents for the two games are nearly identical. [2/6]
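A log-log scaling law just means performance follows a power law in model size or compute, so it traces a straight line on log-log axes. A minimal sketch of how you'd fit the exponent (my own illustration, with synthetic data):

```python
import numpy as np

# Synthetic data: loss ~ c * N^(-alpha) with noise, N = parameter count.
rng = np.random.default_rng(0)
N = np.logspace(5, 9, 20)
loss = 3.0 * N ** -0.25 * np.exp(rng.normal(0, 0.02, N.shape))

# A power law is linear in log-log space: log(loss) = log(c) - alpha * log(N).
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
print(f"fitted exponent alpha = {-slope:.3f}")  # ~0.25
```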
Oct 11, 2022 12 tweets 4 min read
"Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability"

They show why, under reasonable assumptions, gradient descent tends to hover at the edge of stability. Basically, it's a 3-step negative feedback loop. [1/12]

To understand this feedback loop, first recall that the edge of stability means that the operator norm of the local Hessian (the "curvature") is 2/η, where η is the learning rate. 2/η is the most curvature one can have without diverging. [2/12]
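You can see where the 2/η threshold comes from with a one-dimensional quadratic (a standard textbook example, not from the paper): on f(x) = ½λx², each gradient descent step multiplies x by (1 − ηλ), which diverges exactly when λ > 2/η.

```python
def gd_on_quadratic(curvature: float, lr: float, steps: int = 50, x0: float = 1.0):
    """Run GD on f(x) = 0.5 * curvature * x^2; each step is x <- x * (1 - lr*curvature)."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    return x

lr = 0.1  # so the stability threshold is 2 / lr = 20
print(abs(gd_on_quadratic(19.9, lr)))  # curvature just below 2/lr: shrinks toward 0
print(abs(gd_on_quadratic(20.1, lr)))  # curvature just above 2/lr: blows up
```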
Sep 24, 2022 11 tweets 3 min read
"On the Horizon: Interactive and Compositional Deepfakes"

So you probably know that neural nets can generate videos of people saying stuff they never said. But Microsoft’s chief science officer articulates two threats beyond this that could be way worse: [1/11]

The first is the "interactive deepfake." This is not just static content, but the illusion of talking to a real person. Imagine a scammer calling your grandmom who looks and sounds exactly like you. Or thinking you're meeting someone online but actually it's a bot. [2/11]
Sep 8, 2022 8 tweets 3 min read
"Normalized Activation Function: Toward Better Convergence"

Ops like BatchNorm and LayerNorm keep the activation variances constant in the forward pass. But what about gradient variances in the backward pass? Can we add ops that solve vanishing/exploding gradients? [1/8]

Surprisingly...yes. The idea is simple: if your activation function is λ times as steep, your gradient with respect to the input gets scaled by λ. So if you do… [2/8]
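You can verify the gradient-scaling claim in a few lines of autograd (my own check, not the paper's code): by the chain rule, replacing g(x) with λ·g(x) multiplies ∂out/∂x by λ.

```python
import torch

lam = 3.0
x = torch.randn(5, requires_grad=True)

# Gradient through the base activation.
torch.relu(x).sum().backward()
g1 = x.grad.clone(); x.grad = None

# Gradient through an activation that's lam times as steep.
(lam * torch.relu(x)).sum().backward()
g2 = x.grad

print(torch.allclose(g2, lam * g1))  # True: the input gradient is scaled by lam
```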
Sep 7, 2022 9 tweets 4 min read
"Self-Supervised Pretraining for 2D Medical Image Segmentation"

How should you pretrain your model if your goal is maximizing downstream segmentation quality? [1/9]

In particular, should you 1) do supervised or self-supervised pretraining on a general-purpose corpus like ImageNet? And 2) bother adding a second self-supervised learning step on domain-specific data? [2/9]
Sep 6, 2022 9 tweets 10 min read
"Lottery Pools: Winning More by Interpolating Tickets without Increasing Training or Inference Cost"

So let’s say you’re pruning a neural net and want the best model you can get at the end. More precisely,… [1/9]

…suppose you're iteratively training the network, picking a subset of weights to keep, rewinding those weights back to a checkpoint early in training, and then fine-tuning from there (see @alex_renda_, @jefrankle, & @mcarbin 2020: arxiv.org/abs/2003.02389). [2/9]
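A minimal sketch of one round of that iterative magnitude pruning with weight rewinding (my own paraphrase of the procedure; `train`, the step counts, and the prune fraction are hypothetical stand-ins, not any paper's exact API):

```python
import copy
import torch

def imp_round(model, train, prune_frac=0.2, rewind_state=None):
    """One round of iterative magnitude pruning with weight rewinding.

    `model` is any torch.nn.Module; `train` is a hypothetical function that
    trains it in place. Both are stand-ins for illustration.
    """
    if rewind_state is None:
        train(model, steps=1000)                          # brief early training
        rewind_state = copy.deepcopy(model.state_dict())  # checkpoint to rewind to
    train(model, steps=50_000)                            # train to completion
    for name, p in model.named_parameters():
        if p.dim() < 2:                                   # skip biases / norm params
            continue
        # Keep the largest-magnitude weights; zero out the lowest prune_frac.
        k = max(1, int(prune_frac * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        mask = (p.abs() > threshold).float()
        # Rewind surviving weights to the early checkpoint, then apply the mask.
        p.data = rewind_state[name] * mask
    return model, rewind_state  # caller fine-tunes and repeats for more sparsity
```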
Sep 5, 2022 8 tweets 4 min read
"Bugs in the Data: How ImageNet Misrepresents Biodiversity"

ImageNet-1k labels for animals are especially bad. [1/8]

While ImageNet as a whole is about 10% mislabeled, it seems to be 12.3% mislabeled for animals. [2/8]
Sep 3, 2022 14 tweets 5 min read
"Using Large Language Models to Simulate Multiple Humans"

What if we asked language models to complete text describing what a human would do in a situation? Would they produce realistic answers? How close to human behavior would they get? [1/14]

The authors of this paper answer these questions by simulating classic psych studies, with participant responses given by GPT-3 variants. [2/14]