By “dumpster fire”, I mean not just well-known issues like vanishing gradients or loss spikes, but also subtle stuff like the variance of your token embeddings collapsing in hard-to-model ways as your sequence length grows. [2/11]
https://twitter.com/_akhaliq/status/1763374329457189283
…it uses ternary quantization, requiring each weight to be one of {-α, 0, α} for some tensor-specific α.
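For concreteness, here's a minimal NumPy sketch of tensor-wise ternary quantization. Setting α to the mean absolute weight ("absmean" scaling) is one common convention, and the function name is mine; treat this as an illustration, not the paper's exact procedure:

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Quantize a weight tensor to {-alpha, 0, alpha}.

    alpha is chosen per tensor; here it's the mean absolute weight
    ("absmean" scaling), but other choices are possible.
    """
    alpha = np.abs(w).mean() + eps
    # Scale to roughly unit range, round to the nearest of {-1, 0, 1},
    # then scale back so entries live on {-alpha, 0, alpha}.
    return alpha * np.clip(np.round(w / alpha), -1, 1)

w = np.random.randn(4, 4).astype(np.float32)
print(np.unique(ternarize(w)))  # at most three values: -alpha, 0, alpha
```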
https://twitter.com/davisblalock/status/1558347542101839873
First, for a fixed number of bits, floating-point addition takes more transistors than integer addition. [2/11]
They make this happen by assembling a huge dataset, designing an appropriate model, and using a particular training setup.
The first result here is that, as you’d hope, this approach works better when you use larger models and datasets. In particular, using a 3-billion-sample Instagram {image, hashtag} dataset works better than just ImageNet-1k. [2/9]
The examples above aren't cherry-picked. Choose any controversy and there's a good chance ChatGPT will only help you support one side.
…eliminating samples within a cluster that have too high a cosine similarity to one another. The idea is that eliminating redundancy in the training set should improve training. [2/16]
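A rough sketch of what that within-cluster filtering could look like, assuming you already have an embedding per sample; the greedy keep-first strategy and the 0.9 threshold are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def dedup_cluster(embs: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices to keep within one cluster, greedily dropping any
    sample whose cosine similarity to an already-kept sample is too high.

    embs: (n_samples, dim) array of embeddings for one cluster.
    """
    # Normalize rows so plain dot products are cosine similarities.
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if not kept or (normed[kept] @ normed[i]).max() < threshold:
            kept.append(i)
    return kept

cluster = np.random.randn(100, 64)  # stand-in for one cluster's embeddings
print(len(dedup_cluster(cluster)))  # random vectors are rarely near-duplicates
```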
https://twitter.com/goodside/status/1598253337400717313
First is just collecting and categorizing a lot of papers on efficient training. We didn’t capture all of them, but there’s a wide selection that should give you a feel for what’s out there. [2/16]
First, let's be precise: by "the hardest sequence modeling benchmark," I mean the Long Range Arena (arxiv.org/abs/2011.04006). This consists of tasks like Pathfinder (shown above) that are designed to require modeling of long-range dependencies. [2/14]
One observation is that different augmentations help in different data regimes. With not much data, aggressive augmentations are better. With more data, conservative augmentations like horizontal flipping… [2/12]
First, it’s not that most neurons are dead, but that nearly all neurons fire rarely. It’s only a handful that fire more than half the time. [2/14]
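Here's a toy sketch of how one might measure this; the model, the data, and the definition of "fires" (post-ReLU output strictly positive) are my assumptions for illustration, not the paper's setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in layer; in practice you'd hook activations in a trained model.
layer = nn.Sequential(nn.Linear(128, 512), nn.ReLU())
inputs = torch.randn(10_000, 128)  # stand-in for a dataset

with torch.no_grad():
    acts = layer(inputs)                      # (n_samples, n_neurons)
    firing_rate = (acts > 0).float().mean(0)  # fraction of inputs each neuron fires on

print("neurons firing on >50% of inputs:", (firing_rate > 0.5).sum().item())
print("median firing rate:", firing_rate.median().item())
```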
At least for the Connect Four and Pentago agents they trained, the answer is yes. And interestingly, the exponents for the two games are nearly identical. [2/6]
To understand this feedback loop, first recall that the edge of stability means that the operator norm of the local Hessian (the "curvature") is 2/η, where η is the learning rate. 2/η is the most curvature one can have without diverging. [2/12]
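To see where 2/η comes from, run gradient descent on a one-dimensional quadratic; this is the textbook stability argument, not anything specific to the paper:

```latex
% f(x) = (lambda/2) x^2 has constant curvature f''(x) = lambda.
% One gradient descent step with learning rate eta:
\[
  x_{t+1} \;=\; x_t - \eta f'(x_t) \;=\; (1 - \eta\lambda)\,x_t,
\]
% so the iterates stay bounded iff |1 - eta*lambda| <= 1, i.e.
\[
  \lambda \;\le\; \frac{2}{\eta}.
\]
```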
The first is the "interactive deepfake." This is not just static content, but the illusion of talking to a real person. Imagine a scammer who looks and sounds exactly like you calling your grandmother. Or thinking you're meeting someone online when it's actually a bot. [2/11]
Surprisingly...yes. The idea is simple: if your activation function is λ times as steep, your gradient with respect to the input gets scaled by λ. So if you do… [2/8]
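You can verify the scaling claim directly with autograd; this toy check (mine, not from the thread) compares gradients through relu(x) and the λ-times-steeper lam * relu(x):

```python
import torch

lam = 3.0
x = torch.randn(8, requires_grad=True)

# Gradient through the baseline activation.
torch.relu(x).sum().backward()
base_grad = x.grad.clone()

# Gradient through an activation that is lam times as steep.
x.grad = None
(lam * torch.relu(x)).sum().backward()

print(torch.allclose(x.grad, lam * base_grad))  # True: input grads scale by lam
```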
In particular, should you 1) do supervised or self-supervised pretraining on a general-purpose corpus like ImageNet? And 2) bother adding a second self-supervised learning step on domain-specific data? [2/9]
…suppose you're iteratively training the network, picking a subset of weights to keep, rewinding those weights back to a checkpoint early in training, and then fine-tuning from there (see @alex_renda_, @jefrankle, & @mcarbin 2020: arxiv.org/abs/2003.02389) [2/9]
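A much-simplified, runnable sketch of that loop; the toy model, the global magnitude-pruning criterion, and the fixed 50% sparsity are placeholders, and the real protocol (per-round pruning schedules, mask compounding, real data) lives in the linked paper:

```python
import copy
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float) -> dict:
    """Keep the largest-magnitude weights globally; mask the rest to zero."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    cutoff = torch.quantile(flat, sparsity)
    return {n: (p.detach().abs() > cutoff).float()
            for n, p in model.named_parameters()}

def train(model, steps, mask=None):
    """Toy regression loop on random data; `mask` holds pruned weights at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        x, y = torch.randn(32, 20), torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if mask is not None:
            with torch.no_grad():
                for n, p in model.named_parameters():
                    p.mul_(mask[n])

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
train(model, steps=100)                           # train a little...
rewind_point = copy.deepcopy(model.state_dict())  # ...and save the early checkpoint

mask = None
for _ in range(3):                                # iterative prune / rewind / retrain
    train(model, steps=300, mask=mask)            # fine-tune under the current mask
    mask = magnitude_mask(model, sparsity=0.5)    # pick a subset of weights to keep
    model.load_state_dict(rewind_point)           # rewind them back to the checkpoint
```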