Davis Blalock
Mar 16 · 21 tweets · 6 min read
Here are the highlights from #OpenAI's 98-page technical report on #gpt4: [1/n]
First, scaling is still going strong. We haven’t saturated the log-log-linear trend yet. [2/n]
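The report plots these trends on log-log axes, where a power law appears as a straight line. A minimal sketch of what "log-log-linear" means, with made-up coefficients (nothing here is from OpenAI's data):

```python
import numpy as np

# Hypothetical power law: loss = a * compute^(-b).
# On log-log axes this is a line: log(loss) = log(a) - b * log(compute).
a, b = 2.0, 0.05                      # made-up coefficients, not from the report
compute = np.logspace(18, 24, 7)      # synthetic training-compute values (FLOPs)
loss = a * compute ** (-b)

# Fitting a line in log-log space recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent: {-slope:.3f}, prefactor: {np.exp(intercept):.3f}")
```

"Not saturated yet" means observed losses are still falling along this line as compute grows, rather than flattening below it.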
This holds not just for the pretraining objective (next word prediction), but also for various downstream tasks. [3/n]
GPT-4 avoids at least some failure modes of previous large models. E.g., it bucks the inverse-scaling trend on "hindsight neglect," where larger models increasingly judge a risky bet by how it actually turned out rather than by its expected value. [4/n]
GPT-4 is at least as good as GPT-3 on standardized tests and usually better than most humans.

Interestingly, it’s better at the GRE math and verbal sections than the writing section, despite being trained to…well, write. [5/n]
It also does great on academic benchmarks, often cutting the error rate by more than half vs the previous state-of-the-art. [6/n]
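To make the "cutting the error rate by more than half" framing concrete, here's a toy calculation with hypothetical accuracies (not numbers from the report): a modest-looking absolute accuracy gain can be a large relative error reduction.

```python
# Made-up benchmark accuracies illustrating relative error reduction.
prev_acc, gpt4_acc = 0.90, 0.96       # hypothetical, not from the report
prev_err, gpt4_err = 1 - prev_acc, 1 - gpt4_acc

# A 6-point accuracy gain here is a 60% cut in error rate.
reduction = 1 - gpt4_err / prev_err
print(f"error rate: {prev_err:.2f} -> {gpt4_err:.2f} ({reduction:.0%} reduction)")
```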
It’s best at understanding English, but is also good at various other languages—especially the ones with more data. [7/n]
It’s better at responding with correct information than GPT-3 was, even when you try to trick it. [8/n]
Interestingly, training with human feedback offers more improvement on truthfulness than switching from GPT-3 to GPT-4 does. [9/n]
But this improvement in truthfulness comes at the cost of the output probabilities no longer being calibrated well. [10/n]
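("Calibrated" means the model's stated confidence matches how often it's actually right; the report shows this degrading after RLHF.) A standard way to quantify miscalibration is expected calibration error. A minimal sketch on synthetic predictions, not GPT-4 outputs:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; compare mean confidence to
    empirical accuracy in each bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy model: 70% confidence, right 70% of the time.
conf = np.full(1000, 0.7)
correct = np.array([1] * 700 + [0] * 300)
print(expected_calibration_error(conf, correct))  # ~0.0
```

The same 70%-accurate predictions reported at 90% confidence would score an ECE of about 0.2, which is the kind of gap the report's calibration plots show growing after RLHF.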
GPT-4 is better at identifying when it should refuse to answer your question (for the creators’ definition of “should”). [11/n]
And last but not least, a major change is that it can take in images, not just text. This isn’t generally available through the API yet though. [12/n]
So…what does this mean?

First, OpenAI is not even pretending to be “open” anymore. [13/n]

Second, getting a model like this working is a huge effort. They have three pages of credits. [14/n]
Part of its high accuracy might stem from data leakage. [15/n]
On the other hand, if your training set is the whole internet and most books ever written, leakage doesn’t always matter.

E.g., Google search is supposed to return results straight from its “training set”. [16/n]
Lastly, it looks like we’re headed for a world where AI makes it dirt cheap to generate content advocating the creators’ viewpoints but not others.

The first example suggests that just holding the wrong views can make it refuse to help you—even if you aren’t requesting advocacy. [17/n]
Content moderation is hard, and it looks like this problem is now getting generalized to content generation. [18/n]
Overall, there weren’t too many surprises in this report. It’s GPT-3 + images with even better output quality, which is more or less what the rumors indicated. [19/n]
That said, the fact that they don’t advertise as many huge, qualitative jumps is a testament to just how good GPT-3 is—e.g., if you’re already writing a coherent essay in perfect English, there’s not much room to do much better.

Overall, impressive work by OpenAI. [20/20]
P.S.: If you like these sorts of breakdowns of the latest machine learning research, you might like following me or my newsletter: dblalock.substack.com
