Davis Blalock Profile picture
Research scientist + first hire @MosaicML. @MIT PhD. I write + retweet threads about machine learning papers. Paper summaries newsletter: https://t.co/xX7NIpsIVZ

Mar 16, 2023, 21 tweets

Here are the highlights from #OpenAI 's 98-page technical report on #gpt4:
[1/n]

First, scaling is still going strong. We haven’t saturated the log-log-linear trend yet. [2/n]

This holds not just for the pretraining objective (next word prediction), but also for various downstream tasks. [3/n]

GPT-4 overcomes at least some of the poor behavior of previous large models. E.g., it bucks the trend of larger models ignoring expected values once they learn what actually happened. [4/n]

GPT-4 is at least as good as GPT-3 on standardized tests and usually better than most humans.

Interestingly, it’s better at the GRE math and verbal sections than the writing section, despite it being trained to….well, write. [5/n]

It also does great on academic benchmarks, often cutting the error rate by more than half vs the previous state-of-the-art. [6/n]

It’s best at understanding English, but is also good at various other languages—especially the ones with more data. [7/n]

It’s better at responding with correct information than GPT-3 was, even when you try to trick it. [8/n]

Interestingly, training with human feedback offers more improvement on truthfulness than switching from GPT-3 to GPT-4 does. [9/n]

But this improvement in truthfulness comes at the cost of the output probabilities no longer being calibrated well. [10/n]

GPT-4 is better at identifying when it should refuse to answer your question (for the creators’ definition of “should”). [11/n]

And last but not least, a major change is that it can take in images, not just text. This isn’t generally available through the API yet though. [12/n]

So…what does this mean?

First, OpenAI is not even pretending to be “open” anymore. [13/n]

Second, getting a model like this working is a huge effort. They have three pages of credits. [14/n]

Part of its high accuracy might stem from data leakage. [15/n]

On the other hand, if your training set is the whole internet and most books ever written, leakage doesn’t always matter.

E.g., Google search is supposed to only return results straight out of its “training set”. [16/n]

Lastly, it looks like we’re headed for a world where AI makes it dirt cheap to generate content advocating the creators’ viewpoints but not others.

The first example suggests that just holding the wrong views can make it refuse to help you—even if you aren’t requesting advocacy.

Content moderation is hard, and it looks like this problem is now getting generalized to content generation. [18/n]

Overall, there weren’t too many surprises in this report. It’s GPT-3 + images with even better output quality, which is more or less what the rumors indicated. [19/n]

That said, the fact that they don’t advertise as many huge, qualitative jumps is testament to just how good GPT-3 is—e.g., if you’re already writing a coherent essay in perfect English, there’s not much room do much better.

Overall, impressive work by OpenAI. [20/20]

P.S.: If you like these sorts of breakdowns of the latest machine learning research, you might like following me or my newsletter: dblalock.substack.com

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling