Davis Blalock
Mar 16, 2023 · 21 tweets · 6 min read
Here are the highlights from #OpenAI's 98-page technical report on #gpt4:
[1/n]
First, scaling is still going strong. We haven’t saturated the log-log-linear trend yet. [2/n]
This holds not just for the pretraining objective (next word prediction), but also for various downstream tasks. [3/n]
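(For the unfamiliar: "log-log-linear" means loss falls as a power law in compute, so the trend is a straight line when both axes are logarithmic. A minimal sketch, with made-up numbers rather than OpenAI's data, of how such a trend gets fit and extrapolated:)

```python
import numpy as np

# Hypothetical (compute, loss) pairs lying on a power law
# L(C) = a * C^(-b). Illustrative numbers only, not anything
# from the GPT-4 report.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 5.0 * compute ** -0.05

# A power law is a straight line in log-log space, so an
# ordinary least-squares fit on the logs recovers it.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"exponent b = {-slope:.3f}")  # ~0.05

# Extrapolating the fitted line is how scaling-law plots predict
# a bigger run's loss; "not saturated" means the real points are
# still landing on the line.
print(f"predicted loss at 1e23 FLOPs: {np.exp(intercept) * 1e23 ** slope:.3f}")
```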
GPT-4 overcomes at least some of the poor behavior of previous large models. E.g., it bucks the trend of larger models judging bets by their realized outcomes rather than by their expected values (the "hindsight neglect" task). [4/n]
GPT-4 is at least as good as GPT-3 on standardized tests and usually better than most humans.

Interestingly, it’s better at the GRE math and verbal sections than the writing section, despite being trained to… well, write. [5/n]
It also does great on academic benchmarks, often cutting the error rate by more than half vs the previous state-of-the-art. [6/n]
It’s best at understanding English, but is also good at various other languages—especially the ones with more data. [7/n]
It’s better at responding with correct information than GPT-3 was, even when you try to trick it. [8/n]
Interestingly, training with human feedback offers more improvement on truthfulness than switching from GPT-3 to GPT-4 does. [9/n]
But this improvement in truthfulness comes at a cost: the output probabilities are no longer well calibrated. [10/n]
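("Calibrated" means that when the model assigns an answer 70% probability, it's correct about 70% of the time. A minimal sketch of how calibration gets measured via expected calibration error, using hypothetical predictions rather than OpenAI's evaluation code:)

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare each bin's mean
    confidence to its empirical accuracy; a well-calibrated model
    scores near zero."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Hypothetical model that claims 90% confidence but is right only
# ~60% of the time, i.e., the post-RLHF miscalibration pattern.
rng = np.random.default_rng(0)
confidences = np.full(1000, 0.9)
correct = rng.random(1000) < 0.6
print(f"ECE: {expected_calibration_error(confidences, correct):.2f}")  # ~0.3
```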
GPT-4 is better at identifying when it should refuse to answer your question (for the creators’ definition of “should”). [11/n]
And last but not least, a major change is that it can take in images, not just text. This isn’t generally available through the API yet though. [12/n]
So…what does this mean?

First, OpenAI is not even pretending to be “open” anymore. [13/n]

Second, getting a model like this working is a huge effort. They have three pages of credits. [14/n]
Part of its high accuracy might stem from data leakage. [15/n]
On the other hand, if your training set is the whole internet and most books ever written, leakage doesn’t always matter.

E.g., Google search is supposed to only return results straight out of its “training set”. [16/n]
Lastly, it looks like we’re headed for a world where AI makes it dirt cheap to generate content advocating the creators’ viewpoints but not others.

The first example suggests that just holding the wrong views can make it refuse to help you—even if you aren’t requesting advocacy. [17/n]
Content moderation is hard, and it looks like this problem is now getting generalized to content generation. [18/n]
Overall, there weren’t too many surprises in this report. It’s GPT-3 + images with even better output quality, which is more or less what the rumors indicated. [19/n]
That said, the fact that they don’t advertise as many huge, qualitative jumps is a testament to just how good GPT-3 is—e.g., if you’re already writing coherent essays in perfect English, there’s not much room to do better.

Overall, impressive work by OpenAI. [20/20]
P.S.: If you like these sorts of breakdowns of the latest machine learning research, you might like following me or my newsletter: dblalock.substack.com

More from @davisblalock

Mar 1
I've never seen claims of full bf16 parity with <2bit weights before, so there's reason to be cautiously optimistic here.

But since people seem to have the "optimistic" down, let me add some caution:

1) Despite the title, this paper does not use 1-bit weights. Instead, [1/n]
...it uses ternary quantization, requiring each weight to be one of {-𝛼, 0, 𝛼} for some tensor-specific 𝛼.

This takes *2* bits if stored naively, ~1.58 bits with perfect entropy coding, and 1.6 bits in the likely case that you pack 5 values into 8 bits (3^5 = 243 ≤ 256). [2/n]
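(To make the arithmetic concrete, here's a sketch of that 1.6-bit packing — mine, not the paper's code: treat 5 ternary weights as the digits of a base-3 number, which always fits in one byte because 3^5 = 243 ≤ 256.)

```python
import math

# One ternary value carries log2(3) ≈ 1.585 bits of information;
# packing 5 per byte costs 8/5 = 1.6 bits per weight.
print(math.log2(3))  # 1.584...

def pack5(trits):
    """Pack 5 values from {-1, 0, +1} into one byte, base-3."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in trits:
        byte = byte * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
    return byte  # at most 242, so it fits in 8 bits

def unpack5(byte):
    """Invert pack5, recovering the 5 ternary values."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits[::-1]

assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
# At inference time each trit is scaled by the tensor's 𝛼.
```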
2) The paper doesn't compare to any other ternary quantization method, so it's unclear how well this particular scheme works.

There have been countless binary and ternary quantization papers over the past decade, so this would almost certainly not get through peer review. [3/n]
Apr 29, 2023
I've written about 500+ machine learning papers in the past year. Here are some of my most popular threads: [1/n]
Apr 23, 2023
"FP8 versus INT8 for efficient deep learning inference"

Is fp8 just plain better than int8?

No. There are tradeoffs between the two at various levels of the stack, and this paper digs into their strengths and weaknesses. [1/11]
First, for a fixed number of bits, floating point addition takes more transistors. [2/11]
The same is true of multipliers. [3/11]
Apr 22, 2023
"UniverSeg: Universal Medical Image Segmentation"

What if we could train a single neural net to highlight important structures in any medical image given just a few examples? [1/13]
They make this happen by assembling a huge dataset, designing an appropriate model, and using a particular training setup.

First, they aggregate a ton of medical imaging datasets into a large corpus called MegaMedical. [2/13]
Second, they design a modified U-Net whose blocks jointly look at the “query” image and a few reference images. Think of these reference images as in-context examples, and the combination of the input image and these examples as a visual "prompt". [3/13]
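(A rough sketch of that query/support interaction — my simplification for illustration, not the paper's implementation, whose blocks differ in detail: pair the query's features with each support example's features, convolve the pairs, and average over the support set.)

```python
import torch
import torch.nn as nn

class CrossBlockSketch(nn.Module):
    """Simplified stand-in for UniverSeg-style query/support mixing;
    the real architecture differs in detail."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, query, supports):
        # query: (B, C, H, W); supports: (B, S, C, H, W), where S is
        # the number of in-context (image, mask) examples.
        B, S, C, H, W = supports.shape
        q = query.unsqueeze(1).expand(B, S, C, H, W)
        pairs = torch.cat([q, supports], dim=2)          # (B, S, 2C, H, W)
        mixed = self.conv(pairs.reshape(B * S, 2 * C, H, W))
        mixed = mixed.reshape(B, S, C, H, W)
        # Averaging over S makes the block indifferent to the order
        # of the support examples, like a set rather than a sequence.
        return mixed.mean(dim=1)                         # (B, C, H, W)

block = CrossBlockSketch(channels=16)
out = block(torch.randn(2, 16, 32, 32), torch.randn(2, 3, 16, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```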
Apr 2, 2023
"The effectiveness of MAE pre-pretraining for billion-scale pretraining"

Before you pretrain your vision model on image-caption pairs, you should pre-pretrain it with a masked autoencoding objective. This improves downstream accuracy across a variety of tasks. [1/9]
The first result here is that, as you’d hope, this approach works better when you use larger models and datasets. In particular, using a 3 billion sample Instagram {image, hashtag} dataset works better than just ImageNet-1k. [2/9]
It’s not that surprising that throwing more compute at the training process gets you better results.

The key question is whether you’re better off spending some of that compute on a separate masked pre-pretraining phase instead of just pretraining more. [3/9]
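(For reference, the masked-autoencoding objective just hides most of an image's patches and trains the model to reconstruct them. A minimal sketch of the loss; `model` is a hypothetical stand-in, and this isn't the paper's code:)

```python
import torch

def mae_loss(patches, model, mask_ratio=0.75):
    """Masked autoencoding: hide a random subset of patches and
    score the reconstruction only on the hidden ones.
    patches: (B, N, D) flattened image patches."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N).argsort(dim=1)   # random patch order
    keep = perm[:, :n_keep]                  # indices the model sees
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(B, n_keep, D))
    recon = model(visible, keep, N)          # model predicts all N patches
    hidden = torch.ones(B, N, dtype=torch.bool)
    hidden.scatter_(1, keep, False)          # True where patches were masked
    return ((recon - patches) ** 2)[hidden].mean()

# Toy stand-in "model" that predicts zeros, just to show the call:
toy = lambda vis, keep, N: torch.zeros(vis.shape[0], N, vis.shape[2])
print(mae_loss(torch.randn(4, 196, 768), toy))
```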
Apr 1, 2023
Imagine a world where keyboards only let you type sentences that the keyboard manufacturer agrees with.

Or where spellcheck and autocorrect work if you're arguing for one side of a debate, but not the other.

That's the world we're building with AI services like ChatGPT. [1/11]
The examples above aren't cherrypicked. Choose any controversy and there's a good chance ChatGPT will only help you support one side.

But is this inevitable? What's really going on here? [2/11]
A simple explanation is "OpenAI is woke so ChatGPT is woke."

There's certainly truth to the claim that the model reflects OpenAI's values, insofar as they're the ones who control it.

But I claim that something subtler is happening. [3/11]

nationalreview.com/corner/chatgpt…
