Aman Sanger · Aug 12 · 4 tweets · 1 min read
Sub-600ms-latency conversational speech AI is completely possible today. I'm surprised I haven't seen anyone ship it.

The key is hosting a model (like Llama) yourself, streaming transcripts from Whisper, and, every few tokens, prefilling more of the KV cache, without ever evicting it from GPU memory. (1/4)
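As a rough illustration of that incremental prefill, here's a minimal sketch assuming a HuggingFace-style causal LM interface; the model name and chunking policy are illustrative, not a reference implementation:

```python
# Sketch: prefill the KV cache a few tokens at a time as Whisper streams
# text in, so nearly the whole prompt is processed by the time the user
# stops talking. Assumes a HuggingFace-style causal LM; model name is
# illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

past = None  # KV cache; kept on-GPU across chunks, never evicted

@torch.no_grad()
def prefill(text_chunk: str):
    """Push a freshly transcribed chunk into the KV cache."""
    global past
    # Caveat: tokenizing chunks independently can split words differently
    # than tokenizing the full transcript; a real system would re-tokenize
    # at word boundaries.
    ids = tok(text_chunk, add_special_tokens=False,
              return_tensors="pt").input_ids.to(model.device)
    out = model(input_ids=ids, past_key_values=past, use_cache=True)
    past = out.past_key_values  # next chunk only pays for its own tokens
```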
Whisper could then transcribe the remaining text within ~100ms of the user finishing speaking.

This means the time to first voice response is roughly: process the last several tokens of input, generate the first ~20 output tokens, then pass those into a text-to-speech model. (2/4)
With quantized Llama 70B using 2-way tensor parallelism, you could process and generate those tokens in ~0.4s on two A100s. (3/4)
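A back-of-envelope check on that 0.4s figure (my assumptions, not the thread's): int8 decoding is memory-bandwidth-bound, so per-token latency is roughly the weight bytes each GPU reads divided by its HBM bandwidth:

```python
# Rough decode-latency estimate for int8 Llama 70B on 2x A100 (80GB).
# All numbers are assumptions for illustration.
params = 70e9
bytes_per_param = 1          # int8 quantization
n_gpus = 2                   # 2-way tensor parallelism
hbm_bw = 2.0e12              # ~2 TB/s per A100 80GB

weight_bytes_per_gpu = params * bytes_per_param / n_gpus
per_token_s = weight_bytes_per_gpu / hbm_bw   # ~0.0175 s/token
first_20_tokens_s = 20 * per_token_s          # ~0.35 s

print(f"{per_token_s*1e3:.1f} ms/token, {first_20_tokens_s:.2f} s for 20 tokens")
# ~17.5 ms/token -> ~0.35 s, leaving margin for the final prefill within 0.4 s
```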
The last step is getting a text-to-speech model to start talking within 100ms.

Again, this feels quite achievable if you host the model yourself and stream the text into its KV cache.

But I'm less confident about the state of open-source text-to-speech models. (4/4)
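For what it's worth, the plumbing side of streaming TTS is simple to sketch; `synthesize` and `play` below are hypothetical stand-ins for whatever TTS and audio stack is available:

```python
# Sketch: pipeline the LLM token stream into TTS at clause boundaries so
# audio starts playing before generation finishes. `synthesize` and `play`
# are hypothetical stand-ins, not a real library API.
BOUNDARY = {".", ",", "?", "!", ";"}

def speak_streaming(token_stream, synthesize, play):
    buf = []
    for tok in token_stream:
        buf.append(tok)
        # Flush on clause boundaries: long enough for natural prosody,
        # short enough that the first audio lands soon after the first
        # clause is generated.
        if tok.strip() and tok.strip()[-1] in BOUNDARY:
            play(synthesize("".join(buf)))
            buf = []
    if buf:  # flush whatever trails the last boundary
        play(synthesize("".join(buf)))
```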

More from @amanrsanger

May 18
PaLM 2 has been leaked to be 340B params, trained on 3.6T tokens (~7.4e24 FLOPs by the 6ND rule).

Someone out there could feasibly reproduce a similar quality model...

for under $6M!

But that price tag largely depends on H100s...

[1/6]
First, let's see what happens when we try to use A100s.

At this model scale, A100s can hit 60%+ of their max flop count (312 TFLOPs).

So, running on A100s takes 7.4e24 / (0.6 × 312e12 FLOP/s) ≈ 4.0e10 GPU-seconds ≈ 11M A100-hours.

Given (dirt-cheap) $1/hr A100 pricing, that's $11M.

[2/6]
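The same arithmetic as a quick script (the 60% utilization and $1/hr are the thread's own assumptions):

```python
# Reproduce the A100 cost estimate.
flops_needed = 7.4e24            # ~6 * 340e9 params * 3.6e12 tokens
a100_peak = 312e12               # bf16 peak, FLOP/s
mfu = 0.60                       # assumed utilization at this scale

gpu_seconds = flops_needed / (a100_peak * mfu)
gpu_hours = gpu_seconds / 3600   # ~1.1e7
cost = gpu_hours * 1.0           # at $1/hr
print(f"{gpu_hours/1e6:.0f}M A100-hours ≈ ${cost/1e6:.0f}M")
```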
H100s have a max flop count of 2000 TFLOPs (at fp8).

But we probably won't get near 60% of max FLOPs, because of memory bandwidth.

Memory bandwidth increases from 1.5 TB/s on A100s to 3.35 TB/s on H100s.

Since fp8 weights take half the memory of bf16, that's an effective 4.5x increase in memory bandwidth...

[3/6]
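Numerically (my calculation, following the thread's figures), compute grows faster than effective bandwidth, which is why fp8 utilization should land well below the A100's 60%:

```python
# Why utilization drops: compute scales faster than effective bandwidth.
a100_flops, h100_flops = 312e12, 2000e12         # bf16 vs fp8 peak, FLOP/s
a100_bw, h100_bw = 1.5e12, 3.35e12               # HBM bandwidth, B/s
fp8_factor = 2                                   # fp8 halves bytes moved vs bf16

flops_ratio = h100_flops / a100_flops            # ~6.4x
eff_bw_ratio = (h100_bw / a100_bw) * fp8_factor  # ~4.5x
print(f"compute x{flops_ratio:.1f} vs effective bandwidth x{eff_bw_ratio:.1f}")
# Bandwidth grows slower than compute, so fp8 MFU lands below 60%.
```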
May 12
Llama and many recent open-source models have a significant architectural limitation

They use multi-head attention instead of multi-query attention (which is used by PaLM and probably Claude 100K)

This can result in slowdowns of up to 30x

Here's the math behind why. (1/n)
Let's consider the widely used 7B-param Llama architecture.

It has 32 layers, 32 heads, and d_k, d_v sizes of 128

The key issue with multi-head attention is the cost of repeatedly accessing the previously computed attention keys and values during inference.

why? (2/n)
How does multi-head attention work when generating the Nth token?

To generate each token, we calculate a query for each head. Then we look at the keys and values for previous tokens and apply the op:

softmax(q^T K / sqrt(d_k)) V^T

where, per head, q has shape (d_k, 1), K has shape (d_k, N), and V has shape (d_v, N), so the output has shape (1, d_v)

(3/12)
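To put numbers on that (my arithmetic, not from the thread), here is the per-token KV-cache footprint for this 7B config under MHA vs. MQA, in fp16:

```python
# Per-token KV-cache footprint for the 7B Llama config, MHA vs MQA.
layers, heads, d_head = 32, 32, 128
bytes_fp16 = 2

# Multi-head: every layer stores K and V for all 32 heads.
mha = layers * heads * 2 * d_head * bytes_fp16   # 524,288 B ≈ 0.5 MB/token
# Multi-query: one shared K/V head per layer.
mqa = layers * 1 * 2 * d_head * bytes_fp16       # 16,384 B = 16 KB/token

print(f"MHA: {mha/1e6:.2f} MB/token, MQA: {mqa/1e3:.0f} KB/token, "
      f"ratio {mha/mqa:.0f}x")
# 32x less cache traffic per decoded token, which is where slowdowns
# approaching the quoted 30x come from.
```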
May 11
The size of all code/history in GitHub public repos is 92TB.

The size of Google's monorepo in 2015 was 86TB (of much higher quality code)

If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else.
To be fair, they would have to use the diff/history data much more than the raw code.

There is almost certainly more code at GitHub HEAD, but Google may benefit from a richer commit history.

And I'd suspect the size of their monorepo has substantially increased since then.
Furthermore, these high-quality code tokens aren't just good for code; they're incredibly useful for general language modeling performance.

May 10
PaLM 2 just dropped, and there are claims that the largest model is just 14.7B params.

In reality, the model is probably closer to 100B parameters

But why... (1/5)
The 14.7B param number comes from the scaling experiments they ran. The largest model in these experiments was 14.7B params

A (very rough) approximation of the curves from the paper gives something like:

n_params = sqrt(FLOPs / 100)

(2/5)
We also know PaLM 2 used more FLOPs than PaLM, which required 2.52e24 FLOPs [1]

For that number of FLOPs, the curve's optimal param count is ~160B

(3/5)
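Sanity-checking that with the fitted curve (my arithmetic):

```python
import math

# Plug PaLM 1's training compute into n_params = sqrt(FLOPs / 100).
palm1_flops = 2.52e24
n_opt = math.sqrt(palm1_flops / 100)   # ~1.59e11
print(f"{n_opt/1e9:.0f}B params")      # ~159B ≈ the 160B quoted
```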
Mar 16
Want to code using GPT-4? We made an IDE built for programming alongside it

Try out the public beta here: cursor.so
We're also thrilled to announce our partnership with @openai through their early startup program!

After partnering with them, we were given early access to GPT-4.

Since then, we've worked on designing an interface that captures how powerful these models truly are.
And now that GPT-4 is publicly available, we have a flurry of features we'll be shipping soon.

It completely changes the game on what is possible with AI-assisted programming.
Mar 1
There are times and places for training your own models... but with the release of OpenAI's ChatGPT API, coding is looking less and less like one of them.

The HumanEval pass@1 rate of ChatGPT is as good as the best open-source model's pass@100 rate.

And this is still just GPT-3.5...
Not only that, but it has 10x better pricing than text-davinci, and far lower latency.

After seeing this news today, I really would not want to be one of OpenAI's competitors
For those unfamiliar with pass@k: this means that if I took the best open-source code model (CodeGen 16B) and sampled 100 generations, the probability that at least 1 of those 100 is correct is the same as the probability that ChatGPT gets it right on the first try.
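For reference, here is the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); the sample counts in the usage line are hypothetical:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one
    of k samples is correct, given c of n total samples passed."""
    if n - c < k:
        return 1.0  # too few failures to draw k samples with no success
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. (hypothetical numbers) if 5 of 200 samples pass, pass@100 ≈ 0.97:
print(round(pass_at_k(200, 5, 100), 3))
```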
