Aman Sanger · Aug 12 · 4 tweets · 1 min read
Sub-600ms-latency conversational speech AI is completely possible today. I'm surprised I haven't seen anyone ship it.

The key is hosting a model (like Llama) yourself, streaming transcripts from Whisper, and, every few tokens, prefilling more of the KV cache, without ever evicting it from GPU memory. (1/4)
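As a rough illustration of that incremental prefill, here's a minimal sketch assuming a HuggingFace-style causal LM interface; the model name and chunking policy are illustrative, not a reference implementation:

```python
# Sketch: prefill the KV cache a few tokens at a time as Whisper streams
# text in, so nearly the whole prompt is processed by the time the user
# stops talking. Assumes a HuggingFace-style causal LM; model name is
# illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

past = None  # KV cache; kept on-GPU across chunks, never evicted

@torch.no_grad()
def prefill(text_chunk: str):
    """Push a freshly transcribed chunk into the KV cache."""
    global past
    # Caveat: tokenizing chunks independently can split words differently
    # than tokenizing the full transcript; a real system would re-tokenize
    # at word boundaries.
    ids = tok(text_chunk, add_special_tokens=False,
              return_tensors="pt").input_ids.to(model.device)
    out = model(input_ids=ids, past_key_values=past, use_cache=True)
    past = out.past_key_values  # next chunk only pays for its own tokens
```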
Whisper could then transcribe the remaining text within ~100ms of the user finishing speaking.

This means the time to first voice response is roughly: process the last several tokens of input, generate the first ~20 output tokens, then pass those into a text-to-speech model. (2/4)
With quantized Llama 70B using 2-way tensor parallelism, you could process and generate those tokens in ~0.4s on two A100s. (3/4)
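A back-of-envelope check on that 0.4s figure (my assumptions, not the thread's): int8 decoding is memory-bandwidth-bound, so per-token latency is roughly the weight bytes each GPU reads divided by its HBM bandwidth:

```python
# Rough decode-latency estimate for int8 Llama 70B on 2x A100 (80GB).
# All numbers are assumptions for illustration.
params = 70e9
bytes_per_param = 1          # int8 quantization
n_gpus = 2                   # 2-way tensor parallelism
hbm_bw = 2.0e12              # ~2 TB/s per A100 80GB

weight_bytes_per_gpu = params * bytes_per_param / n_gpus
per_token_s = weight_bytes_per_gpu / hbm_bw   # ~0.0175 s/token
first_20_tokens_s = 20 * per_token_s          # ~0.35 s

print(f"{per_token_s*1e3:.1f} ms/token, {first_20_tokens_s:.2f} s for 20 tokens")
# ~17.5 ms/token -> ~0.35 s, leaving margin for the final prefill within 0.4 s
```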
The last step is getting a text-to-speech model to start talking within 100ms.

Again, this feels quite achievable if you host the model yourself and stream the text into its KV cache.

But I'm less confident about the state of open-source text-to-speech models. (4/4)
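For what it's worth, the plumbing side of streaming TTS is simple to sketch; `synthesize` and `play` below are hypothetical stand-ins for whatever TTS and audio stack is available:

```python
# Sketch: pipeline the LLM token stream into TTS at clause boundaries so
# audio starts playing before generation finishes. `synthesize` and `play`
# are hypothetical stand-ins, not a real library API.
BOUNDARY = {".", ",", "?", "!", ";"}

def speak_streaming(token_stream, synthesize, play):
    buf = []
    for tok in token_stream:
        buf.append(tok)
        # Flush on clause boundaries: long enough for natural prosody,
        # short enough that the first audio lands soon after the first
        # clause is generated.
        if tok.strip() and tok.strip()[-1] in BOUNDARY:
            play(synthesize("".join(buf)))
            buf = []
    if buf:  # flush whatever trails the last boundary
        play(synthesize("".join(buf)))
```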

More from @amanrsanger

May 18
PaLM 2 has been leaked to be 340B params, trained on 3.6T tokens (~7.4e24 FLOPs by the 6ND rule).

Someone out there could feasibly reproduce a similar quality model...

for under $6M!

But that price tag largely depends on H100s...

[1/6]
First, let's see what happens when we try to use A100s.

At this model scale, A100s can hit 60%+ of their max flop count (312 TFLOPs).

So, running on A100s takes 7.4e24 / (0.6 × 312e12 FLOP/s) ≈ 4.0e10 GPU-seconds ≈ 11M A100-hours.

Given (dirt-cheap) $1/hr A100 pricing, that's $11M.

[2/6]
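The same arithmetic as a quick script (the 60% utilization and $1/hr are the thread's own assumptions):

```python
# Reproduce the A100 cost estimate.
flops_needed = 7.4e24            # ~6 * 340e9 params * 3.6e12 tokens
a100_peak = 312e12               # bf16 peak, FLOP/s
mfu = 0.60                       # assumed utilization at this scale

gpu_seconds = flops_needed / (a100_peak * mfu)
gpu_hours = gpu_seconds / 3600   # ~1.1e7
cost = gpu_hours * 1.0           # at $1/hr
print(f"{gpu_hours/1e6:.0f}M A100-hours ≈ ${cost/1e6:.0f}M")
```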
H100s have a max flop count of 2000 TFLOPs (at fp8).

But we probably won't get near 60% of max FLOPs, because of memory bandwidth.

Memory bandwidth increases from 1.5 TB/s on A100s to 3.35 TB/s on H100s.

Since fp8 weights take half the memory of bf16, that's an effective 4.5x increase in memory bandwidth...

[3/6]
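Numerically (my calculation, following the thread's figures), compute grows faster than effective bandwidth, which is why fp8 utilization should land well below the A100's 60%:

```python
# Why utilization drops: compute scales faster than effective bandwidth.
a100_flops, h100_flops = 312e12, 2000e12         # bf16 vs fp8 peak, FLOP/s
a100_bw, h100_bw = 1.5e12, 3.35e12               # HBM bandwidth, B/s
fp8_factor = 2                                   # fp8 halves bytes moved vs bf16

flops_ratio = h100_flops / a100_flops            # ~6.4x
eff_bw_ratio = (h100_bw / a100_bw) * fp8_factor  # ~4.5x
print(f"compute x{flops_ratio:.1f} vs effective bandwidth x{eff_bw_ratio:.1f}")
# Bandwidth grows slower than compute, so fp8 MFU lands below 60%.
```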
May 12
Llama and many recent open-source models have a significant architectural limitation

They use multi-head attention instead of multi-query attention (which is used by PaLM and probably Claude 100K)

This can result in slowdowns of up to 30x

Here's the math behind why. (1/n)
Let's consider the widely used 7B-param Llama architecture.

It has 32 layers, 32 heads, and d_k, d_v sizes of 128

The key issue with multi-head attention is the cost of repeatedly accessing the previously computed attention keys and values during inference.

why? (2/n)
How does multi-head attention work when generating the Nth token?

To generate each token, we calculate a query for each head. Then we look at the keys and values for previous tokens and apply the op:

softmax(q^T K / sqrt(d_k)) V^T

where, per head, q has shape (d_k, 1), K has shape (d_k, N), and V has shape (d_v, N), so the output has shape (1, d_v)

(3/12)
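To put numbers on that (my arithmetic, not from the thread), here is the per-token KV-cache footprint for this 7B config under MHA vs. MQA, in fp16:

```python
# Per-token KV-cache footprint for the 7B Llama config, MHA vs MQA.
layers, heads, d_head = 32, 32, 128
bytes_fp16 = 2

# Multi-head: every layer stores K and V for all 32 heads.
mha = layers * heads * 2 * d_head * bytes_fp16   # 524,288 B ≈ 0.5 MB/token
# Multi-query: one shared K/V head per layer.
mqa = layers * 1 * 2 * d_head * bytes_fp16       # 16,384 B = 16 KB/token

print(f"MHA: {mha/1e6:.2f} MB/token, MQA: {mqa/1e3:.0f} KB/token, "
      f"ratio {mha/mqa:.0f}x")
# 32x less cache traffic per decoded token, which is where slowdowns
# approaching the quoted 30x come from.
```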
May 11
The size of all code/history in GitHub public repos is 92TB.

The size of Google's monorepo in 2015 was 86TB (of much higher quality code)

If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else.
To be fair, they would have to use the diff/history data much more than the raw code.

There is almost certainly more code at GitHub HEAD, but Google may benefit from a richer commit history.

And I'd suspect the size of their monorepo has substantially increased since then.
Furthermore, these high-quality code tokens aren't just good for code; they're incredibly useful for general language modeling performance.

May 10
PaLM 2 just dropped, and there are claims that the largest model is just 14.7B params.

In reality, the model is probably closer to 100B parameters

But why... (1/5)
The 14.7B param number comes from the scaling experiments they ran. The largest model in these experiments was 14.7B params

A (very rough) approximation of the curves from the paper gives something like:

n_params = sqrt(FLOPs / 100)

(2/5)
We also know PaLM 2 used more FLOPs than PaLM, which required 2.52e24 FLOPs [1]

For that number of FLOPs, the curve's optimal param count is ~160B

(3/5)
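Sanity-checking that with the fitted curve (my arithmetic):

```python
import math

# Plug PaLM 1's training compute into n_params = sqrt(FLOPs / 100).
palm1_flops = 2.52e24
n_opt = math.sqrt(palm1_flops / 100)   # ~1.59e11
print(f"{n_opt/1e9:.0f}B params")      # ~159B ≈ the 160B quoted
```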
Mar 16
Want to code using GPT-4? We made an IDE built for programming alongside it

Try out the public beta here: cursor.so
We're also thrilled to announce our partnership with @openai through their early startup program!

After partnering with them, we were given early access to GPT-4.

Since then, we've worked on designing an interface that captures how powerful these models truly are.
And now that GPT-4 is publicly available, we have a flurry of features we'll be shipping soon.

It completely changes the game on what is possible with AI-assisted programming.
Mar 1
There are times and places for training your own models... but with the release of OpenAI's ChatGPT API, coding is looking less and less like one of them.

The HumanEval pass@1 rate of ChatGPT is as good as the best open-source model's pass@100 rate.

And this is still just GPT-3.5...
Not only that, but it has 10x better pricing than text-davinci, and far lower latency.

After seeing this news today, I really would not want to be one of OpenAI's competitors
For those unfamiliar with pass@k: this means that if I took the best open-source code model (CodeGen 16B) and sampled 100 generations, the probability that at least 1 of those 100 is correct is the same as the probability that ChatGPT gets it right on the first try.
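For reference, here is the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); the sample counts in the usage line are hypothetical:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one
    of k samples is correct, given c of n total samples passed."""
    if n - c < k:
        return 1.0  # too few failures to draw k samples with no success
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. (hypothetical numbers) if 5 of 200 samples pass, pass@100 ≈ 0.97:
print(round(pass_at_k(200, 5, 100), 3))
```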
