Aman Sanger
building @cursor_ai at @anysphere https://t.co/EdcQJ2dv0J | https://t.co/vJ5zNuT6WO
Mar 26 11 tweets 2 min read
Long context models with massive custom prompts (~2M tokens) may soon replace fine-tuning for new knowledge!

Let’s explore why:
(1/10) The fine-tuning we care about is learning new, useful information not in pretraining.

For example: a company’s codebase or internal documentation.
(2/10)
Mar 5 13 tweets 3 min read
With a 256K token prompt, a 7b model can generate tokens as quickly as codellama-7b with an 8K prompt.

How? The model must use multi-query attention.

Here's why...
(1/10) For large context windows and large-batch inference, generation speed is bottlenecked by the size of the KV cache.

To illustrate this, let’s look at our 8K context window with multi-head attention at a batch size of 16.
(2/10)
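As a rough sketch of why those two settings end up with the same KV-cache footprint (assuming fp16 cache entries and codellama-7b-like dimensions: 32 layers, head size 128, 32 KV heads for multi-head vs. 1 shared KV head for multi-query; the exact numbers are illustrative):

```python
# Back-of-the-envelope KV-cache sizing under the assumptions stated above.
def kv_cache_gib(seq_len, n_kv_heads, batch=16, n_layers=32, d_head=128, bytes_per=2):
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per * seq_len * batch / 2**30

print(kv_cache_gib(8_192,   n_kv_heads=32))  # multi-head,  8K ctx, batch 16: ~64 GiB
print(kv_cache_gib(262_144, n_kv_heads=1))   # multi-query, 256K ctx, batch 16: ~64 GiB
```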
Jan 12 13 tweets 3 min read
One magical part of Cursor’s internal tech stack is a prompt compilation library called priompt (github.com/anysphere/prio…)

Here's why it works so well... (1/12)
Standard prompting libraries use variants of “f-strings” with subbed-in inputs.

For us, a prompt is defined as a function that maps some set of inputs X and a token budget n to some string, s:

p(X, n) = s

We call this operation "rendering"

(2/12)
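A toy rendering function, just to make p(X, n) = s concrete. This is not priompt's actual API (priompt is a TypeScript/JSX library); the piece names and the 4-chars-per-token estimate below are made up for illustration:

```python
# Toy "rendering": map inputs X and a token budget n to a prompt string s.
def render(X: dict, n: int) -> str:
    def approx_tokens(s: str) -> int:
        return len(s) // 4  # crude token estimate, illustration only

    pieces = [
        "You are a coding assistant.\n",
        f"Question: {X['question']}\n",
        f"Relevant code:\n{X['code']}\n",
    ]
    out, used = [], 0
    for text in pieces:  # earlier pieces are treated as higher priority
        if used + approx_tokens(text) <= n:
            out.append(text)
            used += approx_tokens(text)
    return "".join(out)

s = render({"question": "Where is auth handled?", "code": "def login(): ..."}, n=256)
```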
Dec 5, 2023 11 tweets 2 min read
At Cursor, we've built very high-quality retrieval datasets (for training embeddings/rerankers).

To do this, we use GPT-4 grading and the TrueSkill rating system (a better version of Elo)

Here’s how... (1/10) We start with an out-of-the-box dataset of coding queries and get the top 100 ada embedding results in their repositories.

But we need much better ground truth labels than cosine similarity.
(2/10)
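A sketch of how GPT-4 grading and TrueSkill can be combined (my assumptions, not Cursor's exact pipeline): the GPT-4 judge is stubbed out, the match count and scoring are illustrative, and the open-source trueskill package turns pairwise preferences into ratings.

```python
import itertools, random
import trueskill  # pip install trueskill

def gpt4_prefers(query: str, chunk_a: str, chunk_b: str) -> bool:
    """Placeholder for a GPT-4 grading call: True if chunk_a is more relevant."""
    raise NotImplementedError

def rank_candidates(query: str, chunks: list[str], n_matches: int = 200) -> list[str]:
    ratings = {c: trueskill.Rating() for c in chunks}
    pairs = list(itertools.combinations(chunks, 2))
    for a, b in random.sample(pairs, min(n_matches, len(pairs))):
        if gpt4_prefers(query, a, b):
            ratings[a], ratings[b] = trueskill.rate_1vs1(ratings[a], ratings[b])
        else:
            ratings[b], ratings[a] = trueskill.rate_1vs1(ratings[b], ratings[a])
    # Conservative score: mean minus three standard deviations.
    return sorted(chunks, key=lambda c: ratings[c].mu - 3 * ratings[c].sigma, reverse=True)
```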
Dec 2, 2023 11 tweets 3 min read
After switching our vector db to @turbopuffer, we're saving an order of magnitude in costs and dealing with far less complexity!

Here's why...
(1/10)
We've seen two key advantages of Turbopuffer, with no performance degradation:

1. Normal vector database pricing makes no sense for our workloads (lots of moderate-sized indices).
2. The normal “pods” or cluster-based indices (of Pinecone for example) add unnecessary complexity

(2/10)
Nov 28, 2023 8 tweets 2 min read
People claim LLM knowledge distillation is trivial with logprobs, but that's not quite right...

It's very tricky to distill between different tokenizers. [1]

Internally, we've solved this with a clever algorithm we call tokenization transfer
(1/7) To start, we needed to build a sophisticated primitive called the "Logmass Trie"

It's an extended trie where each edge carries not only a character but also a weight representing the log probability of that character conditional on the string so far
(2/7)
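The thread doesn't spell out the algorithm, but a minimal sketch of the data structure as described -- a trie whose edges carry a character plus a log-probability weight -- might look like the following. The insertion/merging policy is my guess, not Cursor's:

```python
import math

class LogmassTrie:
    """Trie where each edge stores a character plus the log-probability of that
    character conditional on the prefix so far (a sketch of the structure only;
    the tokenization-transfer algorithm built on top of it is not public)."""

    def __init__(self):
        self.children = {}      # char -> child LogmassTrie
        self.edge_logprob = {}  # char -> log P(char | prefix so far)

    def add(self, string: str, char_logprobs: list[float]) -> None:
        node = self
        for ch, lp in zip(string, char_logprobs):
            if ch not in node.children:
                node.children[ch] = LogmassTrie()
                node.edge_logprob[ch] = lp
            else:
                # Merge mass contributed by multiple strings sharing this prefix.
                node.edge_logprob[ch] = math.log(
                    math.exp(node.edge_logprob[ch]) + math.exp(lp)
                )
            node = node.children[ch]

    def prefix_logprob(self, prefix: str) -> float:
        """Log-probability of a character prefix, via the chain rule along the edges."""
        node, total = self, 0.0
        for ch in prefix:
            if ch not in node.children:
                return float("-inf")
            total += node.edge_logprob[ch]
            node = node.children[ch]
        return total
```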
Nov 27, 2023 7 tweets 1 min read
Though @cursor_ai is powered by standard retrieval pipelines today, we've been working on something much better called:

Deep Context

After @walden_yan built an early version for our vscode fork, Q&A accuracy skyrocketed.

Soon, we're bringing this to everyone (1/6) The fundamental issue with RAG is that it encourages shallow answers

Even excellent engineers would struggle to answer a question with only RAG-like context about a codebase they've never seen. (2/6)
Nov 26, 2023 12 tweets 2 min read
At @cursor_ai, we’ve scaled throughput on GPT-4 to 2-3x over baseline without access to knobs in OpenAI’s dedicated instances [1]

We did this by reverse-engineering expected GPT-4 latency and memory usage from first principles.

Here’s how... (1/10) First, all of this is only possible with OpenAI's dedicated capacity. For large enough orgs with high usage, this is a no-brainer for cost reasons.

Dedicated capacity lets you commit to some usage for an extended period of time for reduced pricing.
(2/10)
Nov 23, 2023 10 tweets 3 min read
There are some interesting optimizations to consider when running retrieval at scale (in @cursor_ai's case, hundreds of thousands of codebases)

For example, reranking 500K tokens per query

With blob-storage KV-caching and pipelining, it's possible to make this 20x cheaper (1/8) By default, we use a fine-tuned 7B-param codellama reranker.

With $3/hr A100s, reranking 500K tokens takes 500K * (7e9 * 2) = 7e15 FLOPs.

Assuming 8-bit quantization, we can sustain at least 300 TFLOP/s, which works out to about $0.019/query.

This is infeasibly expensive to run at scale (2/8)
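Spelling out that arithmetic (numbers taken from the thread; ~2 FLOPs per parameter per token is the standard forward-pass estimate):

```python
tokens         = 500_000          # tokens reranked per query
params         = 7e9              # reranker size
flops_per_tok  = 2 * params       # ~2 FLOPs per parameter per token (forward pass)
throughput     = 300e12           # assumed achievable FLOP/s with 8-bit quantization
gpu_cost_per_s = 3.0 / 3600       # $3/hr A100

total_flops = tokens * flops_per_tok              # 7e15 FLOPs
seconds     = total_flops / throughput            # ~23.3 GPU-seconds per query
print(f"${seconds * gpu_cost_per_s:.3f} per query")  # ~$0.019
```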
Aug 12, 2023 4 tweets 1 min read
Sub-600ms-latency conversational speech AI is completely possible today; I'm surprised I haven't seen anyone do this.

The key is hosting a model (like llama), streaming from whisper, and every few tokens, prefilling more of the KV cache - without evicting it from memory (1/4) Whisper could transcribe the remainder of the text within ~100ms of the user finishing speaking.

This means the time to first voice response is roughly the time to process the last several tokens of input, generate the first 20 tokens, and pass those into a text-to-speech model. (2/4)
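A rough sketch of the incremental-prefill idea using the Hugging Face transformers cache API (the model name and the chunking are illustrative assumptions; the thread doesn't specify a stack):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice -- any hosted llama-style model would do.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

past = None  # KV cache, kept resident on the GPU between chunks (never evicted)

def prefill(text_chunk: str) -> None:
    """Run a freshly transcribed chunk through the model, extending the KV cache."""
    global past
    ids = tok(text_chunk, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    with torch.no_grad():
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
    past = out.past_key_values

# As Whisper streams partial transcripts, prefill them a few tokens at a time.
for chunk in ["Hey, can you ", "summarize this ", "meeting for me?"]:
    prefill(chunk)
# When the user stops speaking, only the last few input tokens remain to be
# processed before the first response tokens can be generated and sent to TTS.
```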
May 18, 2023 6 tweets 2 min read
PaLM 2 has been leaked to be 340B params and trained on 3.6T tokens (7.4e24 FLOPs).

Someone out there could feasibly reproduce a similar quality model...

for under $6M!

But that price tag largely depends on H100s...

[1/6] First, let's see what happens when we try to use A100s.

At this model scale, A100s can hit 60%+ of their peak throughput (312 TFLOP/s).

So, running on A100s would cost 7.4e24 / (312e12 * 0.6) / (3600 s/hr) ≈ 11M A100-hours.

Given (dirt-cheap) $1/hr A100 pricing, that's $11M

[2/6]
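The same arithmetic in a few lines (the 60% utilization and $1/hr pricing are the thread's own assumptions):

```python
flops       = 7.4e24            # ~6 * 340e9 params * 3.6e12 tokens
a100_peak   = 312e12            # A100 fp16/bf16 dense peak FLOP/s
utilization = 0.6               # assumed achievable utilization at this scale

gpu_hours = flops / (a100_peak * utilization) / 3600
print(f"{gpu_hours/1e6:.0f}M A100-hours, ~${gpu_hours * 1.0 / 1e6:.0f}M at $1/hr")  # ~11M hours, ~$11M
```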
May 12, 2023 12 tweets 4 min read
Llama and many recent open-source models have a significant architectural limitation

They use multi-head attention instead of multi-query attention (which is used by PaLM and probably Claude 100K)

This can result in slowdowns of up to 30x

Here's the math behind why (1/n) Let's consider the widely used 7B-param llama architecture.

It has 32 layers, 32 heads, and d_k, d_v sizes of 128

The key issue with multi-head attention is the cost of repeatedly accessing the previously computed attention keys and values during inference.

why? (2/n)
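To make the "repeated access" cost concrete, here is a rough estimate of the KV-cache bytes that must be read from memory for every generated token, using the llama-7b dimensions above (fp16 entries assumed; the 8K context is just an example):

```python
# Bytes of KV cache read per decoding step: the whole cache is re-read each token.
def kv_read_gb_per_token(ctx_len, n_kv_heads, n_layers=32, d_head=128, bytes_per=2):
    return ctx_len * 2 * n_layers * n_kv_heads * d_head * bytes_per / 1e9

mha = kv_read_gb_per_token(8_192, n_kv_heads=32)  # multi-head:  ~4.3 GB per token
mqa = kv_read_gb_per_token(8_192, n_kv_heads=1)   # multi-query: ~0.13 GB per token
print(mha, mqa, mha / mqa)  # MQA cuts KV memory traffic by 32x (= number of heads)
```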
May 11, 2023 5 tweets 2 min read
The size of all code/history on GitHub public repos is 92TB

The size of Google's monorepo in 2015 was 86TB (of much higher quality code)

If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else. To be fair, they would have to use the diff/history data much more than the raw code.

There is almost certainly more code at GitHub HEAD, but Google may benefit from a richer commit history

And I'd suspect the size of their monorepo has substantially increased since then
May 10, 2023 6 tweets 2 min read
PaLM 2 just dropped, and there are claims that the largest model is just 14.7B params.

In reality, the model is probably closer to 100B parameters

But why... (1/5) The 14.7B param number comes from the scaling experiments they ran. The largest model in these experiments was 14.7B params

A (very rough) approximation of the curves from the paper gives something like:

n_params = sqrt(FLOPs/100)

(2/5)
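Plugging numbers into that approximation (the ~1e24 FLOP training budget below is an illustrative assumption, not a figure from the thread):

```python
import math

def n_params(flops):            # rough fit to the scaling curves: n = sqrt(FLOPs / 100)
    return math.sqrt(flops / 100)

def flops_needed(n):            # inverse of the same approximation
    return 100 * n ** 2

print(n_params(1e24) / 1e9)     # ~100B params for an assumed ~1e24 FLOP budget
print(flops_needed(14.7e9))     # ~2.2e22 FLOPs -- all that 14.7B params would imply
```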
Mar 16, 2023 4 tweets 2 min read
Want to code using GPT-4? We made an IDE built for programming alongside it

Try out the public beta here: cursor.so We're also thrilled to announce our partnership with @openai through their early startup program!

After partnering with them, we were given early access to GPT-4.

Since then, we've worked on designing an interface that captures how powerful these models truly are.
Mar 1, 2023 4 tweets 1 min read
There are times and places for training your own models... With the release of OpenAI's ChatGPT API, coding is looking less like one of them.

The HumanEval pass@1 rate of ChatGPT is as good as the best open-source model's pass@100 rate.

And this is still just GPT-3.5... Not only that, but it has 10x better pricing than text-davinci, and far lower latency.

After seeing this news today, I really would not want to be one of OpenAI's competitors
Jan 18, 2023 7 tweets 3 min read
Introducing Cursor!! (cursor.so)

A brand new IDE built from the ground up with LLMs. Watch us use Cursor to ship new features blazingly fast. Just press Cmd+K to instruct the IDE. Enter anything and watch Cursor do its magic
Nov 11, 2022 4 tweets 1 min read
My bet is that in the long run, reading and writing to external memory is key for much more capable models that can continually learn.

Someone will make the Neural Turing Machine work with a transformer backbone. (1/4) Retrieval Transformers (RETRO) are a first step toward this, but they feel more like giving an LM access to a search API over some controllable dataset.

They are read-only, require massive data banks that need to be manually updated, and are limited by the frozen BERT encoder. (2/4)
Nov 4, 2022 8 tweets 3 min read
For those using open-source models like CodeGen instead of OpenAI's Codex, I have some bad news about its "comparable" performance.

It isn’t even close anymore. code-davinci 1-shot is competitive with CodeGen 10-shot. (1/7)

(bottom results computed by me with the OpenAI API) The earlier version of Codex was only fine-tuned on 159GB of Python code (probably just 50B tokens).

Though CodeGen was pretrained on BigQuery, it was fine-tuned on just BigPython - a 217GB dataset of permissively licensed GitHub code. (2/7)
Oct 14, 2022 5 tweets 1 min read
The best bets to make on short timelines (assuming no x-risk) are not necessarily through shares of tech companies (1/5) The government will not allow companies to have monopolies on intelligence. A Google will not 10x its profits without government intervention/regulation. And bets on individual companies (other than Nvidia imo) can be quite risky. (2/5)
May 17, 2022 13 tweets 3 min read
The age of pure software is over.

As predicted, software has eaten the world. But AI will subsume it.

The next wave of generational companies will be pure AI-shops (0/n) Tech falls into 4 eras: mainframes, chips, personal computers, software (pre-web, Web1/2, SaaS)

In each era, there were tailwinds for company building. Later, these tailwinds became headwinds - you were competing against well-established incumbents in saturated markets. (1/n)