Mark Tenenholtz
Apr 27 · 23 tweets · 6 min read
I built a ChatGPT app that lets you chat with any codebase!

99% of projects just copy/paste Langchain tutorials. This goes well beyond that.

Here's how I built it:
I built it to work with the Twitter codebase, but it's effortless to swap in any other repository.

The core loop:

1. Embedding the code
2. Query + prompt the model
3. Pull in relevant context
4. Repeat steps 2+3

That's all created with a Pinecone vector DB and a FastAPI backend.

Let's start!
1. Embedding the code

It's 82 LOC, but it could probably be ~50-60.

In short, the process is:

1. Pull the codebase from GitHub
2. Iterate over the files, ignoring images, etc.
3. Split each file into chunks
4. Embed each chunk

Pretty standard, but I added a couple of twists.
Rather than storing the code on disk, I just loaded it into memory and looped over the zip archive.

I also added a loading bar w/ token counting for cost estimation.

You could comment out the embedding if you just wanted the estimate.
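
As a sketch of that loop, here's roughly what it could look like. The repo URL, skip list, and names here are my guesses, not the exact ones from the project:

```python
import io
import zipfile

import requests
import tiktoken
from tqdm import tqdm

# Guess at the Twitter repo zip -- swap in any codebase you want to chat with.
ZIP_URL = "https://github.com/twitter/the-algorithm/archive/refs/heads/main.zip"
SKIP_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg", ".lock")

# Load the archive straight into memory rather than writing it to disk.
archive = zipfile.ZipFile(io.BytesIO(requests.get(ZIP_URL).content))

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by ada-002
total_tokens = 0

for name in tqdm(archive.namelist()):
    if name.endswith("/") or name.lower().endswith(SKIP_SUFFIXES):
        continue  # skip directories, images, and other non-code files
    text = archive.read(name).decode("utf-8", errors="ignore")
    total_tokens += len(enc.encode(text, disallowed_special=()))
    # ...split and embed `text` here; comment that out if you only want the estimate

# Multiply by the current ada-002 price per token for a cost estimate.
print(f"~{total_tokens:,} tokens to embed")
```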

Now we need to split the files.
99% of tutorials just split on characters.

But, code is very structured and newline tokens are very useful to split on.

Langchain's RecursiveCharacterTextSplitter helps us with that.
It prefers to split on double newlines first, then single newlines, and only breaks up words as a last resort.

That means that functions, variable groups, imports, and other "like" code usually stay together nicely without unnecessary breaks.
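
In case it helps, here's roughly what that splitter setup looks like. The chunk sizes are placeholders, not the values from my repo:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk sizes are guesses -- tune them for your embedding model and context budget.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # blank lines first, then newlines, then words
    chunk_size=1000,
    chunk_overlap=100,
)

chunks = splitter.split_text(file_text)  # file_text: one source file read as a string
```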

Now we need to embed the chunks.
I'm using OpenAI Ada embeddings.

There are *huge* tradeoffs with this.

Using an API is very advantageous because we don't have to manage/pay for the GPUs to run our own embeddings.

Even better, we don't have to manage a GPU server to embed queries using the same model.

But...
Ada embeddings are notably inferior to purpose-built, encoder-only or encoder-decoder models.

Models like ColBERT and hybrid methods are the state-of-the-art (check out this thread for more )

You have to decide what makes more sense for your application.
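
For reference, the Ada call itself is trivial. A minimal sketch using the pre-1.0 openai Python client (the one current when this thread was written):

```python
import openai


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # One API call can embed a whole batch; results come back in input order.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
    return [item["embedding"] for item in resp["data"]]
```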
Finally, we throw those embeddings in a Pinecone index.

Some thoughts here:

1. You don't always need a vector DB. I just felt like using one here. np.array could be sufficient for you.

2. I chose to store metadata + full text in the vector DB. This is probably a bad idea. Instead, larger apps should store a UUID as a primary key and use that to query a traditional DB. Storage is precious! (There's a rough upsert sketch after these notes.)

3. I chose Pinecone because they have a free tier. Chroma (for local embeddings) or something else like DeepLake also works.
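
To make the upsert concrete, here's a rough sketch using the 2.x pinecone client. The key, environment, index name, and helper are placeholders, and it follows the "store full text as metadata" approach described in point 2:

```python
import pinecone

# Placeholders: your API key, environment, and index name will differ.
pinecone.init(api_key="YOUR_KEY", environment="us-east-1-aws")
index = pinecone.Index("chat-twitter")  # assumes a 1536-dim index already exists


def upsert_file(path: str, chunks: list[str], embeddings: list[list[float]]):
    # Store the file path and raw text as metadata so snippets can be cited later.
    vectors = [
        (f"{path}-{i}", emb, {"file_path": path, "text": chunk})
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
    index.upsert(vectors=vectors)
```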

Now to #2: building the prompt.
First, we need to build a system message.

(full prompt in the code at the end)

This will guide the LLM and provide it with the most relevant context for the initial query.

We want to make sure that it doesn't use outside knowledge and that it properly formats any code.
But we don't just want to slap the most similar code onto the question and call it a day.

We want to make sure that we tell ChatGPT where every code snippet comes from.

That lets the model reference its sources later on.
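
Roughly, building that system message looks like the sketch below. The prompt wording is paraphrased, not the exact text from the repo, and the helper name is mine:

```python
# Paraphrased -- the exact system prompt lives in the repo.
SYSTEM_TEMPLATE = """You are an expert developer answering questions about the codebase below.
Only use the provided code snippets; do not rely on outside knowledge.
Format any code in your answer as markdown code blocks.

{context}"""


def build_system_message(matches: list[dict]) -> dict:
    # Prefix each snippet with its file path so the model can cite its sources.
    context = "\n\n".join(
        f"File: {m['metadata']['file_path']}\n{m['metadata']['text']}"
        for m in matches
    )
    return {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)}
```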
By default, I have it pulling 10 relevant code snippets to build the system message. That'll work well for the first message.

But remember, this is a chat application.

If a user asks a different follow-up question, we'll have to add more context (we'll see that later).
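
The retrieval step itself is small. Something like this, reusing the `index` and embedding call from the earlier sketches (the helper name is mine):

```python
def retrieve_snippets(query: str, top_k: int = 10) -> list[dict]:
    # Embed the query with the same model used for the code chunks,
    # then pull the most similar chunks (plus metadata) back from Pinecone.
    query_emb = openai.Embedding.create(
        model="text-embedding-ada-002", input=[query]
    )["data"][0]["embedding"]
    return index.query(vector=query_emb, top_k=top_k, include_metadata=True)["matches"]
```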
Great, we've done all 3 major steps!

Now we need to orchestrate it into a chatting function.

Here's the code I used:
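
Here's a heavily stripped-down sketch of the endpoint's shape. The route name, request model, and `build_prompt` helper are stand-ins for what the actual repo does:

```python
import openai
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    messages: list[dict]  # running chat history sent up by the frontend


@app.post("/chat")
async def chat(req: ChatRequest):
    # build_prompt stands in for the retrieval + history-trimming logic in this thread.
    messages = build_prompt(req.messages)

    async def token_stream():
        resp = await openai.ChatCompletion.acreate(
            model="gpt-3.5-turbo", messages=messages, stream=True
        )
        async for chunk in resp:
            yield chunk["choices"][0]["delta"].get("content", "")

    return StreamingResponse(token_stream(), media_type="text/plain")
```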
The full version looks like a lot, but only because of the complexity of building an asynchronous streaming endpoint.

(I wrote a whole thread on that here: )

Let's go over just the important parts for prompting ChatGPT.
First, I decided to keep as many of the human questions as possible.

If we're just seeing the first question, then we only need the initial question from the user.

If not, we need to keep the system message and up to 750 tokens of user questions (token limits!).

But why?
Well, we want to make sure that, if the conversation changes direction, we keep adding relevant code snippets to it.

Adding the previous queries helps by giving the latest question more context and meaning.

This is a design decision, but it helped!
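
In code, that contextualized retrieval query might look roughly like this. The helper names are mine, and the 750-token cap comes from the thread above:

```python
import tiktoken


def build_retrieval_query(messages: list[dict], max_tokens: int = 750) -> str:
    # Walk the history newest-first, keeping user questions until the token
    # budget runs out, then restore chronological order for the query text.
    enc = tiktoken.get_encoding("cl100k_base")
    kept, used = [], 0
    for msg in reversed(messages):
        if msg["role"] != "user":
            continue
        n = len(enc.encode(msg["content"]))
        if used + n > max_tokens:
            break
        kept.append(msg["content"])
        used += n
    return "\n".join(reversed(kept))


# New snippets then come from retrieve_snippets(build_retrieval_query(history)).
```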
For clarity:

This just means that new code snippets are the ones that are most similar to the concatenation of all previous user queries, not just the most recent.

But, we won't necessarily use all of these messages for question answering.

That comes next!
Now, we need to decide what context to keep for question answering.

Here's the code that determines which messages to include in our chat history.

We need to:

1. Not go over the token limit
2. Avoid losing important context
First, we need to figure out how many tokens are in the two messages that we *must* keep: the system message (with our initial code snippets) and the latest user question (with added context).

After that, we want to add in messages in the middle one at a time.
Honestly, there are a bunch of ways of doing this.

I chose to keep as many of the most recent remaining messages as the token budget allows (leaving room for the system message, the query, and the newly added context). Our limit is ~4,000 tokens for gpt-3.5-turbo.

There are fancier ways to do this, though.
For instance, you could use embeddings to keep only the context most similar to the newest question.

This filtered chat history is what's actually fed into ChatGPT.
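
A rough reconstruction of that filtering is below. The exact budgeting in the repo differs, and the reply headroom is my own choice:

```python
import tiktoken

MAX_TOKENS = 4096 - 1024  # gpt-3.5-turbo context, minus headroom for the reply


def trim_history(system_msg: dict, history: list[dict], latest: dict) -> list[dict]:
    enc = tiktoken.get_encoding("cl100k_base")

    def count(m: dict) -> int:
        return len(enc.encode(m["content"]))

    # The system message and the latest (context-augmented) question must stay.
    budget = MAX_TOKENS - count(system_msg) - count(latest)

    kept = []
    for msg in reversed(history):  # walk backwards so the most recent messages win
        if count(msg) > budget:
            break
        kept.append(msg)
        budget -= count(msg)

    return [system_msg, *reversed(kept), latest]
```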

Finally, we call the OpenAI API, and we're off!
The whole process repeats, never going over the token limit + never losing the system message.

If you're curious to see the full code, it's right here (in the backend directory): github.com/mtenenholtz/ch…

It's also hosted here: chat-twitter.vercel.app

Let me know what you think!
