I used GPT-4-32K (+ other models) to analyze hundreds of files and explain how @Twitter's open-source algorithm works.
Now, I'm sharing the code I used, so you can do this on ANY GitHub repo!
Here's the AI's explanation, my approach, and the code for your own use:
Before I show you the AI's explanation, let me explain how it works:
First, I used this awesome repo (github.com/mpoon/gpt-repo…) to flatten the Twitter algorithm into a single text file.
Then, I uploaded that file to Colab, and split it up into hundreds of chunked strings, small enough for GPT-3.5-Turbo to process.
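The chunking step can be sketched like this (a minimal sketch, not my exact Colab code — I'm using a rough heuristic of ~4 characters per token; a tokenizer like tiktoken would be more precise):

```python
# Split a flattened repo dump into chunks small enough for GPT-3.5-Turbo.
# Rough heuristic: ~4 characters per token, so a ~3,000-token budget per
# chunk is about 12,000 characters (leaving headroom for prompt + reply).

def chunk_text(text: str, max_chars: int = 12_000) -> list[str]:
    """Split text into consecutive slices of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# e.g. repo_dump = open("twitter-algorithm.txt").read()
repo_dump = "x" * 30_000  # stand-in for the flattened repo text
chunks = chunk_text(repo_dump)
```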
Why GPT-3.5-Turbo?
Cost.
The Twitter algorithm is over 5 million tokens.
If I ran this step through GPT-4, I'd go broke.
Next, I used GPT-3.5-Turbo to summarize each chunk, keeping only the important details.
I combined those chunks into one long string, but it was still nearly half a million tokens long.
So, I repeated the process above, and broke that string up into chunks.
This time, the costs were manageable, so I made the chunks long enough to take advantage of GPT-4-32K's context window.
Then, I summarized the chunks using GPT-4-32K, and combined them into a string.
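That repeated chunk → summarize → combine loop is really one generic reduction. Here's a sketch (`summarize` is a placeholder for the actual OpenAI chat call; the chunk sizes and the toy summarizer are purely illustrative, not what I ran):

```python
from typing import Callable

def reduce_text(text: str,
                summarize: Callable[[str], str],
                max_chars: int,
                target_chars: int) -> str:
    """Repeatedly chunk and summarize text until it fits in target_chars.

    `summarize` stands in for a chat-completion call, e.g. asking
    GPT-3.5-Turbo (or GPT-4-32K on the later pass) to summarize a chunk,
    keeping only the important details.
    """
    while len(text) > target_chars:
        chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        text = "\n".join(summarize(c) for c in chunks)
    return text

# Toy summarizer for demonstration: keep the first 10% of each chunk.
toy = lambda chunk: chunk[: max(1, len(chunk) // 10)]
condensed = reduce_text("a" * 100_000, toy, max_chars=12_000, target_chars=4_000)
```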
Finally, I had a string that could be passed in its entirety to GPT-4-32K.
I did just that, and asked it for an explanation of how Twitter's algorithm works.
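That final call looks roughly like this (a sketch assuming the OpenAI Python client; the system prompt and helper names are illustrative, only the model name and question come from my actual run):

```python
def build_messages(summary: str) -> list[dict]:
    """Assemble the chat messages for the final explanation request."""
    return [
        {"role": "system", "content": "You are an expert code analyst."},
        {"role": "user",
         "content": "Explain how Twitter's algorithm works:\n\n" + summary},
    ]

def explain(summary: str) -> str:
    """One GPT-4-32K call over the fully condensed summary."""
    from openai import OpenAI  # third-party: pip install openai; needs OPENAI_API_KEY
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-32k",
        messages=build_messages(summary),
    )
    return resp.choices[0].message.content
```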
Here's what it gave me (note, Twitter is making me break this up into 280-character sections):
The Twitter algorithm connects multiple components and techniques to deliver a personalized user experience. Data processing and feature extraction lay the foundation by organizing various data types and identifying key attributes for content curation.
Machine learning models predict user engagement, while the real-time search engine ensures relevant and timely results. The recommendation system identifies and ranks potential content, and the Home Mixer integrates these components to create personalized home timelines.
Load testing and performance evaluation components ensure the system's efficiency and scalability. Through seamless collaboration, these elements work together to analyze vast amounts of data, curate relevant content, and present a tailored, engaging experience for Twitter users.
Here's a detailed yet concise overview of its workings:
1. **Data Processing**: The algorithm handles data types (tweets, user info, media) and events (client, social graph, user modification, email notification, ads callback engagements) using Kafka.
2. **Feature Extraction**: It extracts features (tweet content, user interests, engagement) from data to evaluate tweet visibility and tailor recommendations.
3. **Machine Learning Models**: Lightweight linear models and TensorFlow predict user engagement and rank tweets based on relevance, using data from external services like User Signal Service.
4. **Real-time Search**: A search engine with Lucene index and faceted search manages, searches, and updates index segments, optimizing search queries and results.
5. **Recommendation System**: A bipartite graph generates candidates, employing algorithms like GatingConfig and EarlybirdSimilarityEngine to filter and rank based on recency, popularity, and similarity.
6. **Home Mixer**: This framework fetches tweets from sources like Conversation Service, UTEG, FRS, processes them through transformers and filters, and ranks them for personalized home timelines.
7. **Load Testing & Performance Evaluation**: Components like EmbeddingSamplingJob, KnnTruthSetGenerator, and AnnLoadTestWorker ensure system performance and scalability, and test the Approximate Nearest Neighbor (ANN) query service.
8. **Configurations**: The algorithm provides settings for different environments, clusters, and Kafka configurations.
In summary, the Twitter algorithm uses diverse techniques like machine learning, real-time search, and recommendation systems to process and analyze data, delivering an engaging and tailored experience for each user.
As you can see, it's not perfect. Not even close.
I wrote this code quickly, with the help of GPT-4.
With a few small improvements, the results could be really, really great.
If someone from Twitter can weigh in on how the AI did, I'd be interested to hear... cc @elonmusk :)