Imagine you have a ton of data, but most of it isn't labeled. Even worse: labeling is very expensive. 😑
How can we get past this problem?
Let's talk about a different—and pretty cool—way to train a machine learning model.
☕️👇
Let's say we want to classify videos in terms of maturity level. We have millions of them, but only a few have labels.
Labeling a video takes a long time (you have to watch it in full!). We also don't know how many labeled videos we need to build a good model.
[2 / 9]
In a traditional supervised approach, we don't have a choice: we need to spend the time and come up with a large dataset of labeled videos to train our model.
But this isn't always an option.
In some cases, this may be the end of the project. 😟
[3 / 9]
Here is a different approach: Active Learning.
Using Active Learning, the algorithm starts training with the labeled data it has and interactively asks for new labels as it needs them.
Active Learning is a semi-supervised learning method.
[4 / 9]
Here is the most important part of Active Learning:
The algorithm will look at all the unlabeled data and will pick the most informative examples. Then, it will ask humans to label those examples and use the answers as part of the training process.
[5 / 9]
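To make this concrete, here is a minimal sketch of a pool-based Active Learning loop. It assumes scikit-learn's LogisticRegression as the model; the dataset, the batch size, and the placeholder "human" labeler are all made up for illustration, not a prescribed setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(proba):
    # Uncertainty score: 1 minus the probability of the most likely class.
    return 1.0 - proba.max(axis=1)

# Hypothetical data: a tiny labeled seed set and a large unlabeled pool.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))
y_labeled = rng.integers(0, 2, size=20)
X_pool = rng.normal(size=(1000, 5))

BATCH_SIZE = 10  # how many examples we send to humans per round

for _ in range(5):
    model = LogisticRegression().fit(X_labeled, y_labeled)

    # Score every unlabeled example and pick the most uncertain batch.
    scores = least_confidence(model.predict_proba(X_pool))
    query_idx = np.argsort(scores)[-BATCH_SIZE:]

    # In a real system these examples go to human annotators;
    # random labels stand in for the humans here.
    new_labels = rng.integers(0, 2, size=BATCH_SIZE)

    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
```

Each round, the model retrains on everything labeled so far, so the examples it asks about keep shifting toward whatever it still finds confusing.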
Determining which examples are the most informative is the tricky part.
Worst case, we can select unlabeled examples at random, but that wouldn't be smart.
The better the selection process, the less labeled data you'll need to build a good model.
[6 / 9]
When selecting, we want the algorithm to pick the examples the model finds most challenging.
Here are some existing methods you can research further (a small sketch of the uncertainty scores follows the list):
• Uncertainty sampling: pick the examples the model is least confident about (least confidence, margin, or entropy scores).
• Query-by-committee: train several models and pick the examples they disagree on the most.
• Expected model change: pick the examples whose labels would change the model the most.
• Expected error reduction: pick the examples that would most reduce the model's expected generalization error.
• Density-weighted methods: favor examples that are both uncertain and representative of the underlying distribution.
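For intuition, here is a small sketch of the three classic uncertainty scores, computed from a model's predicted class probabilities. The example probabilities are invented; in practice they would come from something like predict_proba:

```python
import numpy as np

def least_confidence(proba):
    # High score = the model's top prediction is weak.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # High score = small gap between the top two classes.
    top_two = np.sort(proba, axis=1)[:, -2:]
    return 1.0 - (top_two[:, 1] - top_two[:, 0])

def entropy(proba):
    # High score = probability mass spread across many classes.
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Three videos, three maturity classes (made-up numbers).
proba = np.array([
    [0.98, 0.01, 0.01],  # confident -> low scores, no need to label
    [0.40, 0.35, 0.25],  # spread out -> high entropy, send to humans
    [0.50, 0.49, 0.01],  # tight race between two classes -> high margin score
])

for score in (least_confidence, margin, entropy):
    print(score.__name__, np.round(score(proba), 3))
```

All three agree on the easy case; they differ on which kind of "hard" they prioritize, which is why it's worth trying more than one.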