applied ai consultant, creator of instructor. prev. staff mle @stitchfix, safety @meta, physics @uwaterloo. follow me for content on rag, consulting, and life
Oct 30, 2024 • 4 tweets • 2 min read
Imagine spending $10,000 in a single weekend on LLM testing.
That's exactly what happened to one of my clients when a junior engineer became overly enthusiastic with their evaluation suite. That moment served as a wake-up call.
The VP had been advocating for sophisticated LLM-based testing systems but was missing something crucial: sometimes the simplest solution is the best one. After my first week, we began to:
- Write tests that run in seconds, not hours
- Catch critical issues before they hit production
- Save thousands on evaluation costs
- Provide engineers with clear metrics for improvement
All of these changes significantly impacted how we build AI systems. In less than a month, we transitioned from burning cash on complex evaluations to having a lean, effective testing system that accurately predicts user satisfaction.
Here are the key metrics that can be implemented with simple code:
**Content Quality Checks**
- Response length (character count vs. target)
- Compression ratio (summary vs. original)
- Language consistency (detect language switches)
- Name/entity validation (against source text)
**Retrieval Quality**
- Precision at K (% of relevant chunks in top K results)
- Recall at K (% of important info retrieved)
- MRR (Mean Reciprocal Rank of the first relevant result)
**Performance Metrics**
- Response time
- Token usage
- API costs per request
- Error rates
Each of these can be implemented with basic Python code and executed in milliseconds.
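A minimal sketch of a few of these checks (the function names, thresholds, and the crude entity heuristic are illustrative, not from the original thread):

```python
import re

def length_check(response: str, max_chars: int = 1200) -> bool:
    """Yes/no check: is the response within the target character budget?"""
    return len(response) <= max_chars

def compression_ratio(summary: str, original: str) -> float:
    """Summary length relative to the source (lower means more compression)."""
    return len(summary) / max(len(original), 1)

def entities_grounded(summary: str, source: str) -> bool:
    """Crude name check: every capitalized word in the summary appears in the source."""
    names = re.findall(r"\b[A-Z][a-z]+\b", summary)
    return all(name in source for name in names)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    return sum(1 for chunk in top_k if chunk in relevant) / max(len(top_k), 1)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that show up in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result for a single query."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0
```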
**Connecting Metrics to Outcomes**
These technical metrics map to three levels of business impact:
1. **Algorithm Metrics (the ones above)**
- Run every morning
- Take seconds to execute
- Provide immediate feedback
2. **User Feedback**
- Thumbs up/down ratings
- Time spent reading
- Follow-up questions asked
- Features used (copy, share, edit)
3. **Business Outcomes**
- User retention
- Task completion rates
- Customer satisfaction scores
- Support ticket volume
This kind of test:
- Runs in milliseconds
- Provides clear yes/no results
- Can be automated
- Costs nothing to execute
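In practice that might look like a plain pytest file wired to the helpers sketched above (the `metrics` module name, example texts, and thresholds are all made up for illustration):

```python
# test_quality.py -- no API calls, clear pass/fail, free to run on every commit.
from metrics import compression_ratio, length_check, precision_at_k  # hypothetical module

def test_summary_is_short_enough():
    summary = "The Q3 report shows revenue up 12% year over year."
    assert length_check(summary, max_chars=300)

def test_summary_actually_compresses():
    original = "quarterly revenue details " * 200  # stand-in for a long source document
    summary = "A short recap of the quarterly revenue report."
    assert compression_ratio(summary, original) < 0.2

def test_retrieval_precision_floor():
    retrieved = ["doc-3", "doc-7", "doc-1", "doc-9", "doc-4"]
    relevant = {"doc-1", "doc-3", "doc-7"}
    assert precision_at_k(retrieved, relevant, k=5) >= 0.5
```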
Aug 27, 2024 • 5 tweets • 2 min read
Every mistake I see people make when consulting.
Tips on writing better proposals.
Aug 3, 2024 • 4 tweets • 1 min read
Do you want to feel better about charging more?
What should I write about in my “how to charge more” post?
I went from my full-time job paying $480k a year to freelancing, then consulting, then advising.
I started at $170/hr in March 2023; in April I made $130k in one month, and soon it will be $230k in one month. I've been incredibly uncomfortable with charging more and want to share how my mentality changed in the past year.
Open to questions
I want to write this because I think technical people are usually pretty allergic to sales, and to thinking about benefits and outcomes, and are ultimately way more exploited than they realize.
Jul 14, 2024 • 5 tweets • 2 min read
So, what is a system?
A system is a structured approach to solving problems or accomplishing tasks. It's a set of organized principles, methods, or procedures that guide how we think about and tackle challenges. In the context of AI and RAG applications, my system includes:
* A framework for evaluating different technologies and tools
* A decision-making process for prioritizing development efforts
* A methodology for diagnosing and improving application performance
* A set of standardized metrics and benchmarks for measuring success
The beauty of a system is that it provides consistency and repeatability. Instead of reinventing the wheel each time you face a similar problem, you have a trusted process to fall back on. This is especially valuable in the fast-paced, often uncertain world of AI development.
A good system doesn't constrain creativity or flexibility. Rather, it provides a foundation that allows you to be more efficient with routine tasks, freeing up mental energy for innovation and tackling unique challenges.
This is what I plan on teaching you in our course.
Rather than implementing whatever is the hot blog post of the day, I'm going to share what I've learned from building search systems at large companies like Meta and Stitchfix.
I'll share anecdotes from my consulting and draw parallels to classic search problems. maven.com/applied-llms/r…
Jul 2, 2024 • 5 tweets • 3 min read
You're working on a new AI-powered RAG application, but the process is hectic. There are many competing priorities, and not enough development time. Even if you had the time for everything, you're unsure how to improve the system. You know that somewhere in this chaotic mix is "the right path" - a sequence of actions that results in the most growth in the least amount of time. However, you're lucky if you're even going in the right direction, as each day of work feels like another roll of the dice.
To build with new AI systems, you obviously need technical skills - that's the baseline. But what separates the successful from the unsuccessful is not technical, but rather the frameworks for decision-making and resource allocation. Knowing what's worth working on, how to prioritize, what tradeoffs are worth making, what metrics to look at, and what to ignore, etc. tome.app/fivesixseven/a…
If you don't have these skills, your success entirely depends on someone above you having them and telling you exactly what to work on. You know you have to improve and make it better, but that doesn't give you a plan you can execute day-to-day. Avoid wasting engineering cycles, losing customers, or worse, never shipping.
Fortunately, these skills are not a magical trait that you either have or don't. They are a separate skill distinct from the technical skills needed for building with AI systems, and many never have the chance to learn them. But you can learn them, just like anything else. As someone who's been building recommendation systems and working with machine learning models for the past seven years, my goal is to give you the skills you need to succeed.
May 23, 2024 • 4 tweets • 2 min read
once a week i tell a founder: "stop trying to finetune models, and just go sell. use opus, use 4-turbo, raise prices, find value, go sell, and sell to rich people.
stop selling to developers, sell to capital allocators, not wage workers. make your roadster, get the money, and make the model 3 afterwards."
how am i signing a 6-figure contract in a month as a solo bootstrapped twitter influencer with a suspended business account and an open source library, and you are not!?
just promise that the thing they buy will give them status.
your note taking app is about "being a better executor"
your meetings app is about "being a better sales person"
your rag app is about "being a better decision maker"
your diligence ai agent is about "avoiding profit erosion"
Feb 17, 2024 • 4 tweets • 2 min read
Introducing Instructor Hub in 150 lines of python code
1. Uses raw.githubusercontent.com as the backend to get version control and a CDN (serverless lol)
2. Uses pytest-examples to lint and test every example (never merge bad code to the hub)
3. You can view cookbooks from the CLI and pull code directly to disk.
why?
This means that all the code you pull is linted and tested, and matches up 1:1 with the documentation; everything in the hub does.
Which means you own all of the code: no magic, just Python, Pydantic, and OpenAI.
There's lots to do, but none of it is needed yet. Once we get >30 items we'll implement search.
github.com/jxnl/instructo…
Feb 8, 2024 • 4 tweets • 1 min read
Guaranteed structured output with Ollama and Pydantic.
Check out the blog post to learn more about @pydantic and @ollama
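The gist of the pattern, as a hedged sketch: Ollama exposes an OpenAI-compatible endpoint, and instructor wraps the client so responses are validated against a Pydantic model. The exact calls, mode, and model name here are assumptions, not necessarily what the blog post uses.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    age: int

# Assumption: Ollama is running locally and serving its OpenAI-compatible API.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode rather than function calling for local models
)

character = client.chat.completions.create(
    model="llama2",  # illustrative: any model you've pulled into Ollama
    response_model=Character,  # instructor retries until the output validates
    messages=[{"role": "user", "content": "Tell me about Harry Potter."}],
)
print(character.model_dump())
```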
Trust me, I've trained models and deployed production applications that serve >350M requests a day.
You just need `pip install` and some good naming conventions:
1. jinja - prompting frameworks
2. numpy - vector search
3. sqlite - evals, one row per experiment
4. boto3 - data management, S3 and some folder structure
5. google sheets ;) - experiment tracking, with a link to the artifacts saved in S3/GCS.
Disagree?
I've been training models in @PyTorch and deploying them via @FastAPI since the library came out!
We did large image classification tasks where the folder structure reflected class labels, with a config.json in each directory.
Our early A/B tests exported to Google Sheets, and we served similar-item recommendations via numpy brute force: 3M SKUs with 40 dimensions per vector (UMAP over ResNet and matrix factorization machines).
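For a sense of scale: brute-force cosine similarity over a few million 40-dimensional vectors is a single matrix-vector product in numpy. A rough sketch (shapes and names are illustrative):

```python
import numpy as np

# ~3M SKU embeddings, 40 dims each (think UMAP over ResNet features).
skus = np.random.rand(3_000_000, 40).astype(np.float32)
skus /= np.linalg.norm(skus, axis=1, keepdims=True)  # normalize once up front

def similar_items(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k most similar SKUs by cosine similarity."""
    query = query / np.linalg.norm(query)
    scores = skus @ query                      # one dense matvec over all rows
    top_k = np.argpartition(-scores, k)[:k]    # partial selection, no full sort
    return top_k[np.argsort(-scores[top_k])]   # order the k winners by score
```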
0/ Any real AI engineer knows that streaming REALLY improves the UX.
Today, I'm landing a change that defines a reliable way to stream out multiple @pydantic objects from @OpenAI.
Take a look; by the end, you'll know how to do streaming extraction and why it matters.
1/ Streaming is critical when building applications where the UI is generated by the AI.
Notice in the screenshot that the first item was returned in 560ms but the last one in almost 2000ms! A 4x difference in time to first content.
How do we do this?
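A minimal sketch of the idea using today's instructor interface; the original change predates this API, so treat the exact calls and the model name as assumptions:

```python
from typing import Iterable

import instructor
from openai import OpenAI
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

# With stream=True and an Iterable response model, each object is yielded as
# soon as its JSON closes, so the UI can render the first item immediately
# instead of waiting for the whole completion.
users = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    response_model=Iterable[User],
    stream=True,
    messages=[{"role": "user", "content": "Alice is 31, Bob is 27, Carol is 45."}],
)
for user in users:
    print(user)  # arrives incrementally, not all at once
```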
Jun 29, 2023 • 6 tweets • 2 min read
Why prompt engineer @openai with strings?
Don't even make it a string, or a DAG; make it a pipeline.
Single level of abstraction:
Tool and Prompt and Context and Technique? It's the same thing; it's a description of what I want.
The code is the prompt. None of this shit
"{}{{}} {}}".format{"{}{}"
If you've followed me from the last @LangChainAI webinar, I wanted to share the repo that contains the code examples. Contributions of other ideas, evals, or examples are totally welcome. If you want to help, check the issues!
1/ ✨Constructing recursive data structures using @OpenAI's function call and @pydantic (part 1)
I wanted to share a little exploration I’ve been doing. This approach has the potential to change how we handle complex query routing, thinking, and planning in LLMs.
It enables the specification and parsing of hierarchical, graph-like data. By leveraging @pydantic's recursive definitions and JSON schema, we can define and work with complex hierarchical data structures much more easily.
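A minimal sketch of the pattern in Pydantic v2 syntax (the field names are illustrative; the original exploration used the Pydantic of mid-2023):

```python
from __future__ import annotations

from pydantic import BaseModel, Field

class Query(BaseModel):
    """A node in a query plan: a question plus the sub-questions it depends on."""
    question: str
    dependencies: list[Query] = Field(
        default_factory=list,
        description="Sub-queries that must be answered before this one.",
    )

# The self-reference produces a recursive JSON schema ($ref back to Query),
# which a function-calling LLM can target to emit an arbitrarily deep tree.
plan = Query.model_validate({
    "question": "Compare Q3 revenue across regions",
    "dependencies": [
        {"question": "What was Q3 revenue in NA?"},
        {"question": "What was Q3 revenue in EU?"},
    ],
})
print(plan.dependencies[0].question)
```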
Jun 18, 2023 • 5 tweets • 2 min read
A teaser of some of the stuff I'll talk about at the @LangChainAI webinar. Notice that my 'citations' are spans of strings, not 'page number' or 'chunk_id' :)
Another teaser, but notice that Note has children which are also nodes. Can you figure out what's going on?
Jun 14, 2023 • 8 tweets • 3 min read
Some tips on using function calls in the new @OpenAI release:
It's really a naming issue: 'function call' conflates structured JSON output with tool use.
I'm going to go through some examples of just using function calls to extract JSON with @pydantic.
If you don't want to write JSON schema by hand, you can use Python to define type-safe schemas.
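A sketch of that idea with the current OpenAI tools interface (the 2023 thread used the then-new `functions` parameter; the model and field names here are illustrative):

```python
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetail",
            "parameters": UserDetail.model_json_schema(),  # Pydantic writes the schema
        },
    }],
    tool_choice={"type": "function", "function": {"name": "UserDetail"}},
)

arguments = response.choices[0].message.tool_calls[0].function.arguments
user = UserDetail.model_validate_json(arguments)  # type-safe parse back into Python
print(user)
```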