jason liu - vacation mode
applied ai consultant, creator of instructor. prev. staff mle @stitchfix, safety @meta, physics @uwaterloo. follow me for content on rag, consulting, and life.
Oct 30, 2024 4 tweets 2 min read
Imagine spending $10,000 in a single weekend on LLM testing.

That's exactly what happened to one of my clients when a junior engineer became overly enthusiastic with their evaluation suite. That moment served as a wake-up call.

The VP had been advocating for sophisticated LLM-based testing systems but was missing something crucial: sometimes the simplest solution is the best one. After my first week, we began to:

- Write tests that run in seconds, not hours
- Catch critical issues before they hit production
- Save thousands on evaluation costs
- Provide engineers with clear metrics for improvement

All of these changes significantly impacted how we build AI systems. In less than a month, we transitioned from burning cash on complex evaluations to having a lean, effective testing system that accurately predicts user satisfaction. Here are the key metrics that can be implemented with simple code:

**Format Consistency Tests**
- Markdown heading structure (regex for H1, H2, etc.)
- List formatting (bullet points, numbering)
- Code block detection
- Table structure validation
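
For example, checks like these can be a few lines of regex each (function names and rules here are illustrative, not from the client project):

```python
import re

def has_single_h1(markdown: str) -> bool:
    """Exactly one top-level heading."""
    return len(re.findall(r"^# ", markdown, flags=re.MULTILINE)) == 1

def bullets_are_consistent(markdown: str) -> bool:
    """All bullet lines use the same marker ('-' or '*')."""
    markers = set(re.findall(r"^([-*]) ", markdown, flags=re.MULTILINE))
    return len(markers) <= 1

def code_blocks_are_closed(markdown: str) -> bool:
    """Every code fence has a matching closer."""
    return markdown.count("`" * 3) % 2 == 0
```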

**Content Quality Checks**
- Response length (character count vs. target)
- Compression ratio (summary vs. original)
- Language consistency (detect language switches)
- Name/entity validation (against source text)
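
Again, nothing fancy. A rough sketch (the tolerance and the entity list are placeholders you'd tune per use case):

```python
def length_in_range(response: str, target_chars: int, tolerance: float = 0.3) -> bool:
    """Character count within +/- 30% of the target by default."""
    return abs(len(response) - target_chars) <= tolerance * target_chars

def compression_ratio(summary: str, original: str) -> float:
    """How much shorter the summary is than the source (lower = more compressed)."""
    return len(summary) / max(len(original), 1)

def entities_grounded(response: str, source: str, entity_names: list[str]) -> bool:
    """Any candidate entity mentioned in the response must also appear in the source."""
    return all(
        name.lower() in source.lower()
        for name in entity_names
        if name.lower() in response.lower()
    )
```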

**Retrieval Quality**
- Precision at K (% of relevant chunks in top K results)
- Recall at K (% of important info retrieved)
- MRR (Mean Reciprocal Rank of the first relevant result)
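
These are plain Python too. A sketch, assuming you've labeled which chunks are relevant for each test query:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    return sum(1 for chunk in top_k if chunk in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that show up in the top k."""
    top_k = retrieved[:k]
    return sum(1 for chunk in relevant if chunk in top_k) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none was retrieved)."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0
```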

**Performance Metrics**
- Response time
- Token usage
- API costs per request
- Error rates
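
One way to capture these is a thin wrapper around the completion call. A sketch; the per-token prices are placeholders, substitute your model's actual rates:

```python
import time
from openai import OpenAI

client = OpenAI()

def timed_completion(**kwargs):
    """Wrap a completion call with latency, token, and rough cost tracking."""
    start = time.perf_counter()
    resp = client.chat.completions.create(**kwargs)
    latency = time.perf_counter() - start
    usage = resp.usage
    metrics = {
        "latency_s": round(latency, 3),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        # placeholder prices per token; substitute your model's actual rates
        "cost_usd": usage.prompt_tokens * 1e-6 + usage.completion_tokens * 3e-6,
    }
    return resp, metrics
```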

Each of these can be implemented with basic Python code and executed in milliseconds.

**Connecting Metrics to Outcomes**
These technical metrics map to three levels of business impact:

1. **Algorithm Metrics (the ones above)**
- Run every morning
- Take seconds to execute
- Provide immediate feedback

2. **User Feedback**
- Thumbs up/down ratings
- Time spent reading
- Follow-up questions asked
- Features used (copy, share, edit)

3. **Business Outcomes**
- User retention
- Task completion rates
- Customer satisfaction scores
- Support ticket volume

This kind of test:

- Runs in milliseconds
- Provides clear yes/no results
- Can be automated
- Costs nothing to execute
Aug 27, 2024 5 tweets 2 min read
Every mistake I see people make when consulting.

Tips on writing better proposals.
Aug 3, 2024 4 tweets 1 min read
Do you want to feel better about charging more?

What should I write about in my “how to charge more” post?

I went from my full-time job paying $480k a year to freelancing, then consulting, then advising.

I started at $170/hr in March 2023, and this April I made $130k in one month, soon $230k in one month. I've been incredibly uncomfortable with charging more and want to share how my mentality changed in the past year.

Open to questions. I want to write this because I think technical people are usually pretty allergic to sales and to thinking about benefits and outcomes, and are ultimately way more exploited than they realize.
Jul 14, 2024 5 tweets 2 min read
So, what is a system?

A system is a structured approach to solving problems or accomplishing tasks. It's a set of organized principles, methods, or procedures that guide how we think about and tackle challenges. In the context of AI and RAG applications, my system includes:

* A framework for evaluating different technologies and tools
* A decision-making process for prioritizing development efforts
* A methodology for diagnosing and improving application performance
* A set of standardized metrics and benchmarks for measuring success

The beauty of a system is that it provides consistency and repeatability. Instead of reinventing the wheel each time you face a similar problem, you have a trusted process to fall back on. This is especially valuable in the fast-paced, often uncertain world of AI development.

A good system doesn't constrain creativity or flexibility. Rather, it provides a foundation that allows you to be more efficient with routine tasks, freeing up mental energy for innovation and tackling unique challenges. This is what I plan on teaching you in our course.

Rather than implementing whatever is the hot blog post of the day, I'm going to share what I've learned from building search systems at large companies like Meta and Stitchfix.

Share anecdotes from my consulting and draw parallels to classic search problems.

maven.com/applied-llms/r…
Jul 2, 2024 5 tweets 3 min read
You're working on a new AI-powered RAG application, but the process is hectic. There are many competing priorities, and not enough development time. Even if you had the time for everything, you're unsure how to improve the system. You know that somewhere in this chaotic mix is "the right path" - a sequence of actions that results in the most growth in the least amount of time. However, you're lucky if you're even going in the right direction, as each day of work feels like another roll of the dice.



To build with new AI systems, you obviously need technical skills - that's the baseline. But what separates the successful from the unsuccessful is not technical; it's the frameworks for decision-making and resource allocation: knowing what's worth working on, how to prioritize, what tradeoffs are worth making, what metrics to look at, and what to ignore. tome.app/fivesixseven/a…

If you don't have these skills, your success depends entirely on someone above you having them and telling you exactly what to work on. You know you have to improve and make it better, but that doesn't give you a plan you can execute day-to-day. Avoid wasting engineering cycles, losing customers, or worse, never shipping.

Fortunately, these skills are not a magical trait that you either have or don't. They are a skill set distinct from the technical skills needed to build with AI systems, and many people never get the chance to learn them. But you can learn them, just like anything else. As someone who's been building recommendation systems and working with machine learning models for the past seven years, my goal is to give you the skills you need to succeed.
May 23, 2024 4 tweets 2 min read
once a week i tell a founder "stop trying to finetune models, and just go sell, use opus, use 4-turbo, and just raise prices, find value, go sell, and sell to rich people,

stop selling to developers, sell to capital allocators, not wage workers. make your roadster, get the money, and make the model 3 afterwards."

how am i signing a 6-figure contract in a month as a solo bootstrapped twitter influencer with a suspended business account and an open source library, and you are not!? just promise that the thing they buy will give them status.

your note taking app is about "being a better executor"

your meetings app is about "being a better sales person"

your rag app is about "being a better decision maker"

your diligence ai agent is about "avoiding profit erosion"
Feb 17, 2024 4 tweets 2 min read
Introducing Instructor Hub in 150 lines of python code

1. uses raw.githubusercontent.com as the backend to get version control and a cdn (serverless lol)
2. uses pytest-examples to lint and test every example (never merge bad code to the hub)
3. you can view cookbooks from the cli and pull code directly to disk.

why?

This means that all the code you pull is linted and tested, and matches up 1:1 with the documentation, everything in the hub.

Which means you own all of the code: no magic, just python, pydantic, and openai.

There's lots to do, but none of it is needed yet. Once we get >30 items we'll implement search.

github.com/jxnl/instructo…
Feb 8, 2024 4 tweets 1 min read
Guaranteed structured output with Ollama and Pydantic. Check out the blog post to learn more about @pydantic and @ollama

jxnl.github.io/instructor/blo…
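
The gist of the pattern, if you don't want to click through: point the OpenAI client at Ollama's OpenAI-compatible endpoint and patch it with instructor. Model name and fields below are placeholders; see the post for the full example.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    age: int
    facts: list[str]

# Ollama serves an OpenAI-compatible API locally; instructor adds response_model
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

character = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Tell me about Harry Potter"}],
    response_model=Character,
)
print(character.model_dump_json(indent=2))
```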
Dec 31, 2023 4 tweets 2 min read
if you know python, you don't need any LLMops.

trust me, i've trained models and deployed production applications that serve >350M requests a day.

just need `pip install` and some good naming conventions

1. jinja - prompting frameworks
2. numpy - vector search
3. sqlite - evals, one row per exp
4. boto3 - data management, s3 and some folder structure
???
5. google sheets ;) - experiment tracking w/ a link to the artifacts saved in S3/GCS.
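
To make #2 concrete, this is roughly all brute-force vector search needs to be (the file name and dimensions are placeholders):

```python
import numpy as np

# (n_items, dim) matrix of item embeddings, normalized once at load time
index = np.load("embeddings.npy")
index = index / np.linalg.norm(index, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force cosine similarity: one matrix-vector product, then argsort."""
    q = query / np.linalg.norm(query)
    scores = index @ q
    return np.argsort(-scores)[:k]
```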

Disagree? I've been training models in @PyTorch, and deploying them via @FastAPI since the library came out!

we did large image classification tasks where the folder structure reflected class labels and had a config.json in each directory.

our early a/b tests exported to google sheets, and we served similar-item recommendations via numpy brute force over 3M skus with 40 dimensions per vector (umap over resnet and matrix factorization machines)

newsroom.stitchfix.com/blog/your-shop…
Oct 12, 2023 7 tweets 2 min read
For those who couldn't attend my 'Keynote' at @aiDotEngineer

> Pydantic is all you need

Making LLMs, software 3.0, backwards compatible with software 1.0



Other links in the thread!

youtube.com/live/veShHxQYP…

Presentation slides:

tome.app/fivesixseven/p…
Jul 18, 2023 7 tweets 3 min read
0/ Any real AI engineer knows that streaming REALLY improves the UX.

Today, I'm landing a change that defines a reliable way to stream out multiple @pydantic objects from @OpenAI.

Take a look; by the end, you'll know how to do streaming extraction and why it matters.

1/ Streaming is critical when building applications where the UI is generated by the AI.

Notice in the screenshot that the first item was returned in 560ms but the last one in almost 2000ms, a 4x difference in time to first content.

How do we do this?
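
For reference, this is roughly what the pattern looks like with the current instructor API (the model name is a placeholder): each object is yielded as soon as its JSON finishes streaming, instead of waiting for the full response.

```python
from typing import Iterable

import instructor
from openai import OpenAI
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

users = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: Jason is 25, Sarah is 30, Tim is 45"}],
    response_model=Iterable[User],
    stream=True,
)

for user in users:
    print(user)  # render each card in the UI as it arrives
```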
Jun 29, 2023 6 tweets 2 min read
Why prompt engineer @openai with strings?

Don't even make it a string, or a dag, make it a pipeline.

Single level of abstraction:

Tool and Prompt and Context and Technique? it's the same thing, it's a description of what I want.

The code is the prompt. None of this shit
"{}{{}} {}}".format{"{}{}"

PR in the next tweet. @OpenAI github.com/jxnl/openai_fu…
Jun 22, 2023 5 tweets 2 min read
If you've followed me from the last @LangChainAI webinar, I wanted to share the repo that contains the code examples. Contributions of other ideas, evals, or examples are totally welcome. If you want to help, check the issues!



🧵 github.com/jxnl/openai_fu…

I'm not trying to make a framework!

It's my exploration of abstractions and design patterns, something that can be leveraged (and has been) in both @LangChainAI & @llama_index.

It's also a public notebook of sorts to capture some ideas I have in my mind.
Jun 21, 2023 7 tweets 3 min read
1/ Defending from SQL injections w/ @OpenAI

I tried to add another layer of protection for SQL-writing systems by returning not just the string, but the template and parameters separately.

So let's use @pydantic again to build a small example of going beyond SQL strings!

2/ So again, let's begin with the schema:

1) we define literals and identifiers as enum
2) we define what a parameter is so we can template
3) lastly we ask for the string template to use

as a nice-to-have, we try to mark whether it's safe, which could be used to warn or hook into a UI.
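
A minimal sketch of that schema (the field names are my reconstruction of the idea, not the exact code from the screenshots):

```python
from enum import Enum
from pydantic import BaseModel, Field

class ParameterType(str, Enum):
    literal = "literal"        # a value to be bound by the driver
    identifier = "identifier"  # a table or column name

class Parameter(BaseModel):
    key: str
    value: str
    type: ParameterType

class SQLQuery(BaseModel):
    query_template: str = Field(
        description="SQL with named placeholders, never values inlined into the string"
    )
    query_parameters: list[Parameter]
    is_dangerous: bool = Field(
        description="True if the request looks like injection or destructive DML"
    )

    def execute(self, conn):
        if self.is_dangerous:
            raise ValueError("Query flagged as dangerous, refusing to run")
        # the driver binds the values; we never string-format them into the query
        return conn.execute(self.query_template,
                            {p.key: p.value for p in self.query_parameters})
```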
Jun 20, 2023 6 tweets 4 min read
1/ 🤖 Towards making better planners (agents?)

In a previous post, I introduced recursive schemas using @pydantic and @OpenAI, showcasing their potential to create a directed acyclic graph (DAG).

I got it to work, but I encountered some challenges and had to make adjustments.… twitter.com/i/web/status/1…

2/ 🌀 The first trick: removing recursion.

Surprisingly, removing self-reference and using IDs with a dependency list in a flat structure significantly improved performance.

More surprisingly, I realized this after moving to gpt4 and it insisting on this schema rather than… twitter.com/i/web/status/1…
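
Roughly, the flat schema looks like this (names are illustrative); the dependency ids make it trivial to recover an execution order:

```python
from pydantic import BaseModel, Field

class Task(BaseModel):
    id: int
    task: str
    # no recursion: each task just lists the ids of the tasks it depends on
    depends_on: list[int] = Field(default_factory=list)

class TaskPlan(BaseModel):
    tasks: list[Task]

    def execution_order(self) -> list[Task]:
        """Naive topological sort over the dependency ids."""
        done: set[int] = set()
        ordered: list[Task] = []
        remaining = list(self.tasks)
        while remaining:
            ready = [t for t in remaining if set(t.depends_on) <= done]
            if not ready:
                raise ValueError("cycle detected in task dependencies")
            for t in ready:
                ordered.append(t)
                done.add(t.id)
            remaining = [t for t in remaining if t.id not in done]
        return ordered
```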
Jun 20, 2023 9 tweets 4 min read
1/ 😧 How to cite your sources for LLMs

I think the way we do citations needs to change as our LLM-powered QA systems start solving higher-stakes tasks.

This thread is how we can do far better than that!

Here's an example; notice that what I cite is in <>, moreover, the… twitter.com/i/web/status/1…

2/ How does it work now?

Currently, citations involve matching retrieved text chunk IDs to generations in two ways:

1. Identifying sources: "The author attended uwaterloo and was the president of the data science club" (Source: 2, 5, 23).

2. Citations: "The author attended… twitter.com/i/web/status/1…
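
The span-based alternative is easy to validate with plain Python: ask the model to return exact substrings, then check them against the source. A sketch; field names are illustrative:

```python
from pydantic import BaseModel, Field

class Fact(BaseModel):
    statement: str
    substring_quote: list[str] = Field(
        description="Exact spans copied verbatim from the source text"
    )

class AnswerWithCitations(BaseModel):
    facts: list[Fact]

def hallucinated_quotes(answer: AnswerWithCitations, source: str) -> list[str]:
    """Any quote that isn't literally present in the source is a bad citation."""
    return [
        quote
        for fact in answer.facts
        for quote in fact.substring_quote
        if quote not in source
    ]
```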
Jun 19, 2023 8 tweets 4 min read
1/ ✨Constructing recursive data structures using @OpenAI's function call and @pydantic (part 1)

I wanted to share a little exploration I’ve been doing. This approach has the potential to change how we handle complex query routing, thinking, and planning in LLMs.

But first I'd… twitter.com/i/web/status/1…

2/ Why is this cool?

It enables the specification and parsing of hierarchical, graph-like data. By leveraging @pydantic's recursive definitions and JSON schema, we can now define and work with complex hierarchical data structures much more easily.
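
A small self-contained sketch of a recursive schema (the file-tree example is mine, not from the thread):

```python
from __future__ import annotations
from enum import Enum
from pydantic import BaseModel, Field

class NodeType(str, Enum):
    file = "file"
    folder = "folder"

class Node(BaseModel):
    name: str
    node_type: NodeType
    # the recursion: a node's children are themselves nodes,
    # so one schema describes an arbitrarily deep tree
    children: list[Node] = Field(default_factory=list)

Node.model_rebuild()  # resolve the self-reference

tree = Node(
    name="root",
    node_type=NodeType.folder,
    children=[
        Node(name="readme.md", node_type=NodeType.file),
        Node(name="src", node_type=NodeType.folder,
             children=[Node(name="main.py", node_type=NodeType.file)]),
    ],
)
print(tree.model_dump_json(indent=2))
```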
Jun 18, 2023 5 tweets 2 min read
A teaser of some of the stuff I'll talk about at the @LangChainAI webinar. Notice that my 'citations' are spans of strings, not 'page number' or 'chunk_id' :)

Another teaser: notice that Note has children which are also Notes. Can you figure out what's going on?
Jun 14, 2023 8 tweets 3 min read
Some tips on using function calls in the new @OpenAI release:

It's really a naming issue where a function call conflates structured json with tool use.

I'm going to go through some examples of just using function calls to extract json w/ @pydantic.

If you don't want to write json schema by hand, you can use python to define type-safe schemas.
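
The thread predates the current SDK; here's the same idea with today's tools API and pydantic v2 (the model name is a placeholder):

```python
from openai import OpenAI
from pydantic import BaseModel

class UserDetails(BaseModel):
    name: str
    age: int

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetails",
            "description": "Extract the user's details",
            "parameters": UserDetails.model_json_schema(),
        },
    }],
    tool_choice={"type": "function", "function": {"name": "UserDetails"}},
)

arguments = resp.choices[0].message.tool_calls[0].function.arguments
user = UserDetails.model_validate_json(arguments)  # raises if the JSON is off-schema
print(user)
```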