AI Founder: building @arizeai & @arizephoenix 💙
I post about LLMs, LLMOps, Generative AI, ML and occasionally Amazing Race
Jun 26, 2024 • 7 tweets • 4 min read
Does LLM as a Judge work for SQL-Gen?
Spoilers from our latest research - adding Schema table definitions to the Eval prompt might be the deciding factor in helping it “judge” better… but don’t add the entire schema!
We took an LLM as a Judge and used it to evaluate generated SQL queries, comparing them against golden queries on golden datasets.
🧵 below on how well LLM as a Judge catches SQL issues and where it goes wrong:
(2/7) How do you even evaluate SQL generation?
The @defogdata team put out some research on this and has even released their own LLM for converting natural-language questions into SQL queries.
One approach to evaluating SQL queries (the one taken by defog) is to create a set of golden queries on a golden dataset, then compare the resulting data (sketched in code after the checklist below).
✅A golden dataset question is used for AI SQL generation
✅The AI-generated SQL is run to produce test result “A”
✅The golden dataset question has an associated golden query
✅The golden query is used on the database and returns results “B”
✅The AI-generated results “A” are compared with the golden results “B”
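A minimal sketch of that comparison flow, assuming a local SQLite database and a hypothetical generate_sql() helper wrapping whatever LLM produces the SQL:

```python
# Minimal sketch of the golden-query comparison flow described above.
# Assumes a local SQLite database and a hypothetical `generate_sql()` helper.
import sqlite3
import pandas as pd

def results_match(db_path: str, ai_sql: str, golden_sql: str) -> bool:
    """Run both queries and compare the returned data, ignoring row order."""
    with sqlite3.connect(db_path) as conn:
        result_a = pd.read_sql_query(ai_sql, conn)      # "A": AI-generated SQL
        result_b = pd.read_sql_query(golden_sql, conn)  # "B": golden query
    # Sort rows and reset the index so logically identical results compare equal.
    norm_a = result_a.sort_values(list(result_a.columns)).reset_index(drop=True)
    norm_b = result_b.sort_values(list(result_b.columns)).reset_index(drop=True)
    return norm_a.equals(norm_b)

# question = "Which city had the highest total sales in 2023?"
# ai_sql = generate_sql(question)          # hypothetical LLM call
# print(results_match("analytics.db", ai_sql, golden_sql))
```

Sorting rows before comparing means queries that return the same data in a different order still count as a match.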
Mar 29, 2024 • 6 tweets • 5 min read
(1/6) Can LLMs Do Time Series Analysis ⏲️? GPT-4 vs Claude 3 Opus 🥊
We have seen a lot of customers trying to apply LLMs to all kinds of data, but have not seen many Evals that show how well LLMs can analyze patterns in non-text data - especially time series 🕰️
Ex: Teams are launching GPT stock pickers 💸 without testing how good LLMs are at basic time series pattern analysis!
We set out to answer the following question: if we fed in a large set of time series data into the context window, how well can the LLM detect anomalies or movements in time series data 🤔?
AKA should you trust your money with a stock picking GPT-4 or Claude 3 agent? Cut to the chase - the answer is NO🚫!
We tested both GPT-4 and Claude to find the anomalous time series patterns, mixed in with the normal time series. The goal is to find the anomalous ones ONLY (aka the ones that had spikes).
We tested a sweep of context windows of different lengths. In each test, we stuffed 100’s of time series 📈where each time series represented a metric graphed over time (in JSON) for a world city 🌐.
The LLMs are asked to detect movements or increases over a specific % of size and name the time series “city” 🌐 & date 📅 where the anomaly was detected.
This is a pretty hard test - an LLM is required to look at patterns throughout its context window and detect anomalies across a large set of time series at the same time. It is also required to synthesize the results, name the time series and the date of each movement, and group by date. It’s also required to not produce false positives by talking about the other cities in the data.
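For concreteness, here is a rough sketch of how a stuffed-context test like this can be assembled - the city names, spike size, series length, and prompt wording are illustrative stand-ins, not the exact test we ran:

```python
# Rough sketch of building the stuffed-context anomaly test described above.
# City names, series length, spike size, and prompt wording are illustrative.
import json
import random
from datetime import date, timedelta

def make_series(city: str, spike: bool, days: int = 60) -> dict:
    """One metric-over-time series for a city; optionally inject a large spike."""
    values = [round(100 + random.gauss(0, 2), 2) for _ in range(days)]
    if spike:
        values[random.randrange(1, days)] *= 1.5  # ~50% day-over-day jump
    dates = [(date(2024, 1, 1) + timedelta(days=d)).isoformat() for d in range(days)]
    return {"city": city, "metric": dict(zip(dates, values))}

cities = [f"city_{i}" for i in range(200)]
anomalous = set(random.sample(cities, 5))
payload = [make_series(c, spike=(c in anomalous)) for c in cities]

prompt = (
    "Below are JSON time series, one metric per city.\n"
    f"{json.dumps(payload)}\n\n"
    "List ONLY the cities and dates where the metric increased more than 40% "
    "day-over-day. Do not mention any other cities."
)
# response = model.generate(prompt)  # hypothetical LLM call
```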
The % movement calculation requires the LLM to do “math” 🔢 over the time series, which they are generally not very good at doing. The test below shows the stark difference between Claude 3 Opus, Claude 3 Sonnet, and GPT-4 Turbo 🥊
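Continuing the sketch above, grading the model means doing that same % movement math programmatically; a hedged version of the ground-truth check:

```python
# Ground-truth side of the test: compute each city's largest day-over-day % move
# programmatically so the LLM's answer can be graded. Threshold is illustrative.
def max_day_over_day_move(series: dict[str, float]) -> tuple[str, float]:
    """Return (date, percent_change) for the largest day-over-day move."""
    dates = sorted(series)
    best_date, best_pct = dates[0], 0.0
    for prev, curr in zip(dates, dates[1:]):
        pct = abs(series[curr] - series[prev]) / series[prev] * 100
        if pct > best_pct:
            best_date, best_pct = curr, pct
    return best_date, best_pct

# Using `payload` from the construction sketch above:
# truly_anomalous = {
#     item["city"]: max_day_over_day_move(item["metric"])
#     for item in payload
#     if max_day_over_day_move(item["metric"])[1] > 40  # same threshold as the prompt
# }
```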
Mar 8, 2024 • 5 tweets • 4 min read
(1/5) 🚀💥Wow! With @AnthropicAI Claude 3 Opus, A GPT-4 Competitor Has Arrived!!
🗓️March 14th 2023 was the launch of GPT-4! Nearly a year later, a true rival emerges.
🔍We tested Claude 3 Opus using Evals built on @ArizePhoenix. The @AnthropicAI team was not kidding, it’s really good 🔥 These are the most extensive Model Evals of Claude 3 in the ecosystem - they're 3rd-party, handcrafted evals that likely have not been overfit yet by the model providers.
📊 First, the Model Eval test: Retrieval with Generation. This is not a normal haystack retrieval test. This is an aggressive variation of haystack that requires:
✅Retrieving 2 random numbers placed in the same section of the corpus
✅ Rounding those numbers to 2 decimal places
✅Formatting as $
✅Calculating a percentage growth
This is exactly what you might be doing if you are, say, analyzing SEC filings. It is a real example that matches actual use cases.
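As a sketch of what the test asks for, the ground-truth answer can be computed like this - the two needle numbers and the exact answer format below are illustrative:

```python
# Ground truth for the retrieval-with-generation test: round both retrieved
# numbers to 2 decimal places, format them as dollars, and compute % growth.
# The input numbers and the answer format here are illustrative.
def expected_answer(first: float, second: float) -> str:
    a, b = round(first, 2), round(second, 2)
    growth = (b - a) / a * 100
    return f"${a:,.2f} -> ${b:,.2f}, growth of {growth:.2f}%"

print(expected_answer(1234.5678, 1481.4814))
# "$1,234.57 -> $1,481.48, growth of 20.00%"
```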
📈Now the results! Claude 3 is a huge upgrade from Claude 2.1 and looks to be competitive with GPT-4. The mistakes in the model results were all very small rounding errors of typically around 1% in the percentage calculation. It's not a huge mistake relative to what we’ve seen for auto-regressive models.
@GregKamradt is working on a repeatable version of these Haystack tests that can be run more periodically. Excited to see this come to fruition.
⭐We ran another retrieval-with-generation test 🧪 called date mapping. This Model Eval tests the ability of a model to retrieve data from the context window 🔎, manipulate it with a small addition ➕, and finally map it internally to a string month name.
The results were far better than @AnthropicAI Claude 2.1 - we have a new formidable AI in Claude 3 🧠
What this test does (see the sketch after this list) is:
✅Retrieve 2 random numbers from the same area in the context window
✅Grab the last digit of the one that is 7 digits long
✅Add 1 to it so the number falls in the range 1 to 10
✅Map that number to the month of the year
✅Concatenate it with a string representation of the other number which is a day
✅Show the final month:day together
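A small sketch of the ground-truth computation for this test; the input numbers and the month:day output format are illustrative:

```python
# Ground truth for the date-mapping test: take the last digit of the 7-digit
# number, add 1, map it to a month name, and join it with the day number.
# Input values here are illustrative.
import calendar

def expected_month_day(seven_digit_number: int, day_number: int) -> str:
    month_index = seven_digit_number % 10 + 1        # last digit + 1 -> 1..10
    month_name = calendar.month_name[month_index]    # 1 -> January, etc.
    return f"{month_name}:{day_number}"

print(expected_month_day(4827613, 22))  # last digit 3 -> 4 -> "April:22"
```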
The mistakes in both this test and the previous one are fairly small, and the models should really be viewed as comparable in this test area.
Mar 5, 2024 • 4 tweets • 3 min read
(1/4) Gemini has been trending a lot on Twitter 🔥 We wanted to bring the conversation back to actual LLM eval results. Through a lot of testing, we have found Gemini to be a very solid model.
We recently made 2️⃣ updates to the Gemini Needle in a Haystack test 🪡 based on some notes from the Google team. The final results show a perfect haystack result, similar to @JeffDean's results 💯
✅ Tokenizer: The tokenizer we used was incorrect and threw off the results of the first test. Fixing this did not fix all the results, but it did improve them. This was our miss.
✅ Prompting: Matching the prompt to the one we use for @AnthropicAI gave Gemini its best results yet - a perfect run, simply by adding the Anthropic prompt addition.
All evals run using @ArizePhoenix. Tagging relevant Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb
(2/4) Prompts Matter!!! ✨
There are 2⃣ prompts used here on the haystack test that gave fairly different results.
Our first prompt has not changed a lot since the first release, but I want to note we did not iterate on it to specifically improve retrieval.
We tested the @AnthropicAI addition we made to the prompt on @Google Gemini and it drastically improved the results 🔥
Here is the modified prompt:
"""
{context}
{question} Don't give information outside the document or repeat your findings.
Here is the magic number from the context:
"""
Evals are all the rage 🔥, but they mean different things to different people.
The biggest confusion is that there are actually 2 different categories of evals.
1⃣Model evals (ex: HellaSwag, MMLU, TruthfulQA etc)
2⃣Task evals (ex: Q&A from Phoenix Evals)
Model Evals vs Task Evals is the difference between measuring "generalized fitness" 💪 and "specialized fitness" 🥊
Most of us would like to have generalized fitness because it allows us to do a variety of everyday activities well. But if sumo wrestling was your dream, you would obviously prefer to have a much larger body mass.
The problem is, most practitioners today are focusing on generalized fitness and getting crushed in the ring ☠️
🧵 on the differences
Tagging folks working on the LLM Model or Task Eval space!
If you are an LLM application builder (the vast majority of us), you are wasting your time looking at model evals ❌ Model Evals are really for people who are building or fine-tuning an LLM.
The only reason you, as an LLM application builder, would look at them is to choose the best model to use in your system ⚖️
Ok, so what is it?? A Model Eval is a set of questions you ask a model with a set of ground truth answers that you use to “grade” the response.
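A minimal sketch of what that grading loop looks like, with a hypothetical ask_model() wrapper and a simple exact-match grader (real harnesses use more nuanced scoring):

```python
# Minimal model eval loop: ask each question and grade against the ground truth.
# `ask_model` is a hypothetical wrapper around whichever model is being tested;
# exact-match grading is a simplification of what real harnesses do.
def exact_match_accuracy(eval_set: list[dict], ask_model) -> float:
    """eval_set items look like {"question": ..., "answer": ...}."""
    correct = 0
    for item in eval_set:
        prediction = ask_model(item["question"]).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct += 1
    return correct / len(eval_set)

# accuracy = exact_match_accuracy(benchmark_questions, ask_model)
```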
The @huggingface Open LLM leaderboard lists many common Model Eval results that include HellaSwag, MMLU, TruthfulQA, GSM8K etc…
If you want to run a Model Eval harness yourself, you can check out libraries such as @AiEleuther or @OpenAI.
(1/9) LLM as a Judge: Numeric Score Evals are Broken!!!
LLM Evals are valuable analysis tools. But should you use numeric scores or classes as outputs? 🤔
TLDR: LLMs suck at continuous ranges ☠️ - use LLM classification evals instead! 🔤
An LLM Score Eval uses an LLM to judge the quality of an LLM output (such as summarization) and outputs a numeric score. Examples of this include OpenAI cookbooks.
In the example here, we ran a spelling eval to evaluate how many words have a spelling error in a document and give a score between 1-10.
✅If every word had a spelling error, we’d expect a score of 10.
✅If no words had a spelling error, we’d expect a score of 0.
✅Everything in the middle would fall into the range with higher percentage of spelling errors landing closer to 10, and lower landing closer to 0.
✅Our expectation: the continuous score value should have some clear connection to the quantity of spelling errors in the paragraph.
‼️Our results however did not show that the score value had a consistent meaning. ‼️
In the example below, we have a paragraph where 80% of words have spelling errors, yet it gets a score of 10. We also have a paragraph where only 10% of words have a spelling error - with the same score of 10!
🧵 below is a rigorous study of how well LLMs handle continuous numeric ranges for LLM Evals. cookbook.openai.com/examples/evalu…
(2/9) What we did:
We inserted a quantifiable level of errors into a document. We asked an LLM to generate a score based on the % of words that had spelling errors.
The results are not great: almost 80% of the quantitative range is scored as the same number. Each point is a median of multiple random experiments at the same % corruption level.
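A hedged sketch of that setup - corrupt a known fraction of words, score each corrupted copy, and take the median per corruption level; judge_score() is a hypothetical stand-in for the LLM score eval:

```python
# Sketch of the experiment: misspell a controlled fraction of words, have the
# LLM score eval grade each corrupted copy, and take the median over trials.
import random
import statistics

def judge_score(document: str) -> int:
    """Hypothetical stand-in for the LLM score eval described above."""
    raise NotImplementedError  # replace with a real LLM call

def corrupt(text: str, error_rate: float) -> str:
    """Misspell roughly `error_rate` of the words by shuffling their letters."""
    words = text.split()
    for i in random.sample(range(len(words)), int(len(words) * error_rate)):
        letters = list(words[i])
        random.shuffle(letters)
        words[i] = "".join(letters)
    return " ".join(words)

def median_score(document: str, error_rate: float, trials: int = 10) -> float:
    scores = [judge_score(corrupt(document, error_rate)) for _ in range(trials)]
    return statistics.median(scores)

# for rate in (r / 10 for r in range(11)):  # 0%, 10%, ..., 100% corruption
#     print(rate, median_score(clean_document, rate))
```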