We improved @cline, a popular open-source coding agent, by +15% accuracy on SWE-Bench — without retraining LLMs, changing tools, or modifying Cline's architecture.
We achieved this simply by optimizing its ruleset in .clinerules, a user-defined section where developers add custom instructions to the system prompt, just like .cursor/rules in Cursor or CLAUDE.md in Claude Code.
Using our algorithm, Prompt Learning, we automatically refined these rules across a feedback loop powered by GPT-5.
Here’s how we brought GPT-4.1’s performance on SWE-Bench Lite to near state-of-the-art levels, matching Claude Sonnet 4.5, purely through ruleset optimization.
Inspired by RL, it follows an action → evaluation → improvement loop, but instead of gradients it uses Meta Prompting: feeding a prompt to an LLM and asking it to produce a better version.
We use LLM-generated feedback explaining why outputs were right or wrong, giving the optimizer richer signal to refine future prompts.
📈 Result: measurable gains in accuracy, zero retraining.
Use it in Arize AX or the Prompt Learning SDK.
How we optimized Cline:
Last time, we optimized Plan Mode. This time, we optimized Act Mode, giving Cline full permission to read, write, and edit code files, and tested its accuracy on SWE-Bench Lite.
Our optimization loop 🔁:
1️⃣ Run Cline on SWE-Bench Lite (150 train, 150 test) and record its train/test accuracy.
2️⃣ Collect the patches it produces and verify correctness via unit tests.
3️⃣ Use GPT-5 to explain why each fix succeeded or failed on the training set.
4️⃣ Feed those training evals — along with Cline’s system prompt and current ruleset — into a Meta-Prompt LLM to generate an improved ruleset.
5️⃣ Update .clinerules, re-run, and repeat (one loop is sketched in code below).
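Here is a minimal sketch of one loop in code, assuming an OpenAI-style client for the GPT-5 calls; run_cline and run_unit_tests are hypothetical placeholders for the actual SWE-Bench harness, and the task fields and meta-prompt wording are illustrative:
"""
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Single GPT-5 call, used for both the eval critiques and the meta-prompt step
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def optimization_step(system_prompt: str, ruleset: str, train_tasks: list) -> str:
    critiques = []
    for task in train_tasks:
        patch = run_cline(task, ruleset)      # 1) hypothetical: run Act Mode on one training task
        passed = run_unit_tests(task, patch)  # 2) hypothetical: verify the patch with the task's tests
        critiques.append(ask(                 # 3) GPT-5 explains why the fix succeeded or failed
            "Issue:\n" + task["problem"] + "\n\nPatch:\n" + patch +
            "\n\nUnit tests passed: " + str(passed) +
            "\nExplain why this fix succeeded or failed."
        ))
    # 4) Meta-prompt: training evals + system prompt + current ruleset in, improved ruleset out
    return ask(
        "You are optimizing the .clinerules ruleset for a coding agent.\n\n"
        "System prompt:\n" + system_prompt +
        "\n\nCurrent ruleset:\n" + ruleset +
        "\n\nCritiques of the agent's patches:\n" + "\n---\n".join(critiques) +
        "\n\nRewrite the ruleset to address these failure modes. Return only the new ruleset."
    )

# 5) Write the returned ruleset back to .clinerules, re-run, and repeat.
"""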
Results:
Sonnet 4.5 saw a modest +6% training and +0.7% test gain (already near saturation), while GPT-4.1 improved 14-15% on both train and test, reaching near-Sonnet performance (34% vs 36%) through ruleset optimization alone in just two loops!
These results highlight how prompt optimization alone can deliver system-level gains — no retraining, no new tools, no architecture changes. In just two optimization loops, Prompt Learning closed much of the gap between GPT-4.1 and Sonnet 4.5-level performance, proving how fast and data-efficient instruction-level optimization can be.
We used Phoenix to run LLM Evals on Cline’s code and track experiments across optimization runs.
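Roughly, those evals look like this with the Phoenix evals API; the template and column names are illustrative, and the exact llm_classify signature may differ across Phoenix versions:
"""
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Illustrative judge template; {problem} and {patch} are columns of the dataframe below
EVAL_TEMPLATE = (
    "You are judging a code patch for a SWE-Bench issue.\n"
    "Issue: {problem}\n"
    "Patch: {patch}\n"
    'Does the patch correctly fix the issue? Answer "correct" or "incorrect".'
)

df = pd.DataFrame([{"problem": "...", "patch": "..."}])

evals = llm_classify(
    dataframe=df,
    template=EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-5"),
    rails=["correct", "incorrect"],
    provide_explanation=True,  # the English explanation is what feeds the meta-prompt
)
"""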
Optimize Cline on SWE-Bench using Prompt Learning and see the improvement for yourself!
Reinforcement Learning in English – Prompt Learning Beyond just Optimization
@karpathy tweeted something this week that I think many of us have been feeling: the resurgence of RL is great, but it’s missing the big picture.
We believe the industry's chase after traditional RL is headed in the wrong direction. In chasing better policies and reward shaping, it's easy to miss a simpler tool we already have: language.
Today, we’re releasing our first research on Prompt Learning — an approach that uses natural language feedback to guide and improve agents.
It’s not prompt tuning, chain-of-thought prompting, or DSPy Simba — though we love what the @DSPyOSS team is building.
Instead of adjusting weights, we use MetaPrompting, where English evals and critiques, rather than just the scalar metrics the industry has typically relied on, drive targeted prompt updates.
Tagging people who would find this interesting:
@chengshuai_shi @ZhitingHu @HamelHusain @sh_reya @charlespacker @eugeneyan @swyx @dan_iter @sophiamyang @AndrewYNg @lateinteraction @cwolferesearch @tom_doerr @imjaredz @lennysan @shyamalanadkat @aakashg0 @apolloaievals @jerryjliu0 @joaomdmoura @jxnlco @abacaj @garrytan
Prompts, like models, should improve with feedback — not stay static.
Here’s how prompt learning works:
1️⃣ The prompt is treated as an online object — something that evolves over time
3️⃣ An LLM (or human) provides an assessment and an English natural-language critique, unlike most prompt optimization methods, which rely on scores alone
3️⃣ That natural language feedback is used as an error signal, passed into a MetaPrompt
4️⃣ The MetaPrompt updates the original prompt — either by rewriting it or inserting targeted instructions into specific sections
English feedback becomes the learning signal.
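A minimal sketch of steps 3️⃣ and 4️⃣, assuming an OpenAI-style client; the MetaPrompt wording and model choice are illustrative:
"""
from openai import OpenAI

client = OpenAI()

def update_prompt(prompt: str, critiques: list[str]) -> str:
    # One prompt-learning step: English critiques in, revised prompt out
    meta_prompt = (
        "Here is the current prompt:\n" + prompt +
        "\n\nHere are natural-language critiques of its outputs:\n" + "\n".join(critiques) +
        "\n\nEither rewrite the prompt or insert targeted instructions into the relevant "
        "sections so these failure modes are addressed. Return only the revised prompt."
    )
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return resp.choices[0].message.content
"""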
English Evaluation -> Explanation & Critique
We’ve backed English-first evals since we shipped LLM as a Judge in @arizephoenix (2023)
In this work, we complete the loop between evaluation and optimization—structured feedback becomes a mechanism not just for scoring, but for improving prompts.
Our system uses explanations, annotations, and rules as inputs to a MetaPrompt, refining prompts through natural language rather than weights.
Even a single round of English feedback can outperform scalar-only methods and beat baselines that already include the ruleset.
Spoilers from our latest research - adding Schema table definitions to the Eval prompt might be the deciding factor in helping it “judge” better… but don’t add the entire schema!
We took an LLM as a Judge and used it to evaluate AI-generated SQL queries, comparing their results against golden queries on golden datasets.
🧵below on how well LLM as a Judge catches SQL issues and where it goes wrong:
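To make the spoiler concrete, here is roughly the shape of a judge prompt that includes only the relevant table definitions; the schema, table, and column names are made up:
"""
# Illustrative judge template: include the relevant table definitions, not the entire schema
JUDGE_TEMPLATE = (
    "You are comparing an AI-generated SQL query against a golden query.\n\n"
    "Relevant schema:\n"
    "CREATE TABLE papers (author TEXT, citations INT, year INT);\n\n"
    "Question: {question}\n"
    "Golden SQL: {golden_sql}\n"
    "AI SQL: {ai_sql}\n\n"
    'Do the two queries answer the question with the same data? Answer "correct" or "incorrect" '
    "and explain any differences."
)
"""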
(2/7) How do you even evaluate SQL generation?
The @defogdata team put out some research on this and has even released their own LLM for converting natural-language questions into SQL queries.
One approach to evaluating SQL queries (as done by defog) is to create a set of golden queries on a golden dataset, then compare the resulting data.
✅A golden dataset question is used for AI SQL generation
✅The AI-generated SQL is run to produce test result “A”
✅The golden dataset question has an associated golden query
✅The golden query is used on the database and returns results “B”
✅The AI-generated results “A” are compared with the golden results “B”
(3/7) What does exact data matching look like?
One approach @defogdata took to testing whether a SQL query is correct is to compare exactly the data returned by the two queries. The example here is a query on author citations. The results of this query contain differences in the number of authors returned and in the citations by those authors.
Mismatch and fail!
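A minimal sketch of that exact-match check, using sqlite3 and pandas; the database, table, and query text are placeholders:
"""
import sqlite3
import pandas as pd

conn = sqlite3.connect("golden_dataset.db")  # placeholder golden database

# The AI query drops zero-citation authors, so its row count differs from the golden query's
ai_sql = "SELECT author, SUM(citations) AS citations FROM papers WHERE citations > 0 GROUP BY author"
golden_sql = "SELECT author, SUM(citations) AS citations FROM papers GROUP BY author"

result_a = pd.read_sql(ai_sql, conn)      # results "A" from the AI-generated query
result_b = pd.read_sql(golden_sql, conn)  # results "B" from the golden query

# Exact data matching: order-insensitive compare; any difference in rows or values is a fail
a = result_a.sort_values(list(result_a.columns)).reset_index(drop=True)
b = result_b.sort_values(list(result_b.columns)).reset_index(drop=True)
print("PASS" if a.equals(b) else "FAIL")
"""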
It's not quite so easy, though: should you fail on the 0 bins? 🤔
(1/6) Can LLMs Do Time Series Analysis ⏲️? GPT-4 vs Claude 3 Opus 🥊
We have seen a lot of customers trying to apply LLMs to all kinds of data, but have not seen many Evals that show how well LLMs can analyze patterns in non-text data - especially time series🕰️
Ex: Teams are launching GPT stock pickers 💸without testing how well LLMs are at basic time series pattern analysis!
We set out to answer the following question: if we fed in a large set of time series data into the context window, how well can the LLM detect anomalies or movements in time series data 🤔?
AKA should you trust your money with a stock picking GPT-4 or Claude 3 agent? Cut to the chase - the answer is NO🚫!
We tested both GPT-4 and Claude on finding the anomalous time series patterns mixed in with normal time series. The goal is to flag ONLY the anomalous ones (aka the ones that had spikes).
We tested a sweep of context windows of different lengths. In each test, we stuffed hundreds of time series 📈 where each time series represented a metric graphed over time (in JSON) for a world city 🌐.
The LLMs are asked to detect movements or increases over a specific % of size and name the time series “city” 🌐 & date 📅 where the anomaly was detected.
This is a pretty hard test - an LLM is required to look at patterns throughout its context window and detect anomalies across a large set of time series at the same time. It also has to synthesize the results, name the time series and the date of the movement, and group by date. Finally, it has to avoid false positives by not talking about the other cities in the data.
The % movement calculation requires the LLM to do “math” 🔢 over the time series, which they are generally not very good at doing. The test below shows the stark difference between Claude 3 Opus, Claude 3 Sonnet, and GPT-4 Turbo 🥊
(3/6) Setting Up a Time Series Test! 📈
We created a test suite that iterates through different context window sizes, generating a set of time series slots.
✅Created JSON formatted time series data with random noise
✅The noise can be 20-30% of the range
✅Tested % increases of single days of data that are above the noise level
✅Tested both extending the days of the anomaly and extending the % of increase
✅Tested a case where the math required is easier - we pre-calculate standard deviation
✅Tested a small number of anomaly events and a larger number of anomaly events
The image shown here is a view of what's in the context window using @ArizePhoenix!
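A minimal sketch of the kind of synthetic series described above; the city names, baseline, and spike size are illustrative:
"""
import json
import random
from datetime import date, timedelta

def make_series(city, days=60, baseline=100.0, noise_pct=0.25, spike_pct=0.5, spike_day=None):
    # One metric graphed over time for a city, as JSON, with an optional single-day spike
    start = date(2024, 1, 1)
    points = []
    for d in range(days):
        value = baseline * (1 + random.uniform(-noise_pct, noise_pct))  # noise at ~20-30% of range
        if d == spike_day:
            value = baseline * (1 + spike_pct)  # single-day increase above the noise level
        points.append({"date": (start + timedelta(days=d)).isoformat(), "value": round(value, 2)})
    return {"city": city, "metric": "requests", "data": points}

cities = ["Paris", "Tokyo", "Lima", "Cairo"]
anomalies = {"Tokyo": 30}  # ground truth: which cities spike, and on which day index
context = json.dumps([make_series(c, spike_day=anomalies.get(c)) for c in cities])
# Hundreds of these series get stuffed into the context window; the LLM must name only
# the anomalous cities and the dates of their spikes.
"""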
(1/5) 🚀💥Wow! With @AnthropicAI Claude 3 Opus, A GPT-4 Competitor Has Arrived!!
🗓️March 14th 2023 was the launch of GPT-4! Nearly a year later, a true rival emerges.
🔍We tested Claude 3 Opus using Evals built on @ArizePhoenix. The @AnthropicAI team was not kidding, it's really good 🔥 These are the most extensive Model Evals of Claude 3 in the ecosystem - they're third-party, handcrafted evals that likely have not yet been overfit by the model providers.
📊 First, the Model Eval test: Retrieval with Generation. This is not a normal haystack retrieval test. This is an aggressive variation of haystack that requires:
✅Retrieving 2 random numbers placed in the same section of the corpus
✅ Rounding those numbers to 2 decimal places
✅Formatting as $
✅Calculating a percentage growth
This is exactly what you might be doing if you are, say, analyzing SEC filings. It is a real example that matches actual use cases.
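A minimal sketch of how the needle and expected answer for this kind of test can be constructed; the company name, dollar figures, and 1% tolerance are illustrative:
"""
import random

q1 = round(random.uniform(1_000, 10_000), 2)    # first random number placed in the corpus
q2 = round(q1 * random.uniform(1.01, 1.50), 2)  # second number, placed in the same section

# Both numbers are embedded in the corpus, formatted as $ with 2 decimal places
needle = "Acme Corp revenue was $" + format(q1, ",.2f") + " in Q1 and $" + format(q2, ",.2f") + " in Q2."
expected_growth = round((q2 - q1) / q1 * 100, 2)  # the percentage growth the model must calculate

def close_enough(model_answer_pct: float, tolerance_pct: float = 1.0) -> bool:
    # Small rounding errors of around 1% in this calculation were the most common mistake (see below)
    return abs(model_answer_pct - expected_growth) <= tolerance_pct
"""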
📈Now the results! Claude 3 is a huge upgrade from Claude 2.1 and looks to be competitive with GPT-4. The mistakes in the model results were all very small rounding errors of typically around 1% in the percentage calculation. It's not a huge mistake relative to what we’ve seen for auto-regressive models.
@GregKamradt is working on a repeatable version of these Haystack tests that can be run more periodically. Excited to see this come to fruition.
(2/5) Test #2 : Retrieval with Generation Date Mapping
⭐We ran another retrieval-with-generation test 🧪 called date mapping. This Model Eval tests the ability of a model to retrieve data from the context window 🔎, manipulate it with a small addition ➕, and finally do an internal mapping to a string format as a month.
The results were far better than @AnthropicAI Claude 2.1 - we have a new formidable AI in Claude 3 🧠
What this test does is:
✅Retrieve 2 random numbers from the same area in the context window
✅Grab the last digit of the number that is 7 digits long
✅Add 1 to it so the number is in the range from 1 to 11
✅Map that number to the month of the year
✅Concatenate it with a string representation of the other number which is a day
✅Show the final month:day together
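A worked example of that mapping, with illustrative numbers:
"""
import calendar

seven_digit = 4831276  # the retrieved 7-digit number
day = 17               # the other retrieved number, used as the day

month_index = (seven_digit % 10) + 1           # last digit (6) plus 1 -> 7
month_name = calendar.month_name[month_index]  # 7 -> "July"
print(month_name + ":" + str(day))             # "July:17" (the exact output format is illustrative)
"""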
The mistakes in both this test and the previous one are fairly small, and the models really should be viewed as comparable in this test area.
(3/5) How does Claude 3 Opus stack up as LLM as a Judge?
We have an additional suite of tests 🧪 that we've handcrafted to understand how well a model can be used for Evals, the LLM as a Judge ⚖️ use case. These Model Evals span a number of LLM task evals: retrieval, hallucinations, Q&A, code functionality, and reference text analysis.
⭐️Where the @AnthropicAI Claude 2.1 model had fallen short of our expectations, Claude 3 shines.
At this point, when you dig into the individual examples where one of @OpenAI GPT-4 and Claude 3 beats the other, you will find the mistakes are really in the gray area.
Where we are landing: for now, on these tests, the two models are fairly comparable for LLM as a Judge.
(1/4) Gemini has been trending a lot on twitter 🔥 We wanted to bring the conversation back to actual LLM evals results. Through a lot of testing, we have found Gemini to be a very solid model.
We recently made 2️⃣ updates to the Gemini Needle in a Haystack test 🪡 based on some notes from the Google team. The final results show a perfect haystack result, similar to @JeffDean's results 💯
✅ Tokenizer: The tokenizer used was incorrect and threw off the results of the first test. Fixing this did not fix all the results, but it did improve them. This is our miss.
✅ Prompting: Matching the prompt to @AnthropicAI's gave Gemini its best results yet: a perfect execution, simply by using the Anthropic prompt addition.
All evals run using @ArizePhoenix. Tagging relevant Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb
(2/4) Prompts Matter!!! ✨
There are 2⃣ prompts used here on the haystack test that gave fairly different results.
Our first prompt has not changed a lot since the first release, but I want to note we did not iterate on it to specifically improve retrieval.
We tested the @AnthropicAI addition we made to the prompt on @Google Gemini and it drastically improved the results 🔥
Here is the modified prompt:
"""
{context}
{question} Don't give information outside the document or repeat your findings.
Here is the magic number from the context:
"""
(3/4) The tokenizer used in the Haystack tests was not the original T5 tokenizer.
This can create big problems in mapping between insertion phrases and actual content. This was one reason for the very poor showing in the initial tests.
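A minimal sketch of why the tokenizer matters when placing a needle at a given depth; tiktoken here is just a stand-in for whichever tokenizer the target model actually uses:
"""
import tiktoken

def insert_needle(haystack: str, needle: str, depth: float, encoding_name: str) -> str:
    # Place the needle at a fractional *token* depth of the context
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(haystack)
    cut = int(len(tokens) * depth)
    return enc.decode(tokens[:cut]) + needle + enc.decode(tokens[cut:])

# With the wrong tokenizer, the same "50% depth" lands in a different place in the text,
# and the measured context-window lengths are off as well.
"""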
Great thread unfolding by @karpathy on Tokenizers here:
All things said, @GoogleAI Gemini is a solid model 🦾and I think people are missing some of this on twitter. We will continue to test Gemini against Evals and show off results whether good or bad.
Evals are all the rage 🔥, but they mean different things to different people.
The biggest confusion is that there are actually 2 different categories of evals.
1⃣Model evals (ex: HellaSwag, MMLU, TruthfulQA etc)
2⃣Task evals (ex: Q&A from Phoenix Evals)
Model Evals vs Task Evals is the difference between measuring "generalized fitness" 💪 and "specialized fitness" 🥊
Most of us would like to have generalized fitness because it allows us to do a variety of everyday activities well. But if sumo wrestling was your dream, you would obviously prefer to have a much larger body mass.
The problem is, most practitioners today are focusing on generalized fitness and getting crushed in the ring ☠️
🧵 on the differences
Tagging folks working on the LLM Model or Task Eval space!
If you are an LLM application builder (the vast majority of us), you are wasting your time looking at model evals ❌ Model Evals are really for people who are building or fine-tuning an LLM.
The only reason you as an LLM application builder would look at them is to choose the best model to use in your system⚖️
Ok, so what is it? A Model Eval is a set of questions you ask a model, along with a set of ground-truth answers you use to "grade" the responses.
The @huggingface Open LLM leaderboard lists results for many common Model Evals, including HellaSwag, MMLU, TruthfulQA, GSM8K, etc.:
If you want to run a Model Eval harness yourself, you can check out the libraries from @AiEleuther or @OpenAI.
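In its simplest form, the "grading" in a multiple-choice Model Eval just compares the model's letter choice to an answer key. A minimal sketch, with an illustrative answer key:
"""
def score(predictions: dict, answer_key: dict) -> float:
    # Fraction of questions where the model's letter choice matches the ground truth
    correct = sum(predictions[q].strip().upper() == ans for q, ans in answer_key.items())
    return correct / len(answer_key)

answer_key = {"mmlu_physics_001": "B"}   # illustrative question id and ground-truth letter
predictions = {"mmlu_physics_001": "B"}  # the model's letter choice
print(score(predictions, answer_key))    # 1.0
"""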
In a Model Eval test set, every question is different, but many test sets have a theme or skill they are probing.
💎HellaSwag is a set of sentence completions that require a model to infer what might happen next in a specific scenario.
Example from HellaSwag
A tray of potatoes is loaded into the oven and removed. A large tray of cake is flipped over and placed on counter. a large tray of meat
A. is placed onto a baked potato
B. ls, and pickles are placed in the oven
C. is prepared then it is removed from the oven by a helper when done.
🔡MMLU is a broad set of questions from 57 different subjects. This one is from the College Physics section.
Example from MMLU
For which of the following thermodynamic processes is the increase in the internal energy of an ideal gas equal to the heat added to the gas?
A. Constant Temperature
B. Constant Volume
C. Constant Pressure
D. Adiabatic
These questions look tough ... BUT the dirty secret among the community is that there is likely a lot of data leakage and gaming of these public Model Evals 🤫