Aparna Dhinakaran · Oct 14
We improved @cline, a popular open-source coding agent, by +15% accuracy on SWE-Bench — without retraining LLMs, changing tools, or modifying Cline's architecture.

We achieved this simply by optimizing its ruleset in .clinerules — a user-defined section where developers add custom instructions to the system prompt, just like .cursor/rules in Cursor or CLAUDE.md in Claude Code.

Using our algorithm, Prompt Learning, we automatically refined these rules through a feedback loop powered by GPT-5.

Here’s how we brought GPT-4.1’s performance on SWE-Bench Lite to near state-of-the-art levels — nearly matching Claude Sonnet 4.5 — purely through ruleset optimization.

See our more detailed blog post 👉: arize.com/blog/optimizin…
What is Prompt Learning?

It’s an optimization algorithm for prompts.

Inspired by RL, it follows an action → evaluation → improvement loop — but instead of gradients, it uses Meta Prompting: feeding a prompt into an LLM and asking it to make it better.

We use LLM-generated feedback explaining why outputs were right or wrong, giving the optimizer richer signal to refine future prompts.

📈 Result: measurable gains in accuracy, zero retraining.

Use it in Arize AX or the Prompt Learning SDK.
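To make the loop concrete, here is a minimal Python sketch of one Prompt Learning step. It is illustrative only: run_agent and llm are caller-supplied callables standing in for your agent harness and any chat-completion call, and the meta-prompt wording is an assumption, not the SDK's actual template.

from typing import Callable

META_PROMPT = """You are optimizing an agent's prompt/ruleset.

CURRENT RULESET:
{ruleset}

EVALUATION FEEDBACK (why each output was right or wrong):
{feedback}

Rewrite the ruleset so the agent avoids the failures above. Return only the new ruleset."""

def prompt_learning_step(
    ruleset: str,
    examples: list[dict],
    run_agent: Callable[[str, dict], str],  # runs the agent on one task with the given ruleset
    llm: Callable[[str], str],              # any chat-completion call (evaluator and meta-prompter)
) -> str:
    # Action: run the agent with the current ruleset.
    outputs = [run_agent(ruleset, ex) for ex in examples]
    # Evaluation: have an LLM explain why each output succeeded or failed (English feedback).
    feedback = "\n\n".join(
        llm(f"Task: {ex['task']}\nOutput: {out}\nWas this correct? Explain why or why not.")
        for ex, out in zip(examples, outputs)
    )
    # Improvement: meta-prompt an LLM to produce a better ruleset from that feedback.
    return llm(META_PROMPT.format(ruleset=ruleset, feedback=feedback))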
How we optimized Cline:

Last time, we optimized Plan Mode - this time, we optimized Act Mode - giving Cline full permissions to read, write, and edit code files, and testing its accuracy on SWE-Bench Lite.

Our optimization loop 🔁:

1️⃣ Run Cline on SWE-Bench Lite (150 train, 150 test) and record its train/test accuracy.
2️⃣ Collect the patches it produces and verify correctness via unit tests.
3️⃣ Use GPT-5 to explain why each fix succeeded or failed on the training set.
4️⃣ Feed those training evals — along with Cline’s system prompt and current ruleset — into a Meta-Prompt LLM to generate an improved ruleset.
5️⃣ Update .clinerules, re-run, and repeat. (A minimal sketch of this loop follows below.)
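A hedged sketch of that outer loop in Python, for orientation only. The helpers (run_cline, run_unit_tests, explain_llm, meta_prompt_llm) are hypothetical placeholders for the real harness, and .clinerules is written as a single file here.

from pathlib import Path
from typing import Callable

def optimize_clinerules(
    ruleset: str,
    train_tasks: list[dict],
    run_cline: Callable[[dict, str], str],       # runs Cline in Act Mode on one task, returns a patch
    run_unit_tests: Callable[[str, str], bool],  # verifies a patch against the task's unit tests
    explain_llm: Callable[[str], str],           # e.g. GPT-5, explains why a fix passed or failed
    meta_prompt_llm: Callable[[str, str], str],  # rewrites the ruleset from rules + feedback
    n_loops: int = 2,
) -> str:
    for _ in range(n_loops):
        # 1) Run Cline on the training split with the current rules and collect its patches.
        patches = {task["id"]: run_cline(task, ruleset) for task in train_tasks}
        # 2) Verify correctness of each patch via unit tests.
        passed = {tid: run_unit_tests(tid, patch) for tid, patch in patches.items()}
        # 3) Have the explainer LLM describe why each fix succeeded or failed.
        feedback = "\n\n".join(
            explain_llm(f"Task {tid}: tests {'passed' if ok else 'failed'}.\n"
                        f"Patch:\n{patches[tid]}\nExplain why.")
            for tid, ok in passed.items()
        )
        # 4) Meta-prompt an improved ruleset from the current rules and the English feedback.
        ruleset = meta_prompt_llm(ruleset, feedback)
        # 5) Write it back to .clinerules and repeat.
        Path(".clinerules").write_text(ruleset)
    return ruleset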
Results:

Sonnet 4.5 saw a modest +6% train and +0.7% test gain — already near saturation — while GPT-4.1 improved 14–15% on both, reaching near-Sonnet performance (34% vs 36%) through ruleset optimization alone in just two loops!

These results highlight how prompt optimization alone can deliver system-level gains — no retraining, no new tools, no architecture changes. In just two optimization loops, Prompt Learning closed much of the gap between GPT-4.1 and Sonnet 4.5-level performance, showing how fast and data-efficient instruction-level optimization can be.
We used Phoenix to run LLM Evals on Cline’s code and track experiments across optimization runs.
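For reference, a minimal sketch of what such an eval might look like with phoenix.evals. Treat the exact class and argument names (OpenAIModel, llm_classify, rails, provide_explanation) as assumptions to check against the Phoenix docs for your version; the template and example row are illustrative.

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify  # assumed import path; verify for your version

CORRECTNESS_TEMPLATE = """You are evaluating a code patch produced by a coding agent.
Task: {task}
Patch: {patch}
Answer with a single word, "correct" or "incorrect"."""

# Illustrative row; in practice this dataframe is built from Cline's recorded runs.
df = pd.DataFrame([{"task": "Fix the off-by-one error in pagination", "patch": "..."}])

evals_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),   # evaluator model; swap in your own
    template=CORRECTNESS_TEMPLATE,
    rails=["correct", "incorrect"],      # allowed labels
    provide_explanation=True,            # keep the English critique alongside the label
)
print(evals_df[["label", "explanation"]])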
Optimize Cline on SWE-Bench using Prompt Learning and see the improvement for yourself!

Code: github.com/Arize-ai/promp…
Use the Prompt Learning SDK: github.com/Arize-ai/promp…

Use Phoenix to run:
LLM Evals: arize.com/docs/phoenix/e…
Experiments: arize.com/docs/phoenix/d…
Tagging people who may be interested

@chengshuai_shi @ZhitingHu @HamelHusain @sh_reya @charlespacker @eugeneyan @swyx @dan_iter @sophiamyang @AndrewYNg @lateinteraction @cwolferesearch @tom_doerr @imjaredz @lennysan @shyamalanadkat @aakashg0 @apolloaievals @jerryjliu0 @joaomdmoura @jxnlco @DSPyOSS @abacaj @garrytan


More from @aparnadhinak

Jul 18
Reinforcement Learning in English – Prompt Learning Beyond just Optimization

@karpathy tweeted something this week that I think many of us have been feeling: the resurgence of RL is great, but it’s missing the big picture.

We believe the industry’s chase of traditional RL is going in the wrong direction. In chasing better policies and reward shaping, it’s easy to miss a simpler tool we already have: language.

Today, we’re releasing our first research on Prompt Learning — an approach that uses natural language feedback to guide and improve agents.

It’s not prompt tuning, chain-of-thought prompting, or DSPy Simba — though we love what the @DSPyOSS team is building.

Instead of adjusting weights, we use MetaPrompting — where English evals & critiques (rather than just the scalar metrics the industry has typically relied on) drive targeted prompt updates.

Tagging people who would find this interesting:
@chengshuai_shi @ZhitingHu @HamelHusain @sh_reya @charlespacker @eugeneyan @swyx @dan_iter @sophiamyang @AndrewYNg @lateinteraction @cwolferesearch @tom_doerr @imjaredz @lennysan @shyamalanadkat @aakashg0 @apolloaievals @jerryjliu0 @joaomdmoura @jxnlco @abacaj @garrytan
Prompts, like models, should improve with feedback — not stay static.

Here’s how prompt learning works:

1️⃣ The prompt is treated as an online object — something that evolves over time

2️⃣ An LLM (or human) provides an assessment plus a natural-language critique in English, rather than only a scalar score as in most prompt optimization methods

3️⃣ That natural language feedback is used as an error signal, passed into a MetaPrompt

4️⃣ The MetaPrompt updates the original prompt — either by rewriting it or inserting targeted instructions into specific sections

English feedback becomes the learning signal.
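A minimal sketch of how such a MetaPrompt might be assembled; the template wording and feedback fields here are illustrative assumptions, not the exact prompt used in our work.

META_PROMPT = """You are improving a prompt for an agent.

CURRENT PROMPT:
{current_prompt}

EXAMPLES WITH FEEDBACK (assessment + natural-language critique):
{feedback}

Using the critiques as your error signal, rewrite the prompt, or insert targeted
instructions into the relevant sections, so these failures do not recur.
Return only the revised prompt."""

def build_meta_prompt(current_prompt: str, feedback: list[dict]) -> str:
    # Each feedback item pairs a label with the English critique explaining it.
    rendered = "\n\n".join(
        f"- assessment: {f['label']}\n  critique: {f['critique']}" for f in feedback
    )
    return META_PROMPT.format(current_prompt=current_prompt, feedback=rendered)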
English Evaluation -> Explanation & Critique

We’ve backed English-first evals since we shipped LLM as a Judge in @arizephoenix (2023).
In this work, we complete the loop between evaluation and optimization—structured feedback becomes a mechanism not just for scoring, but for improving prompts.

Our system uses explanations, annotations, and rules as inputs to a MetaPrompt, refining prompts through natural language rather than weights.

Even a single round of English feedback can outperform scalar-only methods and beat baselines that already include the ruleset.

Text becomes the gradient. And the prompt learns.
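To illustrate the difference in signal, compare a scalar-only record with an English-first one (the field names and critique text below are made up for illustration):

scalar_only = {"example_id": 17, "score": 0.0}  # tells the optimizer that it failed, not why

english_feedback = {
    "example_id": 17,
    "label": "incorrect",
    "critique": (
        "The patch edited the right function but removed an existing null check, "
        "so two previously passing tests now fail. The rules should require running "
        "the repo's test suite before finalizing an edit."
    ),
}
# Only the second record tells the MetaPrompt what to change in the prompt.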
Jun 26, 2024
Does LLM as a Judge work for SQL-Gen?

Spoilers from our latest research - adding schema table definitions to the Eval prompt might be the deciding factor in helping it “judge” better… but don’t add the entire schema!

We took LLM as a Judge and used it to evaluate AI-generated SQL queries against golden queries on golden datasets.

🧵 below on how well LLM as a Judge catches SQL issues and where it goes wrong:
(2/7) How do you even evaluate SQL generation?

The @defogdata team put out some research on this and has even released their own LLM for converting natural-language questions into SQL queries.

One approach to evaluating SQL queries (as done by defog) is to create a set of golden queries on a golden dataset, then compare the resulting data.

✅ A golden dataset question is used for AI SQL generation
✅ The AI-generated SQL is run to produce test result “A”
✅ The golden dataset question has an associated golden query
✅ The golden query is run on the database and returns results “B”
✅ The AI-generated results “A” are compared with the golden results “B” (a minimal sketch of this comparison follows below)
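A minimal sketch of that comparison using sqlite3 and a toy schema (the table, queries, and data are illustrative, not defog's benchmark):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE citations (author TEXT, n_citations INTEGER);
    INSERT INTO citations VALUES ('smith', 120), ('li', 90), ('garcia', 0);
""")

golden_sql = "SELECT author, n_citations FROM citations WHERE n_citations > 0 ORDER BY author"
ai_sql = "SELECT author, n_citations FROM citations ORDER BY author"  # AI-generated candidate

golden_rows = conn.execute(golden_sql).fetchall()  # results "B"
ai_rows = conn.execute(ai_sql).fetchall()          # results "A"

# Exact-match comparison: pass only if the two result sets are identical.
print("match" if ai_rows == golden_rows else "mismatch")  # -> mismatch (extra zero-citation row)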
(3/7) What does exact data matching look like?

One approach @defogdata took to testing whether a SQL query is correct is to compare the returned data from the two queries exactly. The example here is a query on author citations. The results of this query differ in the number of authors returned and the citations by those authors.

Mismatch and fail!

It's not quite so easy: should you fail on the 0 bins? 🤔
Mar 29, 2024
(1/6) Can LLMs Do Time Series Analysis ⏲️? GPT-4 vs Claude 3 Opus 🥊

We have seen a lot of customers trying to apply LLMs to all kinds of data, but have not seen many Evals that show how well LLMs can analyze patterns in non-text data - especially time series 🕰️

Ex: Teams are launching GPT stock pickers 💸 without testing how good LLMs are at basic time series pattern analysis!

We set out to answer the following question: if we fed a large set of time series data into the context window, how well could the LLM detect anomalies or movements in that data 🤔?

AKA should you trust your money with a stock-picking GPT-4 or Claude 3 agent? Cut to the chase - the answer is NO 🚫!

We tested both GPT-4 and Claude on finding the anomalous time series patterns mixed in with normal time series. The goal is to find ONLY the anomalous ones (aka the ones with spikes).

Tagging relevant LLM Evals folks!
@rown @universeinanegg @ybisk @YejinChoinka
@allen_ai @haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @jerryjliu0 @mobav0 @lukaszkaiser @gdb @_akhaliq @JeffDean @demishassabis @jxnlco @OpenAI @AnthropicAI @GregKamradt @MiqJ @ArizePhoenix @arizeai

🧵 below shows results:
(2/6) Test Results

We tested a sweep of context windows of different lengths. In each test, we stuffed hundreds of time series 📈 into the window, each representing a metric graphed over time (in JSON) for a world city 🌐.

The LLMs are asked to detect movements or increases above a specific % threshold and name the “city” 🌐 and date 📅 where the anomaly was detected.

This is a pretty hard test - an LLM is required to look at patterns throughout its context window and detect anomalies across a large set of time series at the same time. It is also required to synthesize the results, name the time series and the date of the movement, and group by date. It’s also required to not produce false positives by talking about the other cities in the data.

The % movement calculation requires the LLM to do “math” 🔢 over the time series, which LLMs are generally not very good at. The test below shows the stark difference between Claude 3 Opus, Claude 3 Sonnet, and GPT-4 Turbo 🥊
(3/6) Setting Up a Time Series Test! 📈

We created a test suite that iterates through different context window sizes, generating a set of time series slots.

✅Created JSON formatted time series data with random noise
✅The noise can be 20-30% of the range
✅Tested % increases of single days of data that are above the noise level
✅Tested both extending the days of the anomaly and extending the % of increase
✅Tested a case where the math required is easier - we pre-calculate standard deviation
✅Tested a small number of anomaly events and a larger number of anomaly events

The image shown here is a view of what’s in the context window using @ArizePhoenix. (A minimal data-generation sketch follows below.)
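A hedged sketch of generating this kind of JSON time series data; the city names, noise level, and spike size below are illustrative, not the exact test suite:

import json
import random

def make_series(city: str, days: int = 30, base: float = 100.0,
                noise: float = 0.25, spike_day: int | None = None,
                spike_pct: float = 0.8) -> dict:
    values = []
    for d in range(days):
        v = base * (1 + random.uniform(-noise, noise))  # random noise, roughly 20-30% of the range
        if spike_day is not None and d == spike_day:
            v = base * (1 + spike_pct)                  # single-day anomaly well above the noise
        values.append({"date": f"2024-01-{d + 1:02d}", "value": round(v, 2)})
    return {"city": city, "metric": "requests", "series": values}

normal = [make_series(c) for c in ["Paris", "Lagos", "Osaka"]]
anomalous = [make_series("Denver", spike_day=21)]
context_window = json.dumps(normal + anomalous)  # this JSON is what gets stuffed into the prompt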
Mar 8, 2024
(1/5) 🚀💥Wow! With @AnthropicAI Opus Claude-3, A GPT-4 Competitor Has Arrived!!

🗓️March 14th 2023 was the launch of GPT-4! Nearly a year later, a true rival emerges.

🔍We tested Claude 3 Opus using Evals built on @ArizePhoenix. The @AnthropicAI team was not kidding, it’s really good 🔥 These are the most extensive Model Evals of Claude 3 in the ecosystem - they’re third-party, handcrafted evals that likely have not yet been overfit by the model providers.

📊 First, the Model Eval test: Retrieval with Generation. This is not a normal haystack retrieval test. This is an aggressive variation of haystack that requires:
✅Retrieving 2 random numbers placed in the same section of the corpus
✅ Rounding those numbers to 2 decimal places
✅Formatting as $
✅Calculating a percentage growth

This is exactly what you might be doing if you are, say, analyzing SEC filings. It is a real example that matches actual use cases.
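A hedged sketch of how such a test can be constructed and graded; the filler text, planted values, and the commented-out ask_llm call are illustrative stand-ins for the real harness:

def build_haystack(filler: str, value_a: float, value_b: float) -> str:
    # Plant both numbers in the same section of a long corpus.
    needle = (f"The Q1 revenue figure was {value_a} million dollars and "
              f"the Q2 revenue figure was {value_b} million dollars.")
    return filler * 200 + "\n" + needle + "\n" + filler * 200

def expected_answer(value_a: float, value_b: float) -> dict:
    growth = (value_b - value_a) / value_a * 100
    return {
        "q1": f"${round(value_a, 2)}M",   # rounded to 2 decimal places, formatted as $
        "q2": f"${round(value_b, 2)}M",
        "growth_pct": round(growth, 2),   # the percentage growth the model must compute
    }

prompt = build_haystack("Lorem ipsum dolor sit amet. ", 12.3456, 15.4321)
target = expected_answer(12.3456, 15.4321)
# response = ask_llm(prompt + "\nRetrieve both revenue figures, round them to 2 decimal "
#                    "places, format them as $, and compute the percentage growth.")  # ask_llm: hypothetical
print(target)  # grading compares the model's answer against these values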

📈Now the results! Claude 3 is a huge upgrade from Claude 2.1 and looks to be competitive with GPT-4. The mistakes in the model results were all very small rounding errors of typically around 1% in the percentage calculation. It's not a huge mistake relative to what we’ve seen for auto-regressive models.

@GregKamradt is working on a repeatable version of these Haystack tests that can be run more periodically. Excited to see this come to fruition.

Tagging relevant LLM Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai
@haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb @_akhaliq @JeffDean @demishassabis @jxnlco @OpenAI
(2/5) Test #2 : Retrieval with Generation Date Mapping

⭐We ran another retrieval-with-generation test 🧪 called date mapping. This Model Eval tests the ability of a model to retrieve data from the context window 🔎, manipulate it with a small addition ➕, and finally map it to a string format as a month.

The results were far better than @AnthropicAI Claude 2.1’s - we have a formidable new AI in Claude 3 🧠

What this test does is:
✅Retrieve 2 random numbers from the same area in the context window
✅Grab the last digit of one of those numbers that is 7 digits
✅Add 1 to it so the number is in the range from 1 to 11
✅Map that number to the month of the year
✅Concatenate it with a string representation of the other number which is a day
✅Show the final month:day together

The mistakes in both this test and the previous one are fairly small, and the models should really be viewed as comparable in this test area.
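For clarity, here is a hedged sketch of computing the expected answer from those steps (the planted numbers are made up; note that a last digit of 0-9 plus 1 gives a range of 1-10, slightly narrower than the 1-11 stated above):

import calendar

def expected_month_day(seven_digit_number: int, day_number: int) -> str:
    last_digit = seven_digit_number % 10           # grab the last digit of the 7-digit number
    month_index = last_digit + 1                   # add 1 so it lands in a valid month range
    month_name = calendar.month_name[month_index]  # map the number to a month of the year
    return f"{month_name}:{day_number}"            # concatenate month with the day value

print(expected_month_day(4831267, 14))  # -> "August:14"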
(3/5) How does Opus Claude 3 stack up as LLM as a Judge?

We have an additional suite of tests 🧪 that we’ve handcrafted to understand how well a model can be used for Evals - the LLM as a Judge ⚖️ use case. These Model Evals span a number of LLM task evals: span retrieval, hallucinations, Q&A, code functionality, and reference text analysis.

⭐️Where the @AnthropicAI Claude 2.1 model had fallen short of our expectations, Claude 3 shines.

When you dig into the individual examples where one model beats the other - @OpenAI GPT-4 vs Claude 3 - you will find the mistakes are really in the gray area.

Where we are landing: for now, on these tests, the two models are fairly comparable for LLM as a Judge.
Mar 5, 2024
(1/4) Gemini has been trending a lot on Twitter 🔥 We wanted to bring the conversation back to actual LLM eval results. Through a lot of testing, we have found Gemini to be a very solid model.

We recently made 2️⃣ updates to the Gemini Needle in a Haystack test 🪡 based on some notes from the Google team. The final results show a perfect haystack run, similar to @JeffDean’s results 💯

✅ Tokenizer: The tokenizer used was incorrect and threw off the results of the first test. Fixing this did not fix all the results, but it did improve them. This is our miss.
✅ Prompting: Matching the prompt to @AnthropicAI’s gave Gemini its best results yet - a perfect execution simply by using the Anthropic prompt addition.

All evals run using @ArizePhoenix. Tagging relevant Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb
(2/4) Prompts Matter!!! ✨

There are 2⃣ prompts used here on the haystack test that gave fairly different results.

Our first prompt has not changed a lot since the first release, but I want to note we did not iterate on it to specifically improve retrieval.

We tested the @AnthropicAI addition we made to the prompt on @Google Gemini and it drastically improved the results 🔥

Here is the modified prompt:

"""

{context}


{question} Don't give information outside the document or repeat your findings.
Here is the magic number from the context:
"""

For more details on the prompt, check out the Github repo: github.com/Arize-ai/LLMTe…
(3/4) The tokenizer used in the Haystack tests was not the original T5 tokenizer.

This can create big problems in mapping between insertion phrases and actual content. It was one reason for the very poor showing in the initial tests.

Great thread unfolding by @karpathy on Tokenizers here:

All things said, @GoogleAI Gemini is a solid model 🦾 and I think people are missing some of this on Twitter. We will continue to test Gemini against Evals and show off results, whether good or bad.

@Google, please give us access to Ultra!
Jan 31, 2024
(1/8) LLM Model Evals 💪vs LLM Task Evals 🥊

Evals are all the rage 🔥, but they mean different things to different people.

The biggest confusion is that there are actually 2 different categories of evals.
1⃣Model evals (ex: HellaSwag, MMLU, TruthfulQA etc)
2⃣Task evals (ex: Q&A from Phoenix Evals: )

Model Evals vs Task Evals is the difference between measuring "generalized fitness" 💪 and "specialized fitness" 🥊

Most of us would like to have generalized fitness because it allows us to do a variety of everyday activities well. But if sumo wrestling was your dream, you would obviously prefer to have a much larger body mass.

The problem is, most practitioners today are focusing on generalized fitness and getting crushed in the ring ☠️

🧵 on the differences

Tagging folks working on the LLM Model or Task Eval space!

@rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser github.com/Arize-ai/phoen…
(2/8) What's a Model Eval? 🤔

If you are an LLM application builder (the vast majority of us), you are wasting your time looking at model evals ❌ Model Evals are really for people who are building or fine-tuning an LLM.

The only reason you as an LLM application builder would look at them is to choose the best model to use in your system ⚖️

OK, so what is it? A Model Eval is a set of questions you ask a model, with a set of ground-truth answers you use to “grade” the responses.

The @huggingface Open LLM Leaderboard lists results for many common Model Evals, including HellaSwag, MMLU, TruthfulQA, GSM8K, etc.:


If you want to run a Model Eval harness yourself, you can check out libraries such as @AiEleuther or @OpenAI.

huggingface.co/spaces/Hugging…
github.com/EleutherAI/lm-…
github.com/openai/evals
(3/8) Examples of Model Evals:

In a Model Eval test set, every question is different, but many test sets have a theme or skill they are probing.

💎HellaSwag is a set of sentence completions that require a model to infer what might happen next in a specific scenario.
Example from HellaSwag:
A tray of potatoes is loaded into the oven and removed. A large tray of cake is flipped over and placed on counter. a large tray of meat

A. is placed onto a baked potato
B. ls, and pickles are placed in the oven
C. is prepared then it is removed from the oven by a helper when done.

🔡MMLU is a broad set of questions from 57 different subjects. This one is from the College Physics section.

Example from MMLU
For which of the following thermodynamic processes is the increase in the internal energy of an ideal gas equal to the heat added to the gas?

A. Constant Temperature
B. Constant Volume
C. Constant Pressure
D. Adiabatic

These questions look tough ... BUT the dirty secret among the community is that there is likely a lot of data leakage and gaming of these public Model Evals 🤫
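For orientation, here is a minimal sketch of how a multiple-choice Model Eval is graded: the model's chosen letter is compared against a ground-truth answer key. The single example row reuses the MMLU question above (constant volume is the standard answer); real harnesses such as EleutherAI's lm-evaluation-harness do this at scale.

test_set = [
    {
        "question": ("For which of the following thermodynamic processes is the increase in "
                     "the internal energy of an ideal gas equal to the heat added to the gas?"),
        "choices": {"A": "Constant Temperature", "B": "Constant Volume",
                    "C": "Constant Pressure", "D": "Adiabatic"},
        "answer": "B",
    },
]

def grade(model_answers: dict[int, str]) -> float:
    # Accuracy = fraction of questions where the model's letter matches the answer key.
    correct = sum(
        model_answers.get(i, "").strip().upper() == example["answer"]
        for i, example in enumerate(test_set)
    )
    return correct / len(test_set)

print(grade({0: "B"}))  # -> 1.0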
