Aparna Dhinakaran
Mar 8 · 5 tweets · 4 min read
(1/5) 🚀💥Wow! With @AnthropicAI Claude 3 Opus, a GPT-4 Competitor Has Arrived!!

🗓️GPT-4 launched on March 14th, 2023. Nearly a year later, a true rival emerges.

🔍We tested Claude 3 Opus using Evals built on @ArizePhoenix. The @AnthropicAI team was not kidding, it’s really good 🔥 These are the most extensive Model Evals of Claude 3 in the ecosystem - they're 3rd-party, handcrafted evals that likely have not yet been overfit by the model providers.

📊 First, the Model Eval test: Retrieval with Generation. This is not a normal haystack retrieval test. This is an aggressive variation of haystack that requires:
✅ Retrieving 2 random numbers placed in the same section of the corpus
✅ Rounding those numbers to 2 decimal places
✅ Formatting them as $ amounts
✅ Calculating a percentage growth

This is exactly what you might be doing if you are, say, analyzing SEC filings - a real example that matches actual use cases.
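The steps above can be sketched as a small grading helper. This is our own illustrative sketch, not the actual harness - the function name, output keys, and formatting choices are all assumptions:

```python
# Hypothetical sketch of computing the expected answer for this eval:
# round two retrieved numbers, format as dollars, compute % growth.
# Names and output shape are our assumptions, not the real harness.

def expected_answer(first: float, second: float) -> dict:
    """Round, dollar-format, and compute percent growth for two retrieved numbers."""
    a, b = round(first, 2), round(second, 2)
    growth = (b - a) / a * 100  # percentage growth from first to second
    return {
        "first": f"${a:,.2f}",
        "second": f"${b:,.2f}",
        "growth_pct": round(growth, 2),
    }

print(expected_answer(1234.5678, 1480.25))
```

The model's free-text answer is then compared against these expected values.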

📈Now the results! Claude 3 is a huge upgrade from Claude 2.1 and looks competitive with GPT-4. The mistakes in the model results were all very small rounding errors, typically around 1% in the percentage calculation - not huge relative to what we’ve seen from auto-regressive models.
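Given mistakes of roughly 1% in the percentage calculation, a grader might accept answers within a tolerance band. A minimal sketch, assuming a 1-point absolute tolerance (our assumption, not the published grading rule):

```python
# Minimal grading sketch: accept the model's percentage if it is within
# a 1-point absolute tolerance of the true value, matching the small
# rounding errors described above. The tolerance value is our assumption.

def grade_growth(model_pct: float, true_pct: float, tol: float = 1.0) -> bool:
    """Return True if the model's percentage is within tol of the truth."""
    return abs(model_pct - true_pct) <= tol

print(grade_growth(19.88, 19.90))  # small rounding error -> accepted
```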

@GregKamradt is working on a repeatable version of these Haystack tests that can be run more periodically. Excited to see this come to fruition.

Tagging relevant LLM Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai
@haileysch__ @lintangsutawika @DanHendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb @_akhaliq @JeffDean @demishassabis @jxnlco @OpenAIImage
(2/5) Test #2: Retrieval with Generation - Date Mapping

⭐We ran another retrieval-with-generation test 🧪 called date mapping. This Model Eval tests the ability of a model to retrieve data from the context window 🔎, manipulate it with a small addition ➕, and finally do an internal mapping to a string format as a month.

The results were far better than @AnthropicAI Claude 2.1's - we have a formidable new AI in Claude 3 🧠

What this test does is:
✅ Retrieve 2 random numbers from the same area of the context window
✅ Grab the last digit of one of those numbers (a 7-digit number)
✅ Add 1 to it, so the result falls in the range 1 to 10
✅ Map that number to a month of the year
✅ Concatenate it with a string representation of the other number, which is a day
✅ Show the final month:day together
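The expected month:day answer for these steps can be sketched as follows - a hypothetical reconstruction, assuming a "MonthName:day" output format (the exact format string is our guess):

```python
import calendar

# Sketch of the expected answer: take the last digit of the 7-digit
# number, add 1, map to a month name, and join with the day number.
# The "Month:day" output format is our assumption.

def expected_month_day(seven_digit: int, day_number: int) -> str:
    month_index = seven_digit % 10 + 1          # last digit shifted into 1..10
    month_name = calendar.month_name[month_index]
    return f"{month_name}:{day_number}"

print(expected_month_day(4839217, 12))  # last digit 7 + 1 = 8 -> "August:12"
```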

The mistakes in both this test and the previous one are fairly small, and the models really should be viewed as comparable in this test area.
(3/5) How does Claude 3 Opus stack up as LLM as a Judge?

We have an additional suite of tests 🧪 that we’ve handcrafted to understand how well a model can be used for Evals - the LLM-as-a-judge ⚖️ use case. These Model Evals span a number of LLM task evals: span retrieval, hallucinations, Q&A, code functionality, and reference text analysis.

⭐️Where the @AnthropicAI Claude 2.1 model had fallen short of our expectations, Claude 3 shines.

When you dig into the individual examples where @OpenAI GPT-4 beats Claude 3 or vice versa, you will find the mistakes are really in the gray area.

Where we are landing: for now, on these tests, the two models are fairly comparable for LLM as a Judge.
(4/5) Observation: Claude 3 did do worse on the Original Haystack Test...

🤔One strange thing we noticed: on the simpler original haystack retrieval, Claude 3 did worse than Claude 2. This seemed strange to us, given the retrieval-with-generation tests are actually harder, requiring a retrieval plus some cognitive “work”.

The Original Haystack test does the following:
✅ Retrieve a random number from the corpus
✅ Check the random number matches

The results are strange enough that we want to allow for the possibility that we made some mistake. We’ve dug in and will continue to quadruple-confirm the results are correct.
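A minimal harness for this simple test might look like the sketch below - the function names, needle wording, and placement logic are all our own assumptions, not the actual test code:

```python
# Toy needle-in-a-haystack harness sketch (names/wording are our own):
# plant a random number in filler text, then check the model's answer.

def build_haystack(filler: str, needle_value: int, position: int) -> str:
    """Insert a needle sentence containing the value at a character offset."""
    needle = f"The special magic number is {needle_value}. "
    return filler[:position] + needle + filler[position:]

def check_retrieval(model_answer: str, needle_value: int) -> bool:
    """Pass if the model's answer contains the planted number."""
    return str(needle_value) in model_answer

corpus = build_haystack("lorem ipsum " * 500, 4271, position=3000)
print(check_retrieval("The magic number is 4271", 4271))  # True
```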
(5/5) Open Areas of Research (Instruction Following)

One area that is hard to capture in these tests is how good a model is at instruction following - how much work you have to do to get a model to do something. The not-so-secret approach to getting models to follow instructions is repetition in the prompt. Repetition likely helps by doubling up emphasis, in latent attention space, on the thing you want the model to do - averaging more of the same important embeddings together.
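Repetition of instruction is simple to apply: state the instruction once up front and once again after the context. The wording below is purely illustrative (our own example prompt, not one used in these tests):

```python
# Illustrative prompt template showing "repetition of instruction":
# the same instruction appears before AND after the context block.
# The wording here is our own example, not the prompt used in the evals.

instructions = "Answer ONLY with the dollar amounts rounded to 2 decimal places."

prompt = f"""{instructions}

Context:
{{context}}

Question: What were the two amounts and the percentage growth?

Remember: {instructions}"""

print(prompt)
```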

I’m surprised “repetition of instruction” isn’t as well known a technique as, say, CoT (chain of thought), which we also used here to help with arithmetic. In our day-to-day prompting with customers, we use repetition a lot.

In these original Haystack tests, we used the prompting approaches suggested by the model providers, which do include repetition. Our original simple prompt did not use it; GPT-4 still outperforms on this one, not requiring quite as much work to get the same results.

All that said, this area of analysis on “ease of instruction following” could probably be its own thread of research.


More from @aparnadhinak

Jan 19
(1/9) LLM as a Judge: Numeric Score Evals are Broken!!!

LLM Evals are valuable analysis tools. But should you use numeric scores or classes as outputs? 🤔

TLDR: LLMs suck at continuous ranges ☠️ - use LLM classification evals instead! 🔤

An LLM Score Eval uses an LLM to judge the quality of an LLM output (such as summarization) and outputs a numeric score. Examples of this include OpenAI cookbooks.

In the example here, we ran a spelling eval to evaluate how many words in a document have a spelling error and give a score between 0 and 10.

✅If every word had a spelling error, we’d expect a score of 10.
✅If no words had a spelling error, we’d expect a score of 0.
✅Everything in the middle would fall into the range, with a higher percentage of spelling errors landing closer to 10 and a lower percentage landing closer to 0.
✅Our expectation: the continuous score value should have some clear connection to the quantity of spelling errors in the paragraph.

‼️Our results however did not show that the score value had a consistent meaning. ‼️

In the example below, we have a paragraph where 80% of words have spelling errors, yet it scored a 10. We also have a paragraph with only 10% of words having a spelling error - with the same score of 10!

🧵 Below is a rigorous study of how well LLMs handle continuous numeric ranges for LLM Evals: cookbook.openai.com/examples/evalu…
(2/9) What we did:

We inserted a quantifiable level of errors into a document. We asked an LLM to generate a score based on the % of words that had spelling errors.

The results are not great: almost 80% of the quantitative range is scored as the same number. Each point is a median of multiple random experiments at the same % corruption level.
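A minimal sketch of this setup - assuming a swap-two-letters corruption and a linear mapping from corruption fraction to expected score (both are our assumptions about the method, not the published code):

```python
import random

# Sketch: corrupt a controlled fraction of words, and compute the score
# we would EXPECT a well-calibrated judge to give. The corruption style
# (swap first two letters) and linear score mapping are our assumptions.

def corrupt(text: str, frac: float, seed: int = 0) -> str:
    """Introduce spelling errors in roughly `frac` of the words."""
    rng = random.Random(seed)
    words = text.split()
    n_bad = int(len(words) * frac)
    for i in rng.sample(range(len(words)), n_bad):
        w = words[i]
        if len(w) > 1:
            words[i] = w[1] + w[0] + w[2:]  # swap first two letters
    return " ".join(words)

def expected_score(frac: float) -> int:
    """Linear mapping: 0% errors -> 0, 100% errors -> 10."""
    return round(frac * 10)

print(expected_score(0.8), expected_score(0.1))  # 8 vs 1 - not both 10
```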
(3/9) We ran 3 variations of the test with similar results.

Each variation had a different type of corruption:
1⃣ Spelling errors
2⃣ Sadness - where we added sadness qualifiers to sentences
3⃣ Frustration - where we added frustration qualifiers to sentences

We tested different context window sizes and the results were consistent. The results shown here are all on a 5k paragraph, run on @OpenAI GPT-4.
