Matt Shumer Profile picture
Sep 5 9 tweets 3 min read Read on X
I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week - we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️:Image
Reflection 70B holds its own against even the top closed-source models (Claude 3.5 Sonnet, GPT-4o).

It’s the top LLM in (at least) MMLU, MATH, IFEval, GSM8K.

Beats GPT-4o on every benchmark tested.

It clobbers Llama 3.1 405B. It’s not even close. Image
The technique that drives Reflection 70B is simple, but very powerful.

Current LLMs have a tendency to hallucinate, and can’t recognize when they do so.

Reflection-Tuning enables LLMs to recognize their mistakes, and then correct them before committing to an answer. Image
Additionally, we separate planning into a separate step, improving CoT potency and keeping the outputs simple and concise for end users. Image
Important to note: We have checked for decontamination against all benchmarks mentioned using @lmsysorg's LLM Decontaminator.
The weights of our 70B model are available today on @huggingface here:

@hyperbolic_labs API available later today.

Next week, we will release the weights of Reflection-405B, along with a short report going into more detail on our process and findings.huggingface.co/mattshumer/Ref…
Most importantly, a huge shoutout to @csahil28 and @GlaiveAI.

I’ve been noodling on this idea for months, and finally decided to pull the trigger a few weeks ago. I reached out to Sahil and the data was generated within hours.

If you’re training models, check Glaive out.
This model is quite fun to use and insanely powerful.

Please check it out — with the right prompting, it’s an absolute beast for many use-cases.

Demo here: …-playground-production.up.railway.app
405B is coming next week, and we expect it to outperform Sonnet and GPT-4o by a wide margin.

But this is just the start. I have a few more tricks up my sleeve.

I’ll continue to work with @csahil28 to release even better LLMs that make this one look like a toy.

Stay tuned.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Matt Shumer

Matt Shumer Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @mattshumer_

Jul 26
Introducing `llama-405b-to-8b` ✍️

Get the quality of Llama 3.1 405B, at a fraction of the cost and latency.

Give one example of your task, and 405B will teach 8B (~30x cheaper!!) how to do the task perfectly.

And it's open-source: github.com/mshumer/gpt-pr…
This was made in partnership with @OctoAICloud — particularly Ben Hamm, who adapted my existing prompt optimization tools to take advantage of the new Llama 3.1 models.
This approach was inspired by this tweet that went viral months ago.

I discovered that if you prompt Haiku w/ Opus-generated examples, it can match Opus' quality.

Now, we have even better 'teacher' models than Opus, and cheaper 'student' models than Haiku.
Read 8 tweets
Jul 22
Introducing `claude-sonnet-to-gpt-4o-mini` ✍️

Get the quality of Claude 3.5 Sonnet, at a fraction of the cost and latency.

Give one example of your task, and Sonnet will teach 4o-mini (20x cheaper!!) how to do the task perfectly.

And it's open-source: shorturl.at/Cjjwt
This repo was inspired by this tweet that went viral months ago.

I discovered that if you prompt Haiku w/ Opus-generated examples, it can match Opus' quality.

Now, we have even better 'teacher' models than Opus, and cheaper 'student' models than Haiku.

In production, Claude 3.5 Sonnet-level AI quality at a low cost, with near-instant results, is a game changer.

This notebook makes it possible for anyone to implement this quickly.

So how does it work?
Read 7 tweets
Apr 10
Introducing `gemini-youtube-researcher` 📈

An open-source Gemini 1.5 Pro agent that LISTENS to videos and delivers topical reports.

Just provide a topic, and a chain of AIs with access to YouTube will analyze relevant videos and generate a comprehensive report for you.
This uses the new Gemini 1.5 Pro API that was released today.

It currently only supports listening to the audio content of videos. If anyone wants, please feel free to add support for video frames as well.
How it works, in a nutshell:
- User provides a topic
- SERPAPI gathers relevant YouTube links
- A separate Gemini 1.5 instance listens to + summarizes each video
- A final Gemini instance takes in all of the summaries, and generates a final, comprehensive report
Read 4 tweets
Apr 8
Open-sourcing `AI-Oracle`.

Generates better responses than Claude 3 Opus.

A very simple approach that combines the abilities of Claude 3, GPT-4, and Perplexity to provide better results than any could provide on their own.

Seriously -- it's dumb simple.

Notebook in thread:
How does it work?

The process is super simple. We simply query each model individually:
- Claude 3 Opus for reasoning + personality
- GPT-4 for reasoning
- PPLX for freshness/up-to-date info

Then, Claude combines the strengths of each and responds with a final, ideal output.
It's not perfect, but on average, it should improve results significantly compared to using models individually.

If anyone wants to improve it, there a lot of gains to be made by adding context about the strengths/weaknesses of each model in the final prompt.
Read 4 tweets
Apr 5
Introducing `claude-researcher` 📈

A powerful Claude 3 research agent that delivers thorough reports in record time.

Just provide an topic, and a chain of AIs with **access to Google** will generate an incredibly comprehensive report for you.

And it's open-source!
`claude-researcher` is a constrained agent -- meaning its behavior is highly-controlled, leading to better results than open-ended agents.

It chains together lots of Claude 3 calls (and Google access) that work together to create a detailed report on a topic of your choice.
How it works, in a nutshell:
- User provides a topic
- Claude breaks it into sub-topics
- An agent with access to Google builds a report for each sub-topic
- A final Claude instance takes in all of the sub-topic reports, and generates a final, comprehensive report
Read 6 tweets
Apr 3
Introducing `Claude-Author` 📕✍️

One prompt -> an entire novel!

Just describe the high-level details, and a chain of AI systems will write an entire book for you in minutes.

- complete w/ cover art
- packages your book as a real e-book

And it's open-source!
Previous AI book-writing systems produced mildly interesting books that were filled with errors and quite boring.

Claude-Author is the first AI system that actually produces readable books.

Still not perfect, but it's a leaps and bounds improvement over previous approaches.
Claude-Author is a constrained agent -- meaning its behavior is highly-controlled, leading to better results than open-ended agents.

It's a chain of many Claude 3 Haiku/Opus and Stable Diffusion API calls that work together to write a coherent novel.
Read 7 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(