I'm excited to announce Reflection 70B, the world’s top open-source model.
Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.
405B coming next week - we expect it to be the best model in the world.
Built w/ @GlaiveAI.
Read on ⬇️:
Reflection 70B holds its own against even the top closed-source models (Claude 3.5 Sonnet, GPT-4o).
It’s the top LLM in (at least) MMLU, MATH, IFEval, GSM8K.
Beats GPT-4o on every benchmark tested.
It clobbers Llama 3.1 405B. It’s not even close.
The technique that drives Reflection 70B is simple, but very powerful.
Current LLMs have a tendency to hallucinate, and can’t recognize when they do so.
Reflection-Tuning enables LLMs to recognize their mistakes, and then correct them before committing to an answer.
Additionally, we separate planning into a separate step, improving CoT potency and keeping the outputs simple and concise for end users.
Important to note: We have checked for decontamination against all benchmarks mentioned using @lmsysorg's LLM Decontaminator.
The weights of our 70B model are available today on @huggingface here:
@hyperbolic_labs API available later today.
Next week, we will release the weights of Reflection-405B, along with a short report going into more detail on our process and findings.huggingface.co/mattshumer/Ref…
Most importantly, a huge shoutout to @csahil28 and @GlaiveAI.
I’ve been noodling on this idea for months, and finally decided to pull the trigger a few weeks ago. I reached out to Sahil and the data was generated within hours.
If you’re training models, check Glaive out.
This model is quite fun to use and insanely powerful.
Please check it out — with the right prompting, it’s an absolute beast for many use-cases.
This was made in partnership with @OctoAICloud — particularly Ben Hamm, who adapted my existing prompt optimization tools to take advantage of the new Llama 3.1 models.
This approach was inspired by this tweet that went viral months ago.
I discovered that if you prompt Haiku w/ Opus-generated examples, it can match Opus' quality.
Now, we have even better 'teacher' models than Opus, and cheaper 'student' models than Haiku.
An open-source Gemini 1.5 Pro agent that LISTENS to videos and delivers topical reports.
Just provide a topic, and a chain of AIs with access to YouTube will analyze relevant videos and generate a comprehensive report for you.
This uses the new Gemini 1.5 Pro API that was released today.
It currently only supports listening to the audio content of videos. If anyone wants, please feel free to add support for video frames as well.
How it works, in a nutshell:
- User provides a topic
- SERPAPI gathers relevant YouTube links
- A separate Gemini 1.5 instance listens to + summarizes each video
- A final Gemini instance takes in all of the summaries, and generates a final, comprehensive report
A very simple approach that combines the abilities of Claude 3, GPT-4, and Perplexity to provide better results than any could provide on their own.
Seriously -- it's dumb simple.
Notebook in thread:
How does it work?
The process is super simple. We simply query each model individually:
- Claude 3 Opus for reasoning + personality
- GPT-4 for reasoning
- PPLX for freshness/up-to-date info
Then, Claude combines the strengths of each and responds with a final, ideal output.
It's not perfect, but on average, it should improve results significantly compared to using models individually.
If anyone wants to improve it, there a lot of gains to be made by adding context about the strengths/weaknesses of each model in the final prompt.
A powerful Claude 3 research agent that delivers thorough reports in record time.
Just provide an topic, and a chain of AIs with **access to Google** will generate an incredibly comprehensive report for you.
And it's open-source!
`claude-researcher` is a constrained agent -- meaning its behavior is highly-controlled, leading to better results than open-ended agents.
It chains together lots of Claude 3 calls (and Google access) that work together to create a detailed report on a topic of your choice.
How it works, in a nutshell:
- User provides a topic
- Claude breaks it into sub-topics
- An agent with access to Google builds a report for each sub-topic
- A final Claude instance takes in all of the sub-topic reports, and generates a final, comprehensive report