elvis Profile picture
Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, Papers with Code, PhD • I build and teach about LLMs & AI Agents ⬇️
21 subscribers
Feb 18 23 tweets 7 min read
BREAKING: xAI announces Grok 3

Here is everything you need to know:

Elon mentioned that Grok 3 is an order of magnitude more capable than Grok 2.
Feb 15 8 tweets 2 min read
Introducing... Agent Leaderboard!

Many devs ask me which LLMs work best for AI agents.

The new Agent Leaderboard (by @rungalileo) was built to provide insights and evaluate LLMs on real-world tool-calling tasks—crucial for building AI agents.

Let's go over the results:

1️⃣ Leader

After evaluating 17 leading LLMs across 14 diverse datasets, here are the key findings:

Google's Gemini-2.0-flash leads with a 0.94 score at a remarkably low cost.
Jan 23 16 tweets 4 min read
OpenAI Introduces Operator & Agents!

Here is everything you need to know:

Operator is a system that can use a web browser to accomplish tasks.

Operator can look at a webpage and interact with it by typing, clicking, and scrolling.

It's available as a research preview in the US for Pro users, with access for Plus users coming later.
Jan 21 4 tweets 2 min read
Goodbye web scrapers!

Say hello to /extract by @firecrawl_dev

Just write a prompt and get the web data you need!

It doesn’t get any simpler than this: provide a prompt and a schema to the /extract endpoint and retrieve any data you need from a website.

I’ve added /* to the URL to find and extract information across the entire website.

The endpoint can return up to thousands of data points at once.
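The flow above can be sketched roughly as follows. This is an illustrative assumption, not Firecrawl's actual client code: the request shape, field names, and the `/*` wildcard convention are taken from the description above, so check @firecrawl_dev's docs for the real API.

```python
# Hypothetical sketch of assembling a Firecrawl /extract request.
# Field names ("urls", "prompt", "schema") are assumptions based on the
# thread's description, not a verified API contract.
import json


def build_extract_request(url: str, prompt: str, schema: dict) -> dict:
    """Assemble the JSON body for an /extract call."""
    return {
        # the /* suffix asks the service to crawl the entire website
        "urls": [url.rstrip("/") + "/*"],
        "prompt": prompt,
        "schema": schema,
    }


payload = build_extract_request(
    "https://example.com",
    "Extract the name and price of every product listed on this site.",
    {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(json.dumps(payload["urls"]))  # ["https://example.com/*"]
```

The schema is what turns a free-form prompt into structured, typed output you can feed straight into a pipeline.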
Jan 20 4 tweets 2 min read
The DeepSeek-R1 paper is a gem!

Highly encourage everyone to read it.

It's clear that LLM reasoning capabilities can be learned in different ways.

RL, if applied correctly and at scale, can lead to some really powerful and interesting scaling and emergent properties.

There is more to RL than meets the eye!

Here is my breakdown of the paper along with a few tests: youtu.be/3GlFd3doO3U?si…

The multi-stage training might not make sense initially, but it provides clues on optimizations that we can continue to tap into.

Data quality is still very important for enhancing the usability of the LLM.

Unlike other reasoning LLMs, DeepSeek-R1's training recipe and weights are open so we can build on top of it. This opens up exciting research opportunities.

About the attached clip: the previous preview model wasn't able to solve this task. DeepSeek-R1 can solve this and many other tasks that o1 can solve. It's a very good model for coding and math. When DeepSeek said "on par with OpenAI-o1" I thought they were just hyping. But based on my tests, it's clearly not so.

Wanted to add that DeepSeek-R1 got all of the hard tasks from the OpenAI LLM reasoning blog post correct for me. This is wild and totally unexpected! The only task it failed (the crossword puzzle) is one that o1 also fails.
Jan 8 14 tweets 4 min read
Agents Overview

Great write-up on Agents by Chip.

Here are my takeaways:

🤖 Agents Overview

An AI agent is made up of both the environment it operates in (e.g., a game, the internet, or computer system) and the set of actions it can perform through its available tools. This dual definition is fundamental to understanding how agents work.
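That dual definition can be sketched as a minimal loop: the agent's capabilities are literally the set of tools it can invoke. Everything below is a hypothetical illustration, not code from Chip's write-up:

```python
# Minimal illustration of the agent definition above: an agent is defined
# by its environment plus the set of actions (tools) available to it.
# All tool names here are hypothetical stand-ins.

def search(query: str) -> str:
    """Stand-in tool: pretend web search."""
    return f"results for '{query}'"


def calculator(expr: str) -> str:
    """Stand-in tool: arithmetic on a trusted expression."""
    return str(eval(expr))  # toy example only; never eval untrusted input


TOOLS = {"search": search, "calculator": calculator}


def agent_step(action: str, argument: str) -> str:
    """One step: the agent selects an action and applies it to the environment."""
    if action not in TOOLS:
        return f"unknown tool: {action}"
    return TOOLS[action](argument)


print(agent_step("calculator", "2 + 3"))  # 5
```

The point of the sketch: adding or removing an entry in `TOOLS` literally changes what the agent *is*, which is why the environment-plus-actions framing is so useful.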
Jan 6 13 tweets 5 min read
Google recently published this great whitepaper on Agents.

2025 is going to be a huge year for AI Agents.

Here's what's included:

- Introduction to AI Agents
- The role of tools in Agents
- Enhancing model performance with targeted learning
- Quick start to Agents with LangChain
- Production applications with Vertex AI Agents

Great place to start learning about AI Agents.

kaggle.com/whitepaper-age…
Dec 17, 2024 14 tweets 3 min read
Summary of today's OpenAI announcement (Day 9):

- o1 is launching out of preview in the API
- support for function calling, structured output, and developer messages
- reasoning_effort parameter to tell the model how much effort to spend on thinking
- vision inputs in the API are here too

Visual inputs with a developer message (a new spin on the system message for better steering the model) inside the OpenAI Playground.
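A request using the new features might look like this. The `reasoning_effort` parameter and the developer role come from the announcement, but the exact request shape here is an assumption; check OpenAI's API reference before relying on it:

```python
# Sketch of an o1 API request body using the announced features.
# "reasoning_effort" and the "developer" role are from the announcement;
# the surrounding structure follows the usual Chat Completions format
# and is assumed, not verified.
request_body = {
    "model": "o1",
    "reasoning_effort": "high",  # how much effort to spend on thinking
    "messages": [
        # developer messages replace system messages for steering the model
        {"role": "developer", "content": "Answer with a single number."},
        {"role": "user", "content": "How many primes are below 10?"},
    ],
}

print(request_body["reasoning_effort"])  # high
```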
Dec 6, 2024 8 tweets 2 min read
Summary of today's OpenAI announcement:

- introduces reinforcement fine-tuning (RFT) of o1
- tune o1 to learn to reason in new ways in custom domains
- RFT is better and more efficient than regular fine-tuning; needs just a few examples

1/n
How it looks in the dev platform. Examples show how to select RFT on o1-mini.
Jul 18, 2024 7 tweets 2 min read
That's right! It's a huge week for small language models (SLMs)

A few new SLMs on my radar:

Mistral NeMo

Highlights:
- Introduced by Mistral + NVIDIA
- Apache 2.0 license
- outperforms Gemma 2 9B and Llama 3 8B
- multilingual capabilities
- efficient tokenizer (Tekken)

Feb 21, 2024 4 tweets 3 min read
JUST IN: Google DeepMind releases Gemma, a series of open models inspired by the same research and tech used for Gemini.

Open models fit various use cases, so this is a very smart move from Google.

Great to see that Google recognizes the importance of openness in AI science and technology.

There are 2B (trained on 2T tokens) and 7B (trained on 6T tokens) models including base and instruction tuned versions. Trained on a context length of 8192 tokens.

Commercial use is allowed.

These are not multimodal models but based on the reported experimental results they appear to outperform Llama 2 7B and Mistral 7B.

I am excited about those MATH, HumanEval, GSM8K, and AGIEval results. These are really incredible results for a model this size.

Excited to dive deeper into these models. The model prompting guide is dropping soon. Stay tuned!

Blog: blog.google/technology/dev…

Technical report: storage.googleapis.com/deepmind-media…
Dec 6, 2023 16 tweets 7 min read
Gemini is here!

Google DeepMind just announced Gemini, their largest and most capable AI model.

A short summary of all you need to know:

1) What it is - Built with multimodal support from the ground up. Remarkable multimodal reasoning capabilities across text, images, video, audio, and code. Nano, Pro, and Ultra models are available to support different scenarios, from efficiency at scale to complex capabilities.

2) Performance - The results on the standard benchmarks (MMLU, HumanEval, Big-Bench-Hard, etc.) show improvement compared to GPT-4 (though not by a lot). Still very impressive!

3) Outperforming human experts - They claim that Gemini is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), a popular benchmark to test the knowledge and problem-solving abilities of AI models.

4) Capabilities - Gemini surpasses SOTA performance on a bunch of multimodal tasks like infographic understanding and mathematical reasoning in visual contexts. There was a lot of focus on multimodal reasoning capabilities with the ability to analyze documents and uncover knowledge that's hard to discern. The model capabilities reported are multimodality, multilinguality, factuality, summarization, math/science, long-context, reasoning, and more. It's probably one of the most capable models by the looks of it.

5) Trying it out - Apparently, a fine-tuned Gemini Pro is available to use via Bard. Can't wait to experiment with this soon.

6) Availability - Models will be made available for devs on Google AI Studio and Google Cloud Vertex AI by Dec 13th.

blog:

technical report:
Here is the model verifying a student's solution to a physics problem. Huge implications in education. Will be taking a very close look at applications here.
Aug 2, 2023 5 tweets 2 min read
You can now connect Jupyter with LLMs!

Jupyter AI provides a chat-based AI assistant within the Jupyter environment that lets you generate code, summarize content, create comments, fix errors, etc.

You can even generate entire notebooks using text prompts!

You can also pass it…
Official announcement: blog.jupyter.org/generative-ai-…
Jun 25, 2023 8 tweets 2 min read
How can you build your own custom ChatGPT-like system on your data?

This isn't easy, as it can require complex architectures and pipelines.

Given the high demand, I started to explore the ChatLLM feature by @abacusai.

I’m very impressed! Let's take a look at how it works:

Everyone has a knowledge base or data sitting around, like wiki pages, documentation, customer tickets, etc.

With ChatLLM you can quickly create a chat app, like ChatGPT, that helps you discover and answer questions about your data.
Jun 22, 2023 7 tweets 3 min read
MosaicML just released MPT-30B!

The previous model they released was 7B. MPT-30B is an open-source model licensed for commercial use that is more powerful than MPT-7B.

8K context and 2 fine-tuned variants: MPT-30B-Instruct and MPT-30B-Chat.

mosaicml.com/blog/mpt-30b
Chat demo here: huggingface.co/spaces/mosaicm…
Jun 8, 2023 5 tweets 2 min read
Wolfram Prompt Repository

A collection of ~200 basic and advanced LLM prompts for accomplishing tasks that range from fun to technical.

They are specific to Wolfram but there are a lot of interesting ideas on how prompts can be used and designed.

writings.stephenwolfram.com/2023/06/prompt…

Example 1: summarize the text in the style of an academic abstract
Jun 2, 2023 7 tweets 3 min read
As an ML engineer, I’ve spent a lot of time building forecasting models.

Now, I don’t have to build complex time series models or understand market signals.

Akkio makes forecasting quick, easy, and accurate with predictive AI.

Here’s how:

Last week, I hosted a webinar with @AkkioHQ, and here’s a glimpse at what we covered:

- Challenges working with time series data and forecasting models
- Best practices for cleaning and preparing data to improve forecasting
- Several use cases like web traffic forecasting and…
May 25, 2023 6 tweets 3 min read
Finetuning LLMs to call APIs

The paper presents Gorilla, a finetuned LLaMA-based model that surpasses GPT-4 at writing API calls. This capability can help identify the right API, boosting the ability of LLMs to interact with external tools to complete specific tasks.

Huge potential for…

What the system looks like.

The top part shows how the model is trained.

The bottom part shows how it performs inference (either using retrieval or zero-shot).

This seems like a really important layer for improving the reliability and effectiveness of LLMs that interact…
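The two inference modes mentioned above (retrieval vs. zero-shot) can be sketched as prompt construction. The prompt wording and the example API doc below are my assumptions, not Gorilla's actual templates:

```python
# Illustration of the two inference modes described for Gorilla:
# zero-shot uses the instruction alone, while retrieval-aware inference
# augments the prompt with an API doc fetched by a retriever, grounding
# the model's generated API call. Wording here is hypothetical.

def zero_shot_prompt(instruction: str) -> str:
    """Zero-shot: the model must recall the right API on its own."""
    return instruction


def retrieval_prompt(instruction: str, retrieved_api_doc: str) -> str:
    """Retrieval-aware: the fetched documentation grounds the API call."""
    return f"{instruction}\nUse this API documentation:\n{retrieved_api_doc}"


doc = "torchvision.models.resnet50(pretrained: bool) -> ResNet"
p = retrieval_prompt("Load an image classification model.", doc)
print("resnet50" in p)  # True
```

The retrieval mode is what lets the model stay correct as APIs change: update the doc store and the grounded prompt updates with it, no retraining needed.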
Apr 21, 2023 13 tweets 6 min read
Exciting new updates from Bard!

Bard now helps with code generation tasks like debugging and code explanation. Supports over 20 programming languages like C++ and Python.

The Export to Colab option is brilliant for quick experimentation and refinement. I love this!…

Started to test it a bit.

I asked, "Can you please help me implement a basic RNN and test it on dummy text data?"

Then I exported the generated code to Google Colab. One section of the code wasn't working...
Apr 20, 2023 4 tweets 2 min read
Prompt engineering tools are appearing everywhere!

We're also seeing a set of standardized prompt engineering techniques and tools to build effectively with LLMs.

Just this week: W&B Prompts (from @weights_biases): tools to support prompt engineering; allows for debugging LLM apps interactively, viewing latency, and other tracking features.

Apr 18, 2023 4 tweets 3 min read
LLM-based agents for performing complex scientific experiments.

Really interesting paper on developing LLM-based agents for autonomous design, planning, and execution of scientific experiments. If you're looking for good papers on LLMs, you should read this one.

We are also starting to see more use of vector search for improving the results of LLMs on more complex tasks. This is particularly important when the number of tokens you can pass to the LLM is limited. Selecting relevant documents is an effective approach.
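The document-selection idea can be shown in a toy sketch: embed the documents and the query, then keep only the most similar documents so the prompt fits the token budget. The vectors below are fake hand-made embeddings; a real system would use an embedding model and a vector database.

```python
# Toy illustration of vector search for selecting relevant docs.
# Fake 3-dim "embeddings" stand in for real embedding-model output.
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


docs = {
    "lab safety manual": [0.9, 0.1, 0.0],
    "reaction synthesis steps": [0.1, 0.9, 0.2],
    "cafeteria menu": [0.0, 0.1, 0.9],
}
query_vec = [0.2, 0.95, 0.1]  # e.g. "how do I synthesize compound X?"

# Keep only the single most relevant doc for the limited prompt window.
top_doc = max(docs, key=lambda name: cosine(docs[name], query_vec))
print(top_doc)  # reaction synthesis steps
```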