merve (@mervenoyann)
Oct 10, 2024 · 6 tweets · 3 min read
this is the BEST vision language model I have ever tried!

Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩

They released the model with an Apache-2.0 license, along with fine-tuning scripts 👏
I tested it extensively; keep reading to learn more 🧶
The model is open-sourced here: huggingface.co/rhymes-ai/Aria

The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, as well as inference examples: github.com/rhymes-ai/Aria

Try the demo here: rhymes.ai

It's super nice that you can get started with this model using @huggingface transformers 🤗
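
Roughly the getting-started flow, as a minimal sketch based on the model card (the image URL and prompt here are placeholders):

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# placeholder image; use any PIL image you like
image = Image.open(requests.get("https://example.com/some_image.png", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image", "text": None},
    {"type": "text", "text": "What is in this image?"},
]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)  # match the bf16 weights
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(**inputs, max_new_tokens=256, stop_strings=["<|im_end|>"], tokenizer=processor.tokenizer)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
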
I saw in the paper that it can debug screenshots of code??? 🤯
So I tried it on a piece of code that calculates KL divergence, and it understood it very well!
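
For context, the snippet was a small discrete KL-divergence function along these lines (an illustrative reconstruction, not the exact code from my screenshot):

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

print(kl_divergence([0.4, 0.6], [0.5, 0.5]))  # small positive value; 0 only when P == Q
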
The model has very impressive OCR capabilities, even with bad handwriting 📝
Real-world knowledge ⇓
Very good document understanding and reasoning skills (no need for CoT or fancy prompting)! 📑

More from @mervenoyann

Feb 28
Microsoft released the most powerful vision-language-action model this week 🔥

MAGMA-8B can operate in both the physical and digital worlds: embodied robots, web automation and more! 🤯
Model: huggingface.co/microsoft/Magm…

Demo for UI agents: huggingface.co/spaces/microso…

New gaming agent demo based on Magma: huggingface.co/spaces/microso…
The authors follow a very typical multimodal architecture: they encode images and videos, concatenate them with the text representations, and feed everything to the LLM

The flavor comes from two new prompt formats: Set-of-Mark for action grounding and Trace-of-Mark for action planning
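
As a schematic of that generic recipe (my own sketch, not MAGMA's actual code): encode the pixels, project them into the LLM's embedding space, and prepend them to the text embeddings.

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    # Generic recipe: vision features -> linear projector -> concatenated LLM sequence
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning (B, N_patches, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = llm  # a decoder-only LM that accepts inputs_embeds

    def forward(self, pixel_values, text_embeds):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)  # visual tokens first, then text
        return self.llm(inputs_embeds=sequence)
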
Jan 27
people who are baffled by DeepSeek have been and still are sleeping on Qwen, InternLM, ByteDance and Tencent

here are a couple of fan-favorite models from them 🪭
Qwen2VL: the best open-source vision language model
Here's a demo for the 7B; use the 72B for longer context length and better performance:
huggingface.co/spaces/Ganymed…
Qwen released a new series of 1M context length models yesterday

here's a 14B one: huggingface.co/Qwen/Qwen2.5-1…
Jan 9
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2, with an MIT license 💗

The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) for images and videos ⏯️
take a look 🧶
The models come in 1B, 4B and 8B sizes; they use InternVL2.5 as the base architecture and Qwen2, Qwen2.5 or InternLM2 as the language model (depending on the checkpoint)

All of them are in this collection huggingface.co/collections/By…
The model is very interesting: it has a different encoder for each modality (visual prompt, text prompt, image and video), then concatenates their outputs to feed into the LLM 💬

The output segmentation tokens are passed to SAM2, which matches the text (captions or semantic classes) to masks ⤵️
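
In pseudocode, the dataflow described above looks roughly like this (my sketch, not the released implementation; all names are made up):

def sa2va_forward(image_or_video, text_prompt, visual_prompt=None):
    # One encoder per modality, concatenated into a single sequence for the LLM
    vis_tokens = vision_encoder(image_or_video)
    txt_tokens = text_encoder(text_prompt)
    vp_tokens = visual_prompt_encoder(visual_prompt) if visual_prompt is not None else []
    out_tokens = llm(concat(vis_tokens, vp_tokens, txt_tokens))
    # Special segmentation tokens in the output are handed to SAM2,
    # which ties each caption / semantic class to a mask
    seg_tokens = [t for t in out_tokens if is_segmentation_token(t)]
    masks = sam2(image_or_video, seg_tokens)
    return decode_text(out_tokens), masks
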
Jan 7
NVIDIA solved physics and open-sourced it? Can we just build our own autonomous robots now? 🤯

They released Cosmos: a new family of open world foundation models (WFMs) 🌌

Unwrapping the release and why it's so revolutionary 🧶
They have released ten models:
- four autoregressive video2world models
- four diffusion-based video2world and text2world models
- a prompt upsampler
- a content safety model

Collection on Hugging Face: huggingface.co/collections/nv…
Tech report: research.nvidia.com/publication/20…
Try them here: build.nvidia.com/nvidia/cosmos-…
World Foundation Models (WFMs) are essentially pre-trained on open-world video data; you can then fine-tune them on your specific application with fewer labels (be it autonomous driving or robotic arms)

This release matters so much for embodied applications because labelling frames for post-training of robotics models costs a lot of time; it will immensely accelerate the development of embodied AI 🤖
Dec 31, 2024
supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by @huggingface to make the LLM write code, do analysis and automate boring stuff!
smolagents is a barebones library to unlock both native and traditional tool calling for language models

LLMs can already write code and do reasoning, so why bother writing the glue around tools yourself?

The CodeAgent class is here for it! See it in action below
It is very easy to use CodeAgent!

Just initialize it with the tool of your choice and the model of your choice

See below how you can get started; you can use models through the HF Inference API as well as locally!
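
A minimal sketch of getting started (mirrors the launch README; HfApiModel calls the HF Inference API, and there is a TransformersModel class for running models locally instead):

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# HfApiModel() defaults to a code-capable model served via the HF Inference API;
# swap in TransformersModel(model_id=...) to run a model on your own machine.
model = HfApiModel()
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")

The agent writes and executes Python to answer the query, calling the tools you hand it as ordinary functions.
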
Oct 25, 2024
Microsoft released a groundbreaking model that can be used for web automation, with an MIT license 🔥👏

OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT4V in parsing. 👏
Model: huggingface.co/microsoft/Omni…
An interesting highlight for me was the model's Mind2Web (a benchmark for web navigation) capabilities, which unlock agentic behavior for RPA agents.

No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.
Lastly, the authors also fine-tune this model on open-set detection of interactable regions to see if it can serve as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏
