Thread by @mervenoyann on Thread Reader App

this is the BEST vision language model I have ever tried!

Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩

They released the model under the Apache-2.0 license, along with fine-tuning scripts 👏
I tested it extensively, keep reading to learn more 🧶

The model is open-sourced here: huggingface.co/rhymes-ai/Aria

The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, plus inference examples: github.com/rhymes-ai/Aria

Try the demo here: rhymes.ai

It's super nice that you can get started with this model using @huggingface transformers 🤗

I saw in the paper that it can debug screenshots of code??? 🤯
So I tried it on a piece of code that calculates KL divergence and it understood it very well!
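
For context, a KL-divergence function of the kind one might screenshot could look like the minimal sketch below. It is not the code from the thread (that was only shared as an image); the name `kl_divergence` and the `eps` clamp are assumptions made for illustration. It computes KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(P || Q) in nats for two discrete distributions.

    `p` and `q` are equal-length sequences of probabilities.
    `eps` is an illustrative guard (an assumption, not from the thread)
    that keeps log() finite when some q_i is zero.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")

    total = 0.0
    for p_i, q_i in zip(p, q):
        # Terms with p_i == 0 contribute nothing, by the usual convention.
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    peaked = [1.0, 0.0]
    print(kl_divergence(uniform, uniform))  # identical distributions
    print(kl_divergence(peaked, uniform))   # all mass on one outcome vs. uniform
```

Identical distributions give zero divergence, and the measure is asymmetric, so swapping `p` and `q` generally changes the result.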

The model has very impressive OCR capabilities, even with bad handwriting 📝

Real world knowledge ⇓

Very good document understanding and reasoning skills (no need for CoT or fancy prompting)! 📑
