This is the BEST vision language model I have ever tried!
Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩
They released the model under the Apache-2.0 license, along with fine-tuning scripts
I tested it extensively, keep reading to learn more 🧶
The model is open-sourced here: huggingface.co/rhymes-ai/Aria
The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR and inference examples: github.com/rhymes-ai/Aria
Try the demo here: rhymes.ai
It's super nice that you can get started with this model using @huggingface transformers 🤗
I saw in the paper that it can debug a screenshot of code??? 🤯
So I tried it on a piece of code that calculates KL-div, and it understood it very well!
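For context, here is roughly the kind of snippet I mean. This is a minimal sketch, not the exact code from my screenshot: the function name `kl_divergence` and the `eps` guard are my own assumptions for illustration. It computes the discrete KL divergence, the sum over i of p_i * log(p_i / q_i), in nats.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q) in nats.

    Hypothetical helper for illustration; `eps` is an assumed guard
    that avoids dividing by zero when q has zero-probability entries.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    print(kl_divergence(p, q))  # KL(p || q)
    print(kl_divergence(q, p))  # differs from the above: KL is not symmetric
```

A screenshot of something like this, with a bug planted in it (say, the ratio flipped to `q_i / p_i`), is a fair test of whether a model actually reads the code.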
The model has very impressive OCR capabilities, even with bad handwriting
Real-world knowledge ✅
Very good document understanding and reasoning skills (no need for CoT or fancy prompting)!