merve
open-sourceress at @huggingface 🧙🏻‍♀️ proud Mediterranean. I work on: zero-shot vision & VLMs, large multimodal models, transformers, smol-vision

Oct 10, 2024, 6 tweets

this is the BEST vision language model I have ever tried!

Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩

They released the model under the Apache-2.0 license, along with fine-tuning scripts
I tested it extensively, keep reading to learn more 🧶

The model is open-sourced here: huggingface.co/rhymes-ai/Aria

The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, plus inference examples: github.com/rhymes-ai/Aria

Try the demo here: rhymes.ai

It's super nice that you can get started with this model using @huggingface transformers 🤗

I saw in the paper that it can debug screenshots of code??? 🤯
So I tried it on a piece of code that calculates KL divergence and it understood it very well!
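
The screenshot from the thread is not reproduced here. For context, this is a minimal sketch of the kind of KL divergence snippet such a test could use. It is an illustration only, not the code from the original screenshot. The function name `kl_divergence` and the choice of discrete distributions passed as plain lists are assumptions.

```python
import math


def kl_divergence(p, q):
    """Return KL(P || Q) in nats for two discrete distributions.

    `p` and `q` are sequences of probabilities over the same outcomes.
    Hypothetical helper for illustration; not taken from the thread.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")

    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            # By convention, 0 * log(0 / q) contributes nothing.
            continue
        if qi == 0:
            # P puts mass where Q has none: the divergence is infinite.
            return math.inf
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

Note that KL divergence is not symmetric, so swapping `p` and `q` generally gives a different value.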

The model has very impressive OCR capabilities, even with bad handwriting

Real-world knowledge ⇓

Very good document understanding and reasoning skills (no need for CoT or fancy prompting)! 📑
