merve · Oct 10 · 6 tweets
this is the BEST vision language model I have ever tried!

Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩

They released the model with an Apache-2.0 license and fine-tuning scripts as well 👏
I tested it extensively, keep reading to learn more 🧶
The model is open-sourced here: huggingface.co/rhymes-ai/Aria

The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, plus inference examples: github.com/rhymes-ai/Aria

Try the demo here: rhymes.ai

It's super nice that you can get started with this model using @huggingface transformers 🤗
I saw in the paper that it can debug screenshots of code??? 🤯
So I tried it on a piece of code that calculates KL divergence, and it understood it very well!
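The snippet I tested was along these lines; a minimal NumPy sketch of discrete KL divergence (not the exact code from the screenshot):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # clip to avoid log(0) and division by zero
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# identical distributions have zero divergence
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```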
The model has very impressive OCR capabilities, even with bad handwriting 📝
Real-world knowledge ⇓
Very good document understanding and reasoning skills (no need for CoT or fancy prompting)! 📑

More from @mervenoyann

Oct 7
I'm bullish on this foundation OCR model called GOT 📝 @eccvconf

This model can transcribe anything and it's Apache-2.0!

Keep reading to learn more 🧶
This model can take in screenshots of tables or LaTeX and output formatted text. Music sheets, charts, literally anything, converted into a meaningful format!

Try it: huggingface.co/spaces/stepfun…
This model has the same architecture as other vision language models 👀 It consists of an image encoder, a projector and a text decoder.

What makes this model special in my opinion are two things:
1. A diverse, high-quality data mixture (hence the data engine)
2. Alignment technique
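That shared encoder → projector → decoder wiring can be sketched in a few lines of NumPy. The dimensions below are made up for illustration, not GOT's actual config:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions (illustrative only, not GOT's real sizes)
n_patches, d_vision, d_text = 256, 1024, 768

def image_encoder(image_patches):
    # stand-in for a ViT: one linear map over flattened patches
    W = rng.standard_normal((image_patches.shape[-1], d_vision)) * 0.02
    return image_patches @ W  # (n_patches, d_vision)

def projector(vision_tokens):
    # maps vision features into the text decoder's embedding space
    W = rng.standard_normal((d_vision, d_text)) * 0.02
    return vision_tokens @ W  # (n_patches, d_text)

patches = rng.standard_normal((n_patches, 32 * 32 * 3))  # flattened 32x32 RGB crops
vision_tokens = image_encoder(patches)
text_space_tokens = projector(vision_tokens)

# the text decoder would consume these alongside the text embeddings
print(text_space_tokens.shape)  # (256, 768)
```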
Sep 4
I was consulting some external folks about multimodal RAG and wanted to post the advice here

stop using OCR + LLMs. if you want to retrieve, use ColPali, if you want RAG from docs, use vision language models
my tweet might come off a bit sensational, but I'm mostly talking about documents that aren't just plain text. leaving these examples here: a PDF parser will parse them incorrectly and slowly
some folks had a hard time with this advice

just index and retrieve documents using ColPali, then feed the retrieved document (as context) along with text query to a VLM (Qwen2-VL, MiniCPMV, Llava, any VLM)
read more huggingface.co/blog/manu/colp…
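the "retrieve" step scores a query against each indexed page with ColBERT-style late interaction (MaxSim over multi-vector embeddings). a rough NumPy sketch of that scoring, with random vectors standing in for real ColPali embeddings:

```python
import numpy as np

def maxsim_score(query_emb, page_emb):
    """Late-interaction score: for each query token, take its best-matching
    page token (max dot product), then sum over query tokens."""
    sim = query_emb @ page_emb.T          # (n_query_tokens, n_page_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.standard_normal((16, 128))                        # 16 query tokens
pages = [rng.standard_normal((1030, 128)) for _ in range(3)]  # 3 indexed pages

scores = [maxsim_score(query, p) for p in pages]
best_page = int(np.argmax(scores))  # this page image goes to the VLM as context
```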
Aug 1
it's more convoluted to build a demo for SAM2 with text-to-box prompting

one needs to find the first frame where the object of interest appears, then give it to SAM2 as a mask and let it propagate
github.com/facebookresear…
one technical problem I had: for each text prompt, one needs to tune the threshold for the text → box prediction on a given frame to make sure the right mask is produced

of course that's solvable, but there's also the fact that Gradio doesn't have a component to pick frames like in Meta's demo
it also defeats the purpose of text-to-masklet prediction, because simply entering text should have been enough
I guess one could sample n frames, get some masks automatically and call add_new_mask, but then again they might be the same masklet
Jul 10
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but that's overkill and doesn't work well! 🤓

ColPali uses a vision language model, which is better in doc understanding 📑
keep reading ⤋
ColPali (MIT license!): huggingface.co/vidore/colpali
Blog post: huggingface.co/blog/manu/colp…

The authors also released a new benchmark for document retrieval, the ViDoRe Leaderboard, submit your model! huggingface.co/spaces/vidore/…
Regular document retrieval systems use OCR + layout detection + another model to retrieve information from documents, and then use the output representations in applications like RAG 🥲

Meanwhile modern image encoders demonstrate out-of-the-box document understanding capabilities!
Jul 1
Real-time DEtection Transformer (RT-DETR) landed in @huggingface transformers 🤩 with Apache 2.0 license 😍

do DETRs Beat YOLOs on Real-time Object Detection?
keep reading 👀
short answer: they do!

📖 notebook: github.com/merveenoyan/ex…
🔖 models: huggingface.co/models?search=…
🔖 demo: huggingface.co/spaces/merve/R…
📝 paper: huggingface.co/papers/2304.08…
YOLO models are known to be super fast for real-time computer vision, but they have a downside: their speed and accuracy are volatile to the NMS post-processing step 🥲
Transformer-based models on the other hand are computationally not as efficient 🥲
Isn't there something in between? Enter RT-DETR!
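For context, the NMS step YOLO depends on looks roughly like this; a minimal single-class sketch (real implementations are vectorized and multi-class), and the `iou_threshold` knob is exactly the kind of sensitivity I mean:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -> the two near-duplicate boxes collapse to one
```

RT-DETR skips this entirely: the transformer decoder predicts a fixed set of boxes end-to-end, so there is no threshold to tune and no NMS latency.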
Apr 22
DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license 😍📝
time to dive in and learn more 🧶
This model consists of a ViT-based visual encoder that takes in crops of the image as well as the original image itself

The encoder outputs then go through a convolution-based model; after that, they are merged with text and fed to an LLM
Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen

Then, for fine-tuning (on image captioning, VQA etc.), they freeze the vision encoder and train the H-Reducer and the LLM
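The point of a convolution-based reducer like H-Reducer is to shrink the number of visual tokens before the LLM sees them. A toy NumPy sketch of horizontal token merging (my simplification using averaging, not the paper's learned convolution):

```python
import numpy as np

def h_reduce(vision_tokens, ratio=4):
    """Merge each run of `ratio` horizontally adjacent visual tokens by
    averaging (a stand-in for a learned 1 x `ratio` convolution)."""
    h, w, d = vision_tokens.shape
    assert w % ratio == 0, "width must be divisible by the reduction ratio"
    return vision_tokens.reshape(h, w // ratio, ratio, d).mean(axis=2)

# toy ViT patch grid: 24 x 24 patches, 1024-dim features (illustrative sizes)
tokens = np.random.default_rng(0).standard_normal((24, 24, 1024))
reduced = h_reduce(tokens)
print(tokens.shape[0] * tokens.shape[1], "->",
      reduced.shape[0] * reduced.shape[1])  # 576 -> 144 visual tokens
```

Text reads left to right, so merging horizontally adjacent patches keeps reading order intact while cutting the LLM's input length by the reduction ratio.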
