I was giving a consultancy session on multimodal RAG to some external folks and wanted to post the takeaways here
stop using OCR + LLMs. if you want retrieval, use ColPali; if you want RAG over docs, use vision language models
my tweet might come off a bit sensational, but I'm mostly talking about documents that aren't just plain text. leaving these here: PDF parsers will parse them incorrectly and slowly
just index and retrieve documents with ColPali, then feed the retrieved page (as context) along with the text query to a VLM (Qwen2-VL, MiniCPM-V, LLaVA, any VLM)
read more huggingface.co/blog/manu/colp…
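here's a minimal sketch of that pipeline, assuming the colpali_engine package (ColPali, ColPaliProcessor, score_multi_vector) and Qwen2-VL through transformers; the model IDs, file names, and query are placeholders on my end, not from the blog post:

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# 1) embed document pages and the query with ColPali (multi-vector, late interaction)
retriever = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="cuda")
retriever_processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

pages = [Image.open(p) for p in ["page_0.png", "page_1.png", "page_2.png"]]
query = "What was the revenue in Q3?"

with torch.no_grad():
    page_emb = retriever(**retriever_processor.process_images(pages).to(retriever.device))
    query_emb = retriever(**retriever_processor.process_queries([query]).to(retriever.device))

# late-interaction scores: pick the best-matching page image for the query
scores = retriever_processor.score_multi_vector(query_emb, page_emb)
best_page = pages[scores[0].argmax().item()]

# 2) feed the retrieved page image + the text query to a VLM (Qwen2-VL here)
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": query}]}]
prompt = vlm_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = vlm_processor(text=[prompt], images=[best_page], return_tensors="pt").to(vlm.device)

out = vlm.generate(**inputs, max_new_tokens=256)
print(vlm_processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

in practice you'd embed and index the page images once offline and only run the query embedding + scoring at request time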
it's more convoluted than you'd think to build a text-to-box prompting demo on top of SAM2
one needs to find the first frame where the object of interest appears, then give it to SAM2 as a mask and let it propagate: github.com/facebookresear…
one technical problem I had: for each text prompt, one needs to tune the threshold of the text -> box prediction on a given frame to make sure the right mask comes out
ofc that's not a blocker, but there's also the fact that Gradio doesn't have a component to pick frames like Meta's demo does
but it also defeats the purpose of text-to-masklet prediction, because simply entering text should have been enough
I guess one could sample n frames, get some masks automatically and call add_new_mask, but then again they might belong to the same masklet (rough sketch of the pipeline below)
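here's roughly how the text -> box -> masklet pipeline looks, assuming the transformers zero-shot object detection pipeline (OWLv2 here) for the text -> box step and the sam2 repo's video predictor API (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video); the config/checkpoint paths, frame index, label, and detection threshold are placeholders you'd have to tune per prompt:

```python
import torch
from PIL import Image
from transformers import pipeline
from sam2.build_sam import build_sam2_video_predictor

# text -> box on the first frame where the object shows up
# (the threshold is the fiddly part: too high and nothing is detected, too low and you get the wrong box)
detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
first_frame_idx = 0
first_frame = Image.open("frames/00000.jpg")
detections = detector(first_frame, candidate_labels=["a red car"], threshold=0.3)
box = detections[0]["box"]  # {"xmin": ..., "ymin": ..., "xmax": ..., "ymax": ...}

# box prompt -> mask on that frame, then propagate the masklet through the whole video
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
state = predictor.init_state(video_path="frames")  # directory of JPEG frames

with torch.inference_mode():
    predictor.add_new_points_or_box(
        state,
        frame_idx=first_frame_idx,
        obj_id=1,
        box=[box["xmin"], box["ymin"], box["xmax"], box["ymax"]],
    )
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```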
Regular document retrieval systems use OCR + layout detection + yet another model to retrieve information from documents, and then use the output representations in applications like RAG 🥲
Meanwhile modern image encoders demonstrate out-of-the-box document understanding capabilities!
@huggingface YOLO models are known to be super fast for real-time computer vision, but they have a downside: their outputs are sensitive to NMS post-processing 🥲
Transformer-based detectors, on the other hand, are not as computationally efficient 🥲
Isn't there something in between? Enter RT-DETR!
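a quick sketch of trying RT-DETR through transformers; the checkpoint (PekingU/rtdetr_r50vd), image file, and score threshold are just example choices, but note there's no NMS step, only a confidence threshold in post-processing:

```python
import torch
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# end-to-end detections: no NMS, just a score threshold
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), [round(c, 1) for c in box.tolist()])
```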