merve · Oct 10 · 6 tweets
this is the BEST vision language model I have ever tried!

Aria is a new model by @rhymes_ai_: a 25.3B multimodal model that can take image/video inputs 🤩

They released the model with an Apache-2.0 license and fine-tuning scripts as well 👏
I tested it extensively, keep reading to learn more 🧶
The model is open-sourced here: huggingface.co/rhymes-ai/Aria

The authors have released fine-tuning examples on RefCOCO, NextQA and NLVR, plus inference examples: github.com/rhymes-ai/Aria

Try the demo here: rhymes.ai

It's super nice that you can get started with this model using @huggingface transformers 🤗
I saw in the paper that it can debug screenshots of code??? 🤯
So I tried it on a piece of code that calculates KL divergence, and it understood it very well!
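The snippet I tested was along these lines; a minimal NumPy sketch of discrete KL divergence (not the exact code from the screenshot):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # clip to avoid log(0) and division by zero
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# identical distributions have zero divergence
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```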
The model has very impressive OCR capabilities, even with bad handwriting 📝
Real-world knowledge ⇓
Very good document understanding and reasoning skills (no need for CoT or fancy prompting)! 📑

More from @mervenoyann

Oct 7
I'm bullish on this foundation OCR model called GOT 📝 @eccvconf

This model can transcribe anything and it's Apache-2.0!

Keep reading to learn more 🧶
This model can take in screenshots of tables or LaTeX and output formatted text. Music sheets, charts, literally anything, converted into a meaningful format!

Try it: huggingface.co/spaces/stepfun…
This model has the same architecture as other vision language models 👀 It consists of an image encoder, a projector and a text decoder.

What makes this model special in my opinion are two things:
1. A diverse, high-quality data mixture (hence the data engine)
2. Alignment technique
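That shared encoder → projector → decoder wiring can be sketched in a few lines of NumPy. The dimensions below are made up for illustration, not GOT's actual config:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions (illustrative only, not GOT's real sizes)
n_patches, d_vision, d_text = 256, 1024, 768

def image_encoder(image_patches):
    # stand-in for a ViT: one linear map over flattened patches
    W = rng.standard_normal((image_patches.shape[-1], d_vision)) * 0.02
    return image_patches @ W  # (n_patches, d_vision)

def projector(vision_tokens):
    # maps vision features into the text decoder's embedding space
    W = rng.standard_normal((d_vision, d_text)) * 0.02
    return vision_tokens @ W  # (n_patches, d_text)

patches = rng.standard_normal((n_patches, 32 * 32 * 3))  # flattened 32x32 RGB crops
vision_tokens = image_encoder(patches)
text_space_tokens = projector(vision_tokens)

# the text decoder would consume these alongside the text embeddings
print(text_space_tokens.shape)  # (256, 768)
```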
Sep 4
I was consulting some external folks about multimodal RAG and wanted to post the advice here

stop using OCR + LLMs. if you want to retrieve, use ColPali, if you want RAG from docs, use vision language models
my tweet might come off a bit sensational, but I'm mostly talking about documents that aren't just plain text. leaving these examples here: a PDF parser will parse them incorrectly and slowly
some folks had a hard time with this advice

just index and retrieve documents using ColPali, then feed the retrieved document (as context) along with text query to a VLM (Qwen2-VL, MiniCPMV, Llava, any VLM)
read more huggingface.co/blog/manu/colp…
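the "retrieve" step scores a query against each indexed page with ColBERT-style late interaction (MaxSim over multi-vector embeddings). a rough NumPy sketch of that scoring, with random vectors standing in for real ColPali embeddings:

```python
import numpy as np

def maxsim_score(query_emb, page_emb):
    """Late-interaction score: for each query token, take its best-matching
    page token (max dot product), then sum over query tokens."""
    sim = query_emb @ page_emb.T          # (n_query_tokens, n_page_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.standard_normal((16, 128))                        # 16 query tokens
pages = [rng.standard_normal((1030, 128)) for _ in range(3)]  # 3 indexed pages

scores = [maxsim_score(query, p) for p in pages]
best_page = int(np.argmax(scores))  # this page image goes to the VLM as context
```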
Aug 1
it's more convoluted to build a demo for SAM2 with text-to-box prompting

one needs to find the first frame where the object of interest appears, then give it to SAM2 as a mask and let it propagate
github.com/facebookresear…
one technical problem I had: for each text prompt, one needs to tune the threshold for the text → box prediction on a given frame to make sure the right mask is produced

of course that's solvable, but there's also the fact that Gradio doesn't have a component to pick frames like in Meta's demo
it also defeats the purpose of text-to-masklet prediction, because simply entering text should have been enough
I guess one could sample n frames, get some masks automatically and call add_new_mask, but then again they might be the same masklet
Jul 10
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but that's overkill and doesn't work well! 🤓

ColPali uses a vision language model, which is better in doc understanding 📑
keep reading ⤋
ColPali (MIT license!): huggingface.co/vidore/colpali
Blog post: huggingface.co/blog/manu/colp…

The authors also released a new benchmark for document retrieval, the ViDoRe Leaderboard, submit your model! huggingface.co/spaces/vidore/…
Regular document retrieval systems use OCR + layout detection + another model to retrieve information from documents, and then use the output representations in applications like RAG 🥲

Meanwhile modern image encoders demonstrate out-of-the-box document understanding capabilities!
Jul 1
Real-time DEtection Transformer (RT-DETR) landed in @huggingface transformers 🤩 with Apache 2.0 license 😍

do DETRs Beat YOLOs on Real-time Object Detection?
keep reading 👀
short answer: they do!

📖 notebook: github.com/merveenoyan/ex…
🔖 models: huggingface.co/models?search=…
🔖 demo: huggingface.co/spaces/merve/R…
📝 paper: huggingface.co/papers/2304.08…
YOLO models are known to be super fast for real-time computer vision, but they have a downside: their speed and accuracy are volatile to the NMS post-processing step 🥲
Transformer-based models on the other hand are computationally not as efficient 🥲
Isn't there something in between? Enter RT-DETR!
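For context, the NMS step YOLO depends on looks roughly like this; a minimal single-class sketch (real implementations are vectorized and multi-class), and the `iou_threshold` knob is exactly the kind of sensitivity I mean:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -> the two near-duplicate boxes collapse to one
```

RT-DETR skips this entirely: the transformer decoder predicts a fixed set of boxes end-to-end, so there is no threshold to tune and no NMS latency.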
Apr 22
DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license 😍📝
time to dive in and learn more 🧶
This model consists of a ViT-based visual encoder that takes in crops of the image as well as the original image itself

The encoder outputs then go through a convolution-based model; after that, they are merged with text and fed to an LLM
Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen

Then, for fine-tuning (on image captioning, VQA etc.), they freeze the vision encoder and train the H-Reducer and the LLM
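The point of a convolution-based reducer like H-Reducer is to shrink the number of visual tokens before the LLM sees them. A toy NumPy sketch of horizontal token merging (my simplification using averaging, not the paper's learned convolution):

```python
import numpy as np

def h_reduce(vision_tokens, ratio=4):
    """Merge each run of `ratio` horizontally adjacent visual tokens by
    averaging (a stand-in for a learned 1 x `ratio` convolution)."""
    h, w, d = vision_tokens.shape
    assert w % ratio == 0, "width must be divisible by the reduction ratio"
    return vision_tokens.reshape(h, w // ratio, ratio, d).mean(axis=2)

# toy ViT patch grid: 24 x 24 patches, 1024-dim features (illustrative sizes)
tokens = np.random.default_rng(0).standard_normal((24, 24, 1024))
reduced = h_reduce(tokens)
print(tokens.shape[0] * tokens.shape[1], "->",
      reduced.shape[0] * reduced.shape[1])  # 576 -> 144 visual tokens
```

Text reads left to right, so merging horizontally adjacent patches keeps reading order intact while cutting the LLM's input length by the reduction ratio.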
