Toward more descriptive and distinctive caption generation, we propose computing multimodal similarity with CLIP and using it as a reward function. This avoids merely imitating the reference caption and instead transfers fine-grained details from similar training images.
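As a rough illustration of the idea, here is a minimal sketch of a CLIP-based reward: encode the image and a candidate caption, then take their cosine similarity as the reward signal. It assumes the openai `clip` package; the image path and caption are placeholders, and the paper's exact reward shaping and training loop are not reproduced here.

```python
import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_reward(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    used here as a (hypothetical) per-caption reward signal."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1).item()

# Example with placeholder inputs:
# reward = clip_reward("example.jpg", "a dog catching a frisbee on the beach")
```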
Feb 5, 2021 • 6 tweets • 4 min read
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example: a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, and a language decoder for image captioning.
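In contrast, the unified framework casts all of these tasks as conditional text generation with task-specific text prefixes. The snippet below is a hand-wavy sketch of that text-in / text-out interface using a plain text-only T5 from Hugging Face; the real VL-T5 / VL-BART additionally feed detected region features into the encoder, and the actual prefixes and target formats follow the paper and released code rather than the placeholders shown here.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical task prefixes illustrating a single text-in / text-out interface.
# VL-T5 / VL-BART also prepend visual (region) embeddings, omitted here.
examples = {
    "vqa":        "vqa: question: what is the man holding?",
    "captioning": "caption:",
    "grounding":  "visual grounding: the man in the red shirt",
}

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # not V+L pretrained

for task, prompt in examples.items():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=16)
    # Every task produces its answer as plain text from the same decoder.
    print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The point of the sketch: one generative decoder replaces the per-task classifier, region scorer, and captioning head listed above.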