Toward more descriptive and distinctive caption generation, we propose using CLIP to compute multimodal (image-text) similarity and using it as a reward function. This avoids merely imitating the reference captions and instead transfers fine-grained details from similar training images.
(2/n)
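For concreteness, here is a minimal sketch of a CLIP-S-style reward (not our training code) using the HuggingFace CLIP implementation; the weight w=2.5 follows Hessel et al.'s CLIPScore definition:

```python
# Minimal sketch: CLIP-S-style reward for a generated caption.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_s_reward(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """CLIP-S (Hessel et al.): w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)
```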
We found that using CLIP-S (@jmhessel et al.) as the reward provides such fine-grained guidance, but we also found that the model trained with it degenerates into repeated words. Since CLIP is trained only with a contrastive objective, its text encoder doesn't care about grammar.
(3/n)
To address this, we next inject grammar knowledge into CLIP by finetuning its text encoder w/o requiring extra grammar annotations. We create negative sentences by editing the original ones and learn an MLP head to classify whether a sentence is grammatically correct or not.
(4/n)
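A rough sketch of the idea (the specific edit operations and head architecture below are illustrative assumptions, not the exact paper recipe):

```python
# Sketch: build "ungrammatical" negatives by editing real captions and
# train an MLP head on CLIP text features to score grammaticality.
import random
import torch
import torch.nn as nn

def corrupt(caption: str) -> str:
    """Create a grammatically broken negative by editing the original sentence."""
    words = caption.split()
    op = random.choice(["shuffle", "repeat", "drop"])
    if op == "shuffle":
        random.shuffle(words)
    elif op == "repeat":           # mimic the repeated-word degeneration
        i = random.randrange(len(words))
        words = words[:i] + [words[i]] * 3 + words[i:]
    else:                          # drop a random word
        i = random.randrange(len(words))
        words = words[:i] + words[i + 1:]
    return " ".join(words)

class GrammarHead(nn.Module):
    """MLP on top of CLIP text features -> P(sentence is grammatical)."""
    def __init__(self, dim: int = 512):   # 512 = CLIP ViT-B/32 text feature dim
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(text_feats)).squeeze(-1)

# Training idea: label original captions 1 and corrupted ones 0, then
# optimize binary cross-entropy on the head's outputs.
```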
The grammar score successfully addresses the text degeneration problem!
(5/n)
To comprehensively evaluate the descriptiveness / fine-grainedness of captions, we introduce FineCapEval, a fine-grained caption evaluation dataset.
(6/n)
In our experiments, training with our CLIP-S + grammar reward produces more fine-grained captions and outperforms other rewards on FineCapEval across the board.
In addition, human evaluators also strongly prefer our approach over the MLE & CIDEr-reward baselines.
Check out M3DocRAG -- multimodal RAG for question answering on Multi-Modal & Multi-Page & Multi-Document contexts (+ a new open-domain benchmark + strong results on 3 benchmarks)!
⚡️Key Highlights:
➡️ M3DocRAG flexibly accommodates various settings:
- closed & open-domain document contexts (from a single-page doc to a corpus of many long docs)
- single & multi-hop questions
- diverse elements (text, table, image, etc.)
➡️ M3DocVQA is a new open-domain DocVQA benchmark where models must answer multi-hop questions (across multiple pages and documents) over 3K+ PDFs (w/ 40K+ pages)
➡️ Strong results on 3 benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), including SoTA results on MP-DocVQA
🧵👇
Existing DocVQA works typically follow one of two approaches:
(a) using multimodal LMs on single-page documents
-> can't handle long documents
(b) using an OCR+RAG pipeline on many/longer documents
-> ignores visual elements such as figures
M3DocRAG consists of 3 stages:
1) Extract visual embeddings (e.g., w/ ColPali) from each page image.
2) Retrieve the top-K pages (+ approximate indexing for faster search in the open-domain setting)
3) Generate an answer with a multimodal LM (e.g., Qwen2-VL) given the retrieved K pages.
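A minimal sketch of the pipeline, where `embed_page`, `embed_query`, and `answer_with_mllm` are hypothetical placeholders for a ColPali-style encoder and a multimodal LM like Qwen2-VL; single-vector page embeddings and exact inner-product search are simplifying assumptions (the open-domain setting swaps in approximate indexing):

```python
import numpy as np
import faiss  # nearest-neighbor search over page embeddings

def build_index(page_embeddings: np.ndarray) -> faiss.Index:
    """Stages 1-2a: index L2-normalized page embeddings for inner-product search."""
    dim = page_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)      # swap in an IVF index for large corpora
    index.add(page_embeddings.astype(np.float32))
    return index

def retrieve_pages(index: faiss.Index, query_emb: np.ndarray, k: int = 4):
    """Stage 2b: retrieve the top-K most relevant page IDs for the question."""
    scores, page_ids = index.search(query_emb.astype(np.float32)[None, :], k)
    return page_ids[0].tolist()

def answer(question, pages, embed_page, embed_query, answer_with_mllm) -> str:
    """Stage 3: feed the question + retrieved page images to a multimodal LM."""
    index = build_index(np.stack([embed_page(p) for p in pages]))
    top_ids = retrieve_pages(index, embed_query(question))
    return answer_with_mllm(question, [pages[i] for i in top_ids])
```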
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example: a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, a language decoder for image captioning, and so on.
To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the V+L inputs.
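A rough illustration of the unified format (the prefix strings and region-token name below are illustrative, not the exact VL-T5 vocabulary): every task becomes (image, prefixed text) -> target text, so a single encoder-decoder is trained with one LM cross-entropy loss.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_id: str      # visual features are prepended to the text encoder input
    source: str        # task prefix + task-specific text
    target: str        # label expressed as text

unified_batch = [
    Example("img_01", "vqa: what is the man holding?",  "a surfboard"),
    Example("img_02", "grounding: the dog on the left", "<vis_5>"),   # region id as text
    Example("img_03", "caption:",                       "two kids playing soccer"),
]
# One model, one objective: maximize log P(target | image features, source).
```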