Want a captioning system that describes images in more detail and with correct grammar, but existing caption annotations are not fine-grained?

Check out our #NAACL2022 Findings paper “Fine-grained Image Captioning with CLIP Reward”!

arxiv.org/abs/2205.13115

@AdobeResearch @uncnlp

🧵👇
(1/n)
Toward more descriptive and distinctive caption generation, we propose using CLIP to compute image-text similarity and using that similarity as a reward function. This avoids imitating only the reference caption and instead transfers fine-grained details from similar training images.

(2/n)
We found that using CLIP-S (@jmhessel et al.) as the reward provides such fine-grained guidance, but we also found that the model trained with it degenerates into repeated words. Since CLIP is trained only with a contrastive objective, its text encoder doesn't care about grammar.

(3/n)
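
A minimal sketch of the CLIP-S reward, assuming the CLIPScore definition of a rescaled, clipped cosine similarity, w * max(cos(image, text), 0) with w = 2.5. The embeddings below are toy vectors rather than real CLIP outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_s(image_emb, text_emb, w=2.5):
    """CLIP-S = w * max(cos(image, text), 0); w = 2.5 is assumed from CLIPScore."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy embeddings (hypothetical numbers, not real CLIP features).
image_emb = [1.0, 0.0]
matching_caption_emb = [1.0, 0.0]
unrelated_caption_emb = [-1.0, 0.0]

reward_match = clip_s(image_emb, matching_caption_emb)
reward_unrelated = clip_s(image_emb, unrelated_caption_emb)
```

Note that the score depends only on the two embeddings, so nothing in it penalizes an ungrammatical caption whose embedding stays close to the image.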
To address this, we next inject grammar knowledge into CLIP, by finetuning its text encoder w/o requiring extra grammar annotations. We create negative sentences by editing original ones, and learn an MLP head to classify whether a sentence is grammatically correct or not.

(4/n)
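
A rough sketch of how negative sentences could be built by editing original captions. The specific edit operations here (repeat, remove, shuffle) and the helper names are illustrative assumptions, not necessarily the exact procedure from the paper. Each (sentence, label) pair would then serve as training data for the MLP head on top of the text encoder, which is not shown.

```python
import random


def make_negative(caption, rng):
    """Corrupt a non-empty caption with one random token-level edit."""
    tokens = caption.split()
    op = rng.choice(["repeat", "remove", "shuffle"])
    i = rng.randrange(len(tokens))
    edited = tokens[:]
    if op == "repeat":
        edited.insert(i, tokens[i])      # duplicate one token
    elif op == "remove" and len(tokens) > 1:
        del edited[i]                    # drop one token
    else:
        rng.shuffle(edited)              # scramble word order
    if edited == tokens:
        # The edit was a no-op (e.g. shuffle returned the same order),
        # so force a change by repeating a token.
        edited.insert(i, tokens[i])
    return " ".join(edited)


def build_grammar_pairs(captions, seed=0):
    """Label originals as 1 (grammatical) and edited copies as 0."""
    rng = random.Random(seed)
    pairs = []
    for caption in captions:
        pairs.append((caption, 1))
        pairs.append((make_negative(caption, rng), 0))
    return pairs


pairs = build_grammar_pairs(["a dog runs on the grass", "two people ride bikes"])
```

Because the negatives come from automatic edits of existing captions, no extra grammar annotation is needed.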
The grammar score successfully addresses the text degeneration problem!

(5/n)
To comprehensively diagnose the aspect of caption descriptiveness / fine-grainedness, we introduce FineCapEval, a fine-grained caption evaluation dataset.

(6/n)
In our experiments, training with our CLIP-S + grammar reward produces more fine-grained captions and outperforms other rewards on FineCapEval across the board.
In addition, human evaluators strongly prefer our approach over the MLE and CIDEr-reward baselines.

(7/n)
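
A toy illustration of combining the two signals into one reward. The weighted-sum form, the `grammar_weight` parameter, and all numbers are assumptions made for illustration. `grammar_prob` stands for the grammar head's probability that a caption is grammatical.

```python
def combined_reward(clip_s_score, grammar_prob, grammar_weight=1.0):
    """CLIP-S plus a weighted grammar score (the weighting is hypothetical)."""
    return clip_s_score + grammar_weight * grammar_prob


# Hypothetical scores: the degenerate caption matches the image slightly
# better under CLIP-S, but the grammar head rates it as ungrammatical.
fluent = combined_reward(clip_s_score=2.0, grammar_prob=0.9)
degenerate = combined_reward(clip_s_score=2.1, grammar_prob=0.1)
```

With these toy numbers the fluent caption receives the higher total reward, even though CLIP-S alone would favor the repetitive one.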
Code: github.com/j-min/CLIP-Cap…

Thanks to all collaborators
@david_s_yoon @ajinkyakale @FranckDernoncou TrungBui @mohitban47
and reviewers for the feedback!
And thanks @ak92501 for the original tweet!

(8/n)

