Toward more descriptive and distinctive caption generation, we propose using CLIP to compute multimodal (image-text) similarity and using it as a reward function. This avoids merely imitating the reference captions and instead transfers fine-grained details from similar training images.
(2/n)
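For concreteness, here is a minimal sketch of a CLIP-S-style reward (not our training code) using the HuggingFace CLIP implementation; the weight w=2.5 follows Hessel et al.'s CLIPScore definition:

```python
# Minimal sketch: CLIP-S-style reward for a generated caption.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_s_reward(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """CLIP-S (Hessel et al.): w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)
```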
We found that using CLIP-S (@jmhessel et al.) as the reward provides such fine-grained guidance, but we also found that the model trained with it degenerates into repeated words. Since CLIP is trained only with a contrastive objective, its text encoder doesn't care about grammar.
(3/n)
To address this, we next inject grammar knowledge into CLIP by finetuning its text encoder w/o requiring extra grammar annotations. We create negative sentences by editing the original ones and learn an MLP head to classify whether a sentence is grammatically correct or not.
(4/n)
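A rough sketch of the idea (the specific edit operations and head architecture below are illustrative assumptions, not the exact paper recipe):

```python
# Sketch: build "ungrammatical" negatives by editing real captions and
# train an MLP head on CLIP text features to score grammaticality.
import random
import torch
import torch.nn as nn

def corrupt(caption: str) -> str:
    """Create a grammatically broken negative by editing the original sentence."""
    words = caption.split()
    op = random.choice(["shuffle", "repeat", "drop"])
    if op == "shuffle":
        random.shuffle(words)
    elif op == "repeat":           # mimic the repeated-word degeneration
        i = random.randrange(len(words))
        words = words[:i] + [words[i]] * 3 + words[i:]
    else:                          # drop a random word
        i = random.randrange(len(words))
        words = words[:i] + words[i + 1:]
    return " ".join(words)

class GrammarHead(nn.Module):
    """MLP on top of CLIP text features -> P(sentence is grammatical)."""
    def __init__(self, dim: int = 512):   # 512 = CLIP ViT-B/32 text feature dim
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(text_feats)).squeeze(-1)

# Training idea: label original captions 1 and corrupted ones 0, then
# optimize binary cross-entropy on the head's outputs.
```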
The grammar score successfully addresses the text degeneration problem!
(5/n)
To comprehensively evaluate the descriptiveness / fine-grainedness of captions, we introduce FineCapEval, a fine-grained caption evaluation dataset.
(6/n)
In our experiments, training with our CLIP-S + grammar reward produces more fine-grained captions and outperforms other rewards on FineCapEval across the board.
In addition, human evaluators also strongly prefer our approach over the MLE & CIDEr-reward baselines.
Check out M3DocRAG -- multimodal RAG for question answering on Multi-Modal & Multi-Page & Multi-Document contexts (+ a new open-domain benchmark + strong results on 3 benchmarks)!
⚡️Key Highlights:
➡️ M3DocRAG flexibly accommodates various settings:
- closed & open-domain document contexts (from a single-page doc to a corpus of many long docs)
- single & multi-hop questions
- diverse elements (text, table, image, etc.)
➡️ M3DocVQA is a new open-domain DocVQA benchmark where models must answer multi-hop questions (across multiple pages and documents) over 3K+ PDFs (w/ 40K+ pages)
➡️ Strong results on 3 benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), including SoTA results on MP-DocVQA
🧵👇
Existing DocVQA works typically follow one of two approaches:
(a) using multimodal LMs on single-page documents
-> can't handle long documents
(b) using an OCR+RAG pipeline on many/longer documents
-> ignores visual elements such as figures
M3DocRAG consists of 3 stages:
1) Extract visual embeddings (e.g., w/ ColPali) from each page image.
2) Retrieve the top-K pages (+ approximate indexing for faster search in the open-domain setting)
3) Generate an answer with a multimodal LM (e.g., Qwen2-VL) given the retrieved K pages.
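A minimal sketch of the pipeline, where `embed_page`, `embed_query`, and `answer_with_mllm` are hypothetical placeholders for a ColPali-style encoder and a multimodal LM like Qwen2-VL; single-vector page embeddings and exact inner-product search are simplifying assumptions (the open-domain setting swaps in approximate indexing):

```python
import numpy as np
import faiss  # nearest-neighbor search over page embeddings

def build_index(page_embeddings: np.ndarray) -> faiss.Index:
    """Stages 1-2a: index L2-normalized page embeddings for inner-product search."""
    dim = page_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)      # swap in an IVF index for large corpora
    index.add(page_embeddings.astype(np.float32))
    return index

def retrieve_pages(index: faiss.Index, query_emb: np.ndarray, k: int = 4):
    """Stage 2b: retrieve the top-K most relevant page IDs for the question."""
    scores, page_ids = index.search(query_emb.astype(np.float32)[None, :], k)
    return page_ids[0].tolist()

def answer(question, pages, embed_page, embed_query, answer_with_mllm) -> str:
    """Stage 3: feed the question + retrieved page images to a multimodal LM."""
    index = build_index(np.stack([embed_page(p) for p in pages]))
    top_ids = retrieve_pages(index, embed_query(question))
    return answer_with_mllm(question, [pages[i] for i in top_ids])
```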
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example: a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, a language decoder for image captioning, and so on.
To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the V+L inputs.
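A rough illustration of the unified format (the prefix strings and region-token name below are illustrative, not the exact VL-T5 vocabulary): every task becomes (image, prefixed text) -> target text, so a single encoder-decoder is trained with one LM cross-entropy loss.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_id: str      # visual features are prepended to the text encoder input
    source: str        # task prefix + task-specific text
    target: str        # label expressed as text

unified_batch = [
    Example("img_01", "vqa: what is the man holding?",  "a surfboard"),
    Example("img_02", "grounding: the dog on the left", "<vis_5>"),   # region id as text
    Example("img_03", "caption:",                       "two kids playing soccer"),
]
# One model, one objective: maximize log P(target | image features, source).
```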