Toward more descriptive and distinctive caption generation, we propose using CLIP to compute image-text similarity and use it as a reward function. Instead of only imitating the reference caption, this transfers fine-grained details from similar training images. (rough sketch of the reward below)
(2/n)
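To make the reward concrete, here is a minimal sketch of a CLIP-S-style reward, i.e. a rescaled, clipped image-caption cosine similarity. The HuggingFace CLIP checkpoint and the weight w = 2.5 are assumptions for illustration, not necessarily the exact training setup.

```python
# Sketch: CLIP-S-style caption reward, reward = w * max(cos(image_emb, text_emb), 0).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_s_reward(image: Image.Image, captions: list, w: float = 2.5) -> torch.Tensor:
    """Return one CLIP-S reward per candidate caption for a single image."""
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    img_emb = F.normalize(model.get_image_features(**image_inputs), dim=-1)  # (1, d)
    txt_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)    # (B, d)
    cos = (txt_emb @ img_emb.T).squeeze(-1)                                  # (B,)
    return w * cos.clamp(min=0.0)
```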
We found that using CLIP-S (@jmhessel et al.) as the reward provides such fine-grained guidance, but we also found that the model trained with it degenerates into repeated words. Since CLIP is trained only with a contrastive objective, its text encoder doesn't care about grammar.
(3/n)
To address this, we next inject grammar knowledge into CLIP by finetuning its text encoder w/o requiring extra grammar annotations. We create negative sentences by editing the original ones and learn an MLP head to classify whether a sentence is grammatically correct or not. (sketch below)
(4/n)
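A rough sketch of this grammar-finetuning idea: corrupt reference captions to get ungrammatical negatives and train a small MLP head on the CLIP text embedding to tell the two apart. The specific edit operations here (token repetition, deletion, shuffling) and the MLP shape are illustrative assumptions.

```python
# Sketch: rule-based negatives + binary grammar head on CLIP text embeddings.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_negative(caption: str) -> str:
    """Corrupt a caption with a random rule-based edit (illustrative ops)."""
    tokens = caption.split()
    op = random.choice(["repeat", "delete", "shuffle"])
    if op == "repeat" and tokens:
        i = random.randrange(len(tokens))
        tokens = tokens[: i + 1] + [tokens[i]] * random.randint(1, 3) + tokens[i + 1 :]
    elif op == "delete" and len(tokens) > 2:
        del tokens[random.randrange(len(tokens))]
    else:
        random.shuffle(tokens)
    return " ".join(tokens)

class GrammarHead(nn.Module):
    """MLP on top of the CLIP text embedding, predicting a grammaticality logit."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb).squeeze(-1)

def grammar_loss(head: GrammarHead, pos_emb: torch.Tensor, neg_emb: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: original captions = 1, edited negatives = 0."""
    logits = torch.cat([head(pos_emb), head(neg_emb)])
    labels = torch.cat([torch.ones(len(pos_emb)), torch.zeros(len(neg_emb))])
    return F.binary_cross_entropy_with_logits(logits, labels)
```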
The grammar score successfully addresses the text degeneration problem!
(5/n)
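For reference, a minimal sketch of how the CLIP-S reward and the grammar head's output could be combined into a single caption reward; the weighting `lam` is an assumed hyperparameter, not the paper's exact value.

```python
# Sketch: combine image-text similarity with a grammaticality bonus.
import torch

def combined_reward(clip_s: torch.Tensor, grammar_logit: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """CLIP-S reward plus a grammaticality bonus per sampled caption."""
    grammar_prob = torch.sigmoid(grammar_logit)  # P(caption is grammatical)
    return clip_s + lam * grammar_prob

# In RL-style caption training (e.g., self-critical sequence training), this reward
# scores sampled captions, pushing the captioner toward outputs that are both
# image-specific and well-formed.
```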
To comprehensively diagnose how descriptive / fine-grained captions are, we introduce FineCapEval, a fine-grained caption evaluation dataset.
(6/n)
In our experiments, training with our CLIP-S + grammar reward produces more fine-grained captions and outperforms other rewards on FineCapEval across the board.
In addition, human evaluators also strongly prefer our approach over the MLE & CIDEr-reward baselines.
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example, a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, and a language decoder for image captioning.
To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels as text conditioned on the V+L inputs. (sketch below)
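For a concrete picture of "labels as text", here is a minimal sketch of how different tasks could be cast into the same text-to-text interface with task prefixes; the exact prefix strings and the region-token format are illustrative assumptions, not the released prompts.

```python
# Sketch: every task becomes "task prefix + text input" -> "label as text",
# conditioned on the image features fed to the multimodal encoder.
def format_example(task: str, text_in: str, label: str):
    prefixes = {
        "vqa": "vqa:",                   # target: the answer as text, e.g. "2"
        "grounding": "visual grounding:",# target: a region id token, e.g. "<vis_3>"
        "caption": "caption:",           # target: the caption itself
    }
    return f"{prefixes[task]} {text_in}".strip(), label

# All of these are trained with the same generative LM objective:
print(format_example("vqa", "How many dogs are in the picture?", "2"))
print(format_example("grounding", "the dog on the left", "<vis_3>"))
print(format_example("caption", "", "a dog chasing a frisbee in the park"))
```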