Jaemin Cho Profile picture
PhD student at @unccs @uncnlp. Previously at @allen_ai, Naver, and SNU
Jun 7, 2022 8 tweets 5 min read
Want a captioning system to describe images in more detail & grammatically, but existing caption annotations are not fine-grained?

Check our #NAACL2022 Findings paper “Fine-grained Image Captioning with CLIP Reward”!

arxiv.org/abs/2205.13115

@AdobeResearch @uncnlp

🧵👇
(1/n) Toward more descriptive and distinctive caption generation, we propose using CLIP to calculate multimodal similarity and use it as a reward function. This avoids imitating only the reference caption and instead transfers fine-grained details from similar training images.

(2/n)
Feb 5, 2021 6 tweets 4 min read
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!

Arxiv: arxiv.org/abs/2102.02779

Work done w/ @jayleicn @HaoTan5 @mohitban47 (@uncnlp)

🧵1/n Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example, a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc.