Toward more descriptive and distinctive caption generation, we propose computing multimodal similarity with CLIP and using it as a reward function. This avoids merely imitating the reference caption and instead transfers fine-grained details from similar training images.
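As a rough illustration of the idea, here is a minimal sketch of a CLIP-based reward: encode the image and a candidate caption, then take their cosine similarity as the reward signal. It assumes the openai `clip` package; the image path and caption are placeholders, and the paper's exact reward shaping and training loop are not reproduced here.

```python
import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_reward(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    used here as a (hypothetical) per-caption reward signal."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1).item()

# Example with placeholder inputs:
# reward = clip_reward("example.jpg", "a dog catching a frisbee on the beach")
```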
Feb 5, 2021 • 6 tweets • 4 min read
Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example: a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, and a language decoder for image captioning.
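In contrast, the unified framework casts all of these tasks as conditional text generation with task-specific text prefixes. The snippet below is a hand-wavy sketch of that text-in / text-out interface using a plain text-only T5 from Hugging Face; the real VL-T5 / VL-BART additionally feed detected region features into the encoder, and the actual prefixes and target formats follow the paper and released code rather than the placeholders shown here.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical task prefixes illustrating a single text-in / text-out interface.
# VL-T5 / VL-BART also prepend visual (region) embeddings, omitted here.
examples = {
    "vqa":        "vqa: question: what is the man holding?",
    "captioning": "caption:",
    "grounding":  "visual grounding: the man in the red shirt",
}

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # not V+L pretrained

for task, prompt in examples.items():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=16)
    # Every task produces its answer as plain text from the same decoder.
    print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The point of the sketch: one generative decoder replaces the per-task classifier, region scorer, and captioning head listed above.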