Presenting our new V+L pretraining work: “Unifying Vision-and-Language Tasks via Text Generation”,
a single unified generative framework (VL-T5 / VL-BART) for diverse multimodal tasks!
Existing methods for V+L learning typically require designing task-specific architectures and objectives for each task.
For example: a multi-label answer classifier for VQA, a region scorer for referring expression comprehension, and a language decoder for image captioning.
To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective: multimodal conditional text generation, where the model learns to generate labels as text conditioned on the V+L inputs.
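To make this concrete, here is a minimal sketch (not the authors' code) of the core idea: heterogeneous tasks are reformatted into text-to-text pairs with a task prefix, so one seq2seq model and one cross-entropy objective cover all of them. The function name `format_example` and the exact prefix strings are illustrative assumptions.

```python
# Minimal sketch: cast different V+L tasks as (source_text, target_text) pairs
# so a single encoder-decoder model trains on all of them with the same
# text-generation objective. Names and prefixes here are hypothetical.

def format_example(task: str, text_input: str, label: str) -> tuple[str, str]:
    """Return (source_text, target_text) for a unified text-generation model."""
    if task == "vqa":
        # Answer classification becomes answer-text generation.
        return f"vqa: {text_input}", label          # e.g. target "sleeping"
    if task == "grounding":
        # Region scoring becomes generating a text token that refers to a region.
        return f"visual grounding: {text_input}", label
    if task == "caption":
        # Captioning is already text generation.
        return "caption:", label
    raise ValueError(f"unknown task: {task}")

# The encoder consumes image region features plus the prefixed text, and the
# decoder is trained to generate the target text with ordinary cross-entropy.
```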
On 7 popular V+L benchmarks (VQA, GQA, VCR, NLVR2, RefCOCOg, COCO caption, Multi30k), most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) achieves performance comparable to recent task-specific SOTA V+L models.
Moreover, our generative approach shows better generalization on VQA questions with rare answers. We also show that our framework supports multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models.