Existing DocVQA works focus on one of two methods:
https://twitter.com/ak92501/status/1530007802013417486

Toward more descriptive and distinctive caption generation, we propose using CLIP to calculate multimodal similarity and use it as a reward function. This avoids imitating only the reference caption and instead transfers fine-grained details from similar training images.
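
The sketch below shows one way such a reward could be computed: encode the image and each candidate caption with CLIP and take their cosine similarity, which could then drive a policy-gradient-style update of the captioning model. The checkpoint name and the `clip_reward` helper are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch: CLIP image-text similarity as a captioning reward.
# Assumes the Hugging Face "openai/clip-vit-base-patch32" checkpoint;
# clip_reward and the example captions are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


@torch.no_grad()
def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Cosine similarity between one image and each candidate caption."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    # Normalize the projected embeddings so the dot product is a cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # shape: (num_captions,)


# Example (hypothetical file and captions): score a sampled caption against a
# greedy baseline; their reward difference can weight a policy-gradient update.
# rewards = clip_reward(Image.open("photo.jpg"), ["a sampled caption", "a greedy caption"])
```

Because the reward depends only on image-text agreement rather than n-gram overlap with a reference, it naturally favors captions that mention distinctive visual details.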

Existing methods for vision-and-language (V+L) learning typically require designing architectures and training objectives tailored to each task.