Our paper conducts a systematic study of vision-language instruction tuning. InstructBLIP substantially outperforms both BLIP-2 and the largest Flamingo model on zero-shot evaluation. It also achieves state-of-the-art performance when used as the model initialization for finetuning on individual downstream tasks. A minimal inference sketch is shown below.
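As a rough sketch of how zero-shot instruction following with InstructBLIP can be run, here is an example using the publicly released LAVIS library. The checkpoint name (`blip2_vicuna_instruct` / `vicuna7b`) and the image path are assumptions about one available release, not the only configuration.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load an InstructBLIP checkpoint from LAVIS.
# The name/model_type strings assume the Vicuna-7B variant; other variants
# (e.g. FlanT5-based) are also released.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

# "example.jpg" is a placeholder for any local image.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot instruction following: answer a free-form prompt about the image.
answer = model.generate({"image": image, "prompt": "What is unusual about this image?"})
print(answer)
```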
Example: BLIP-2 knows the landmarks of Singapore.
BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and sets a new SoTA on zero-shot captioning (121.6 CIDEr vs the previous best of 113.2). Equipped with powerful LLMs (e.g., OPT, FlanT5), BLIP-2 also unlocks zero-shot instructed vision-to-language generation.
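To illustrate both zero-shot captioning and zero-shot instructed generation, here is a sketch using the LAVIS interface, assuming the `blip2_t5` model with the `pretrain_flant5xl` checkpoint; the image file name is a placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a BLIP-2 checkpoint paired with a frozen FlanT5-XL language model
# (assumed checkpoint names from the public LAVIS release).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="pretrain_flant5xl",
    is_eval=True,
    device=device,
)

# "merlion.png" is a placeholder image, e.g. a photo of a Singapore landmark.
raw_image = Image.open("merlion.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot captioning: no text prompt, the model describes the image.
print(model.generate({"image": image}))

# Zero-shot instructed generation: steer the output with a text prompt.
print(model.generate({"image": image, "prompt": "Question: which city is this? Answer:"}))
```

The same `generate` call covers both modes; the only difference is whether a `prompt` string is passed alongside the image.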