A new member of the BLIP family: 🔥InstructBLIP🔥, a vision-language instruction tuning framework. InstructBLIP achieves SoTA zero-shot performance and offers various advantages over other multimodal models such as GPT-4!
Github: github.com/salesforce/LAV…
Paper: arxiv.org/abs/2305.06500
Our paper conducts a systematic study of vision-language instruction tuning. InstructBLIP substantially outperforms both BLIP-2 and the largest Flamingo on zero-shot evaluation. It also achieves SoTA finetuning performance when used as the model initialization for downstream tasks.
In addition, we introduce instruction-aware visual feature extraction, a new method that enables the model to extract informative features tailored to the given instruction, leading to enhanced generalization performance.
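For intuition, here is a minimal, purely illustrative PyTorch sketch of the idea (not the paper's implementation; all dimensions, modules, and tensors are stand-ins): the instruction tokens join the learnable query tokens in self-attention, so the queries that later cross-attend to the frozen image features are already conditioned on what the instruction asks for.

```python
# Illustrative sketch of instruction-aware visual feature extraction.
# Not the actual Q-Former code; shapes and modules are placeholders.
import torch
import torch.nn as nn

d = 768
queries = torch.randn(1, 32, d)       # learnable query tokens (fixed count)
instruction = torch.randn(1, 12, d)   # embedded instruction tokens, e.g. "What is unusual about this image?"
image_feats = torch.randn(1, 257, d)  # frozen image-encoder output (e.g. ViT patch features)

self_attn = nn.MultiheadAttention(d, num_heads=12, batch_first=True)
cross_attn = nn.MultiheadAttention(d, num_heads=12, batch_first=True)

# Queries and instruction interact first, so the queries "know" the instruction...
x = torch.cat([queries, instruction], dim=1)
x, _ = self_attn(x, x, x)
q = x[:, :32]                         # keep only the query positions

# ...then only the instruction-conditioned queries read from the image.
visual_feats, _ = cross_attn(q, image_feats, image_feats)
print(visual_feats.shape)             # torch.Size([1, 32, 768])
```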
We open-source a suite of InstructBLIP models using two families of LLMs: FlanT5 and Vicuna. Using our LAVIS library, you can run these models with two lines of code! github.com/salesforce/LAV…
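As a minimal sketch of LAVIS usage (model identifiers here are taken from the LAVIS model zoo and may change; check the repo for the current names):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load an InstructBLIP variant (assumed Vicuna-7B identifier) plus its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)

# Run an instruction against a local image (path is a placeholder).
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image, "prompt": "What is unusual about this image?"}))
```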
InstructBLIP demonstrates a variety of strong multimodal capabilities, including complex visual scene understanding and reasoning, knowledge-grounded image description, multi-turn visual conversation, etc. Check out this demo video!
InstructBLIP demonstrates strong visual reasoning on complex scenes, generalizing beyond its training data to out-of-distribution (OOD) images.
🔥BLIP-2🔥 demo is live! Come play with LLMs that can understand images and share your examples! huggingface.co/spaces/Salesfo…
Project page: github.com/salesforce/LAV…
BLIP-2 knows mass–energy equivalence! More examples in the 🧵
Can LLMs understand images? We introduce 🔥BLIP-2🔥, a generic and efficient vision-language pre-training strategy that bootstraps from frozen❄️image encoders and frozen❄️LLMs. BLIP-2 outperforms existing SoTAs with only 188M trainable parameters!
Github: github.com/salesforce/LAV…
BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and sets a new SoTA on zero-shot captioning (121.6 CIDEr vs the previous best 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks zero-shot instructed vision-to-language generation!
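As an example of the instructed generation, here is a hedged LAVIS sketch with a FlanT5-based BLIP-2 (model names are from the LAVIS model zoo; verify against the repo):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a BLIP-2 variant with a FlanT5-XL language model (the LLM stays frozen during pre-training).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

image = vis_processors["eval"](Image.open("photo.jpg").convert("RGB")).unsqueeze(0).to(device)

# Zero-shot instructed generation: the text prompt steers what the model says about the image.
print(model.generate({"image": image, "prompt": "Question: which city is this? Answer:"}))
```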
Why is BLIP-2 effective? Previous methods (e.g. Flamingo) use an image-to-text generative loss. However, a generative loss alone is insufficient to bridge the modality gap. We instead train a Querying Transformer (Q-Former) in two learning stages: representation learning and generative learning.
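A rough, self-contained sketch of the shape of the idea (purely illustrative; the real Q-Former is a BERT-style transformer trained with additional image-text objectives): a small set of learnable queries cross-attends to frozen image features, and only those queries, projected to the LLM's embedding size, reach the frozen LLM as soft visual prompts.

```python
# Conceptual sketch only: how learnable queries bridge a frozen image encoder
# and a frozen LLM. All modules and dimensions below are stand-ins.
import torch
import torch.nn as nn

d_img, d_q, n_queries = 1024, 768, 32

# Stand-in for a frozen ViT image encoder (kept frozen in both stages).
frozen_image_encoder = nn.Linear(3 * 224 * 224, d_img).requires_grad_(False)

queries = nn.Parameter(torch.randn(1, n_queries, d_q))  # learnable query tokens
cross_attn = nn.MultiheadAttention(d_q, num_heads=8, kdim=d_img, vdim=d_img, batch_first=True)
to_llm = nn.Linear(d_q, 4096)                            # project to the frozen LLM's hidden size

images = torch.randn(2, 3 * 224 * 224)                   # toy batch of 2 flattened images
img_feats = frozen_image_encoder(images).unsqueeze(1)    # (2, 1, d_img) frozen visual features
q = queries.expand(2, -1, -1)                            # one set of queries per image

# Stage 1 (representation learning) trains this query-image interaction with
# image-text objectives; stage 2 (generative learning) feeds the projected
# queries to a frozen LLM and trains with a language-modeling loss.
visual_tokens, _ = cross_attn(q, img_feats, img_feats)   # (2, 32, d_q)
soft_prompts = to_llm(visual_tokens)                     # (2, 32, 4096) -> frozen LLM input
print(soft_prompts.shape)
```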