Introducing 🔥InstructBLIP🔥 - our new multimodal foundation models, built by instruction-tuning BLIP2. They achieve new SOTA results on various vision-language benchmarks and offer several advantages over GPT-4.


InstructBLIP unlocks a diverse range of multimodal capabilities for building next-generation AI agents, such as complex visual scene understanding and reasoning, knowledge-grounded image description, and multi-turn visual conversation.


Building on the success of #BLIP2, InstructBLIP proposes a general instruction-tuning framework in which the Q-Former extracts instruction-aware visual features from the output embeddings of a frozen image encoder and feeds them as soft-prompt input to a frozen LLM.
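
For readers who want to see the data flow, here is a minimal toy sketch in NumPy. It is not our actual implementation: the sizes, the random weights, the single attention head, and all function names are illustrative assumptions. It also assumes that the Q-Former's query tokens become instruction-aware by self-attending together with the instruction tokens before they cross-attend to the image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumptions, not the real model).
NUM_PATCHES, NUM_QUERIES, NUM_INSTR_TOKENS = 16, 4, 5
D_QFORMER, D_LLM = 8, 12


def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values


def q_former(image_embeds, instruction_embeds, query_tokens):
    """Hypothetical one-layer stand-in for the Q-Former."""
    # Self-attention over [queries; instruction] lets the instruction
    # shape what the queries will look for in the image.
    joint = np.concatenate([query_tokens, instruction_embeds], axis=0)
    joint = joint + attention(joint, joint, joint)
    queries = joint[: len(query_tokens)]
    # Cross-attention: queries read from the frozen image encoder's outputs.
    return queries + attention(queries, image_embeds, image_embeds)


def build_llm_input(image_embeds, instr_qformer, instr_llm, query_tokens, proj):
    """Prepend projected visual features to the LLM's text embeddings."""
    visual = q_former(image_embeds, instr_qformer, query_tokens)
    soft_prompt = visual @ proj  # map to the LLM embedding width
    return np.concatenate([soft_prompt, instr_llm], axis=0)


# Random stand-ins for the frozen image encoder output and the embeddings.
image_embeds = rng.normal(size=(NUM_PATCHES, D_QFORMER))
instr_qformer = rng.normal(size=(NUM_INSTR_TOKENS, D_QFORMER))
instr_llm = rng.normal(size=(NUM_INSTR_TOKENS, D_LLM))
query_tokens = rng.normal(size=(NUM_QUERIES, D_QFORMER))
proj = rng.normal(size=(D_QFORMER, D_LLM))

llm_input = build_llm_input(image_embeds, instr_qformer, instr_llm, query_tokens, proj)
print(llm_input.shape)  # (9, 12): 4 soft-prompt rows + 5 instruction rows
```

In this sketch only the query tokens, the Q-Former attention, and the projection would be trained; the image embeddings and the LLM that consumes `llm_input` are treated as fixed.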

#InstructBLIP consistently improves on our prior #BLIP2 models and significantly outperforms DeepMind’s much larger #Flamingo-80B on a variety of zero-shot vision-language benchmarks.


InstructBLIP aims to address the fundamental challenges in vision-language instruction tuning and presents a systematic study, using a comprehensive set of datasets and tasks, of how to improve the models’ generalization to unseen data and tasks.


Find out more from our paper:
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Great joint work w/ our intern @Wenliang_Dai and our amazing AI team @LiJunnan0409 @DongxuLi_ at @SFResearch and our collaborators.

