Introducing 🔥InstructBLIP🔥 - our new multimodal foundation models, built by instruction tuning on top of BLIP2, achieving new SOTA results on a wide range of vision-language (VL) benchmarks and offering several advantages over GPT-4.
Paper: arxiv.org/abs/2305.06500
Code: github.com/salesforce/LAV…
(1/n)
InstructBLIP unlocks a range of diverse multimodal capabilities for building next-generation AI agents, including complex visual scene understanding and reasoning, knowledge-grounded image description, and multi-turn visual conversation (usage sketch below).
(2/n)
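To make these capabilities concrete, here is a hypothetical usage sketch. It assumes InstructBLIP checkpoints are exposed through the LAVIS library under a registry name like "blip2_vicuna_instruct" with model type "vicuna7b"; check the repo for the exact names and available checkpoints.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an InstructBLIP model together with its matching image preprocessor.
# The name/model_type strings below are assumptions about the LAVIS registry.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b",
    is_eval=True, device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Free-form instruction following: the same model handles description,
# reasoning, and conversational follow-ups by changing the prompt.
answer = model.generate({
    "image": image,
    "prompt": "Describe this image in detail and explain what is unusual about it.",
})
print(answer)
```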
Built on the success of #BLIP2, InstructBLIP proposes a general instruction-tuning framework, where the Q-Former extracts instruction-aware visual features from the output embeddings of a frozen image encoder and feeds them as soft-prompt input to a frozen LLM (see the sketch below).
(3/n)
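Very roughly, the instruction-aware extraction can be pictured with the minimal PyTorch sketch below. The module names, dimensions, and the TransformerDecoder stand-in for the BERT-based Q-Former are illustrative assumptions, not the actual InstructBLIP implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Simplified sketch: learnable query tokens, conditioned on the
    instruction text, attend to the frozen image encoder's output
    embeddings; the resulting query outputs are projected and used as a
    soft prompt for a frozen LLM."""

    def __init__(self, num_queries=32, qformer_dim=768, image_dim=1408, llm_dim=4096):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, qformer_dim))
        self.image_proj = nn.Linear(image_dim, qformer_dim)
        # Stand-in for the BERT-style Q-Former: queries and instruction
        # tokens self-attend, then cross-attend to the image features.
        # (In the real model only the queries take part in cross-attention.)
        layer = nn.TransformerDecoderLayer(d_model=qformer_dim, nhead=12, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=12)
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, image_embeds, instruction_embeds):
        # image_embeds:       (B, N_img, image_dim) from the frozen image encoder
        # instruction_embeds: (B, N_txt, qformer_dim) embedded instruction tokens
        B = image_embeds.size(0)
        queries = self.query_tokens.expand(B, -1, -1)
        # Concatenating queries with instruction tokens lets the queries
        # condition on the instruction ("instruction-aware" extraction).
        tgt = torch.cat([queries, instruction_embeds], dim=1)
        memory = self.image_proj(image_embeds)
        out = self.qformer(tgt=tgt, memory=memory)
        # Keep only the query positions; project into the LLM embedding space.
        return self.llm_proj(out[:, : queries.size(1)])  # (B, num_queries, llm_dim)

# Shape-only check with dummy tensors (real features come from a frozen
# ViT encoder and the LLM's tokenizer/embedding table):
model = InstructionAwareQFormer()
soft_prompt = model(torch.randn(2, 257, 1408), torch.randn(2, 16, 768))
print(soft_prompt.shape)  # torch.Size([2, 32, 4096])
```

The soft prompt is then prepended to the embedded text input of the frozen LLM, so only the Q-Former and projection are trained during instruction tuning.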
#InstructBLIP consistently improves on our prior #BLIP2 models and significantly outperforms DeepMind's much larger #Flamingo-80B on a variety of zero-shot vision-language benchmarks.
(4/n)
InstructBLIP addresses the fundamental challenges of vision-language instruction tuning through a systematic study on a comprehensive set of datasets and tasks, improving the models' generalization to unseen data and tasks.
(5/n)
Find out more from our paper: arxiv.org/abs/2305.06500
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Great joint work w/ our intern @Wenliang_Dai and our amazing AI team @LiJunnan0409 @DongxuLi_ at @SFResearch, and our collaborators.
(6/n)