Introducing 🔥InstructBLIP🔥 - our new multimodal foundation models, built by instruction tuning on top of BLIP2, achieving new SOTA results on a wide range of vision-language (VL) benchmarks and offering several advantages over GPT-4.
Paper: arxiv.org/abs/2305.06500
Code: github.com/salesforce/LAV…
(1/n)
InstructBLIP unlocks a range of diverse multimodal capabilities for building next-generation AI agents, including complex visual scene understanding and reasoning, knowledge-grounded image description, and multi-turn visual conversation (usage sketch below).
(2/n)
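To make these capabilities concrete, here is a hypothetical usage sketch. It assumes InstructBLIP checkpoints are exposed through the LAVIS library under a registry name like "blip2_vicuna_instruct" with model type "vicuna7b"; check the repo for the exact names and available checkpoints.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an InstructBLIP model together with its matching image preprocessor.
# The name/model_type strings below are assumptions about the LAVIS registry.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b",
    is_eval=True, device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Free-form instruction following: the same model handles description,
# reasoning, and conversational follow-ups by changing the prompt.
answer = model.generate({
    "image": image,
    "prompt": "Describe this image in detail and explain what is unusual about it.",
})
print(answer)
```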
Built on the success of #BLIP2, InstructBLIP proposes a general instruction-tuning framework, where the Q-Former extracts instruction-aware visual features from the output embeddings of a frozen image encoder and feeds them as soft-prompt input to a frozen LLM (see the sketch below).
(3/n)
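Very roughly, the instruction-aware extraction can be pictured with the minimal PyTorch sketch below. The module names, dimensions, and the TransformerDecoder stand-in for the BERT-based Q-Former are illustrative assumptions, not the actual InstructBLIP implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Simplified sketch: learnable query tokens, conditioned on the
    instruction text, attend to the frozen image encoder's output
    embeddings; the resulting query outputs are projected and used as a
    soft prompt for a frozen LLM."""

    def __init__(self, num_queries=32, qformer_dim=768, image_dim=1408, llm_dim=4096):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, qformer_dim))
        self.image_proj = nn.Linear(image_dim, qformer_dim)
        # Stand-in for the BERT-style Q-Former: queries and instruction
        # tokens self-attend, then cross-attend to the image features.
        # (In the real model only the queries take part in cross-attention.)
        layer = nn.TransformerDecoderLayer(d_model=qformer_dim, nhead=12, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=12)
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, image_embeds, instruction_embeds):
        # image_embeds:       (B, N_img, image_dim) from the frozen image encoder
        # instruction_embeds: (B, N_txt, qformer_dim) embedded instruction tokens
        B = image_embeds.size(0)
        queries = self.query_tokens.expand(B, -1, -1)
        # Concatenating queries with instruction tokens lets the queries
        # condition on the instruction ("instruction-aware" extraction).
        tgt = torch.cat([queries, instruction_embeds], dim=1)
        memory = self.image_proj(image_embeds)
        out = self.qformer(tgt=tgt, memory=memory)
        # Keep only the query positions; project into the LLM embedding space.
        return self.llm_proj(out[:, : queries.size(1)])  # (B, num_queries, llm_dim)

# Shape-only check with dummy tensors (real features come from a frozen
# ViT encoder and the LLM's tokenizer/embedding table):
model = InstructionAwareQFormer()
soft_prompt = model(torch.randn(2, 257, 1408), torch.randn(2, 16, 768))
print(soft_prompt.shape)  # torch.Size([2, 32, 4096])
```

The soft prompt is then prepended to the embedded text input of the frozen LLM, so only the Q-Former and projection are trained during instruction tuning.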
#InstructBLIP consistently improves on our prior #BLIP2 models and significantly outperforms DeepMind's much larger #Flamingo-80B on a variety of zero-shot vision-language benchmarks.
(4/n)
InstructBLIP addresses the fundamental challenges of vision-language instruction tuning through a systematic study on a comprehensive set of datasets and tasks, improving the models' generalization to unseen data and tasks.
(5/n)
Find out more from our paper: arxiv.org/abs/2305.06500
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Great joint work w/ our intern @Wenliang_Dai and our amazing AI team @LiJunnan0409 @DongxuLi_ at @SFResearch, and our collaborators.
(6/n)