Li Junnan
Vision and Language, Code LLMs, Self-supervised/Weakly-supervised Learning, at Salesforce Research.
Oct 13 · 6 tweets · 1 min read
A deeper dive into the creation process of Aria at @rhymes_ai_:
🧵 (1/6)
Aria’s training involves a four-stage pipeline. Each stage is designed to progressively enhance certain capabilities while maintaining those acquired in earlier stages. 🧵 (2/6)
To create a multimodal native model, the four stages are: language pre-training, multimodal pre-training, multimodal long-context pre-training, and multimodal post-training. All stages except the first involve a mixture of text and multimodal data, as sketched below.
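As a rough illustration of how such a staged curriculum could be organized, here is a minimal Python sketch. The stage names come from the thread, but every data-mixture ratio, field name, and context length below is a hypothetical placeholder, not Aria's actual configuration:

```python
# Illustrative only: stage names follow the thread; ratios, context lengths,
# and goals are hypothetical placeholders, not Aria's real training config.
ARIA_TRAINING_STAGES = [
    {
        "name": "language_pretraining",
        "data_mix": {"text": 1.0},                     # text-only stage
        "goal": "build base language capability",
    },
    {
        "name": "multimodal_pretraining",
        "data_mix": {"text": 0.5, "multimodal": 0.5},  # mixture preserves language skills
        "goal": "learn vision-language alignment",
    },
    {
        "name": "multimodal_long_context_pretraining",
        "data_mix": {"text": 0.5, "multimodal": 0.5},
        "context_length": 65536,                       # extended window (placeholder value)
        "goal": "handle long documents and videos",
    },
    {
        "name": "multimodal_post_training",
        "data_mix": {"text": 0.5, "multimodal": 0.5},
        "goal": "instruction following and alignment",
    },
]

# Each stage builds on checkpoints from the previous one.
for stage in ARIA_TRAINING_STAGES:
    print(stage["name"], "->", stage["goal"])
```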
May 12, 2023 · 8 tweets · 4 min read
A new member in the BLIP family: 🔥InstructBLIP🔥, a vision-language instruction tuning framework. InstructBLIP achieves SoTA zero-shot performance with various advantages over other multimodal models such as GPT-4!
Github: github.com/salesforce/LAV…
Paper: arxiv.org/abs/2305.06500
Our paper conducts a systematic study of vision-language instruction tuning. InstructBLIP substantially outperforms both BLIP-2 and the largest Flamingo on zero-shot evaluation, and it achieves state-of-the-art finetuning performance when used as the initialization for downstream tasks.
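For readers who want to try it, here is a minimal inference sketch using the Hugging Face transformers integration of InstructBLIP. The class names and model id are as I recall them from the public release, and the image URL and prompt are placeholders; check the LAVIS repo and the model card for the authoritative usage:

```python
# Minimal InstructBLIP inference sketch (Hugging Face transformers integration).
# Class names and the model id reflect the public release as I recall it;
# verify against the LAVIS repo / model card before relying on them.
import torch
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to(device)

# Any RGB image works; this URL is only a placeholder example.
url = "https://raw.githubusercontent.com/salesforce/LAVIS/main/docs/_static/merlion.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The instruction is free-form natural language, not a fixed template.
prompt = "What is unusual about this image? Explain briefly."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```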
Feb 3, 2023 · 9 tweets · 3 min read
🔥BLIP-2🔥 demo is live! Come play with LLMs that can understand images and share your examples!
huggingface.co/spaces/Salesfo…
Project page: github.com/salesforce/LAV…
BLIP-2 knows mass–energy equivalence! More examples in the 🧵
BLIP-2 knows the landmarks of Singapore
Jan 31, 2023 · 4 tweets · 2 min read
Can LLMs understand images? We introduce 🔥BLIP-2🔥, a generic and efficient vision-language pre-training strategy that bootstraps from frozen❄️image encoders and frozen❄️LLMs. BLIP-2 outperforms existing SoTAs with only 188M trainable parameters!
Github: github.com/salesforce/LAV…
BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new SoTA on zero-shot captioning (121.6 CIDEr vs the previous best 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks zero-shot instructed vision-to-language generation capabilities!
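Here is a minimal sketch of running BLIP-2 through the LAVIS library linked above. The model name, type, and generate-call format follow the LAVIS model zoo as I recall it, and the image path and prompt are placeholders; verify against the repo's examples:

```python
# Minimal BLIP-2 inference sketch via the LAVIS library (pip install salesforce-lavis).
# Model name/type follow the LAVIS model zoo as I recall it; the image path and
# prompt are placeholders -- check the repo's examples for authoritative usage.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loads a BLIP-2 checkpoint that pairs a frozen ViT with a frozen OPT-2.7B LLM.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("photo.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Instructed generation: the prompt steers the frozen LLM's output.
answer = model.generate({"image": image, "prompt": "Question: which city is this? Answer:"})
print(answer)  # generate returns a list of generated strings
```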