Vision and Language, Code LLMs, Self-supervised/Weakly-supervised Learning, at Salesforce Research.
Oct 13 • 6 tweets • 1 min read
A deeper dive into the creation process of Aria at @rhymes_ai_:
🧵 (1/6)
Aria’s training involves a four-stage pipeline. Each stage is designed to progressively strengthen specific capabilities while preserving those acquired in earlier stages.
🧵 (2/6)
To create a multimodal native model, the four stages are: language pre-training, multimodal pre-training, multimodal long-context pre-training, and multimodal post-training. All stages except the first use a mixture of text and multimodal data.
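The thread doesn't spell out data mixtures or hyperparameters, so the sketch below is purely illustrative: it just lays out the four stages as a Python config. The stage names follow the thread; the data-mixture fields, `train_stage()`, and `init_model()` are hypothetical placeholders.

```python
# Illustrative sketch of the four-stage pipeline described above.
# Stage names follow the thread; everything else here is hypothetical.
STAGES = [
    {"name": "language_pretraining",                "data": ["text"]},
    {"name": "multimodal_pretraining",              "data": ["text", "multimodal"]},
    {"name": "multimodal_long_context_pretraining", "data": ["text", "multimodal"]},
    {"name": "multimodal_post_training",            "data": ["text", "multimodal"]},
]

def train_stage(model, stage):
    """Hypothetical helper: continue training `model` on the stage's data mixture,
    starting from the checkpoint produced by the previous stage."""
    ...

# model = init_model()                      # hypothetical initializer
# for stage in STAGES:
#     model = train_stage(model, stage)     # each stage builds on the last
```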
May 12, 2023 • 8 tweets • 4 min read
A new member in the BLIP family: 🔥InstructBLIP🔥, a vision-language instruction tuning framework. InstructBLIP achieves SoTA zero-shot performance with various advantages over other multimodal models such as GPT-4!
Github: github.com/salesforce/LAV…
Paper: arxiv.org/abs/2305.06500
Our paper conducts a systematic study on vision-language instruction tuning. InstructBLIP substantially outperforms both BLIP-2 and the largest Flamingo on zero-shot evaluation. It also achieves SoTA finetuning performance when used as the initialization for downstream tasks.
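To try InstructBLIP zero-shot on your own images, a minimal sketch using the Hugging Face transformers integration looks roughly like this (the checkpoint name, image URL, and generation settings are assumptions, not from the thread; the LAVIS repo linked above exposes an equivalent API):

```python
import requests
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Assumed checkpoint name; other InstructBLIP variants (e.g. Flan-T5 based) follow the same API.
ckpt = "Salesforce/instructblip-vicuna-7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained(ckpt)
model = InstructBlipForConditionalGeneration.from_pretrained(ckpt).to(device)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"

# Tokenize the instruction and preprocess the image, then generate an answer.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```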
Feb 3, 2023 • 9 tweets • 3 min read
🔥BLIP-2🔥 demo is live! Come play with LLMs that can understand images and share your examples! huggingface.co/spaces/Salesfo…
Project page: github.com/salesforce/LAV…
BLIP-2 knows mass–energy equivalence! More examples in the 🧵
BLIP-2 knows the landmarks of Singapore
Jan 31, 2023 • 4 tweets • 2 min read
Can LLMs understand images? We introduce 🔥BLIP-2🔥, a generic and efficient vision-language pre-training strategy that bootstraps from frozen❄️image encoders and frozen❄️LLMs. BLIP-2 outperforms existing SoTAs with only 188M trainable parameters!
Github: github.com/salesforce/LAV…
BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new SoTA on zero-shot captioning (121.6 CIDEr vs the previous best of 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks zero-shot instructed vision-to-language generation capabilities!
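For reference, here's a minimal sketch of zero-shot captioning and instructed generation with BLIP-2 via the Hugging Face transformers integration (the OPT-2.7B checkpoint name, image URL, and prompt are assumptions; the LAVIS repo linked above provides the same models through its own loaders):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint; FlanT5-based variants use the same classes.
ckpt = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt).to(device)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw).convert("RGB")

# Plain captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(device)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Instructed generation: prepend a question for the frozen LLM to answer.
inputs = processor(images=image, text="Question: which city is this? Answer:", return_tensors="pt").to(device)
answer_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```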