Vision and Language, Code LLMs, Self-supervised/Weakly-supervised Learning, at Salesforce Research.
Oct 13 • 6 tweets • 1 min read
A deeper dive into the creation process of Aria at @rhymes_ai_:
🧵 (1/6)
Aria’s training involves a four-stage pipeline. Each stage is designed to progressively strengthen specific capabilities while preserving those acquired in earlier stages.
🧵 (2/6)
To create a multimodal native model, the four stages are: language pre-training, multimodal pre-training, multimodal long-context pre-training, and multimodal post-training. All stages except the first use a mixture of text and multimodal data.
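The thread doesn't spell out data mixtures or hyperparameters, so the sketch below is purely illustrative: it just lays out the four stages as a Python config. The stage names follow the thread; the data-mixture fields, `train_stage()`, and `init_model()` are hypothetical placeholders.

```python
# Illustrative sketch of the four-stage pipeline described above.
# Stage names follow the thread; everything else here is hypothetical.
STAGES = [
    {"name": "language_pretraining",                "data": ["text"]},
    {"name": "multimodal_pretraining",              "data": ["text", "multimodal"]},
    {"name": "multimodal_long_context_pretraining", "data": ["text", "multimodal"]},
    {"name": "multimodal_post_training",            "data": ["text", "multimodal"]},
]

def train_stage(model, stage):
    """Hypothetical helper: continue training `model` on the stage's data mixture,
    starting from the checkpoint produced by the previous stage."""
    ...

# model = init_model()                      # hypothetical initializer
# for stage in STAGES:
#     model = train_stage(model, stage)     # each stage builds on the last
```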
May 12, 2023 • 8 tweets • 4 min read
A new member in the BLIP family: 🔥InstructBLIP🔥, a vision-language instruction tuning framework. InstructBLIP achieves SoTA zero-shot performance with various advantages over other multimodal models such as GPT-4!
Github: github.com/salesforce/LAV…
Paper: arxiv.org/abs/2305.06500
Our paper conducts a systematic study on vision-language instruction tuning. InstructBLIP substantially outperforms both BLIP-2 and the largest Flamingo on zero-shot evaluation. It also achieves SoTA finetuning performance when used as the initialization for downstream tasks.
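To try InstructBLIP zero-shot on your own images, a minimal sketch using the Hugging Face transformers integration looks roughly like this (the checkpoint name, image URL, and generation settings are assumptions, not from the thread; the LAVIS repo linked above exposes an equivalent API):

```python
import requests
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Assumed checkpoint name; other InstructBLIP variants (e.g. Flan-T5 based) follow the same API.
ckpt = "Salesforce/instructblip-vicuna-7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained(ckpt)
model = InstructBlipForConditionalGeneration.from_pretrained(ckpt).to(device)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"

# Tokenize the instruction and preprocess the image, then generate an answer.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```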
Feb 3, 2023 • 9 tweets • 3 min read
🔥BLIP-2🔥 demo is live! Come play with LLMs that can understand images and share your examples! huggingface.co/spaces/Salesfo…
Project page: github.com/salesforce/LAV…
BLIP-2 knows mass–energy equivalence! More examples in the 🧵
BLIP-2 knows the landmarks of Singapore
Jan 31, 2023 • 4 tweets • 2 min read
Can LLMs understand images? We introduce 🔥BLIP-2🔥, a generic and efficient vision-language pre-training strategy that bootstraps from frozen❄️image encoders and frozen❄️LLMs. BLIP-2 outperforms existing SoTAs with only 188M trainable parameters!
Github: github.com/salesforce/LAV…
BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new SoTA on zero-shot captioning (121.6 CIDEr vs the previous best of 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks zero-shot instructed vision-to-language generation capabilities!
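For reference, here's a minimal sketch of zero-shot captioning and instructed generation with BLIP-2 via the Hugging Face transformers integration (the OPT-2.7B checkpoint name, image URL, and prompt are assumptions; the LAVIS repo linked above provides the same models through its own loaders):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint; FlanT5-based variants use the same classes.
ckpt = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt).to(device)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw).convert("RGB")

# Plain captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(device)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Instructed generation: prepend a question for the frozen LLM to answer.
inputs = processor(images=image, text="Question: which city is this? Answer:", return_tensors="pt").to(device)
answer_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```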