Steven Hoi · May 12, 2023
Introducing 🔥InstructBLIP🔥 - our new multimodal foundation models, built by instruction tuning on BLIP-2, achieving new SOTA results on a range of vision-language benchmarks and offering several advantages over GPT-4.

Paper: arxiv.org/abs/2305.06500
Code: github.com/salesforce/LAV…
(1/n)
InstructBLIP unlocks a range of diverse multimodal capabilities for building next-generation AI agents, including complex visual scene understanding and reasoning, knowledge-grounded image description, multi-turn visual conversation, etc.

(2/n)
Built on the success of #BLIP2, InstructBLIP proposes a general instruction-tuning framework in which the Q-Former extracts instruction-aware visual features from the output embeddings of a frozen image encoder and feeds those features as a soft prompt to the frozen LLM.
(3/n)
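For intuition, here is a minimal PyTorch-style sketch of that data flow - a toy stand-in, not the LAVIS implementation; all module names, dimensions, and the single attention layer are placeholders:

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Toy stand-in for the Q-Former: learnable queries cross-attend to
    frozen image features together with the embedded instruction, so the
    extracted visual features depend on the instruction."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats, instruction_embeds):
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Concatenate instruction tokens with frozen image features as context.
        context = torch.cat([instruction_embeds, image_feats], dim=1)
        out, _ = self.cross_attn(q, context, context)
        return out  # instruction-aware visual features

# Frozen components (placeholders for a ViT image encoder and an LLM).
image_encoder = nn.Identity()            # pretend: frozen ViT -> (B, N, 768)
llm_input_proj = nn.Linear(768, 4096)    # project to the frozen LLM's hidden size

qformer = InstructionAwareQFormer()
image_feats = image_encoder(torch.randn(1, 257, 768))     # frozen encoder output
instruction_embeds = torch.randn(1, 16, 768)               # embedded instruction
visual_tokens = qformer(image_feats, instruction_embeds)   # (1, 32, 768)
soft_prompt = llm_input_proj(visual_tokens)                # prepended to the frozen LLM's input
```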
#InstructBLIP consistently improves over our prior #BLIP2 models and significantly outperforms DeepMind’s much larger #Flamingo-80B on a variety of zero-shot vision-and-language benchmarks.

(4/n)
InstructBLIP addresses fundamental challenges in vision-language instruction tuning through a systematic study on a comprehensive set of datasets and tasks, improving the models’ generalization to unseen data and tasks.

(5/n)
Find out more from our paper: arxiv.org/abs/2305.06500
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Great joint work w/ our intern @Wenliang_Dai and our amazing AI team @LiJunnan0409 @DongxuLi_ at @SFResearch and our collaborators.
(6/n)

More from @stevenhoi

Jun 2, 2023
📢Introducing 🔥#CodeTF🔥, a one-stop Transformer library for Code Large Language Models (CodeLLM), with a unified interface for training & inference on code tasks (code generation, summarization, translation, etc.)

Paper: arxiv.org/abs/2306.00029
Code: github.com/salesforce/Cod…

(1/n)
The CodeTF library supports both development and deployment of Code LLMs for code intelligence tasks: it provides training and serving of code LLMs, code utilities for processing code data, and popular research benchmarks for evaluating model performance.
(2/n)
CodeTF is designed around key principles to provide a user-friendly and easy-to-use platform for code intelligence tasks. It follows a modular architecture, enhancing extensibility by allowing seamless integration of additional programming languages, models, and utilities.

(3/n)
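The CodeTF API itself isn’t shown in this thread; as a rough illustration of the kind of task such a unified pipeline wraps, here is a minimal sketch using plain Hugging Face transformers for code summarization (this does not use CodeTF, and the checkpoint name is an assumption):

```python
# Illustration only: plain Hugging Face transformers, not the CodeTF API.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint: a CodeT5 model fine-tuned for multilingual code summarization.
checkpoint = "Salesforce/codet5-base-multi-sum"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```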
May 25, 2023
Introducing 🔥BLIP-Diffusion🔥, a novel method that equips text-to-image diffusion models with multimodal controllable generation and editing, powered by BLIP-2’s pretrained, text-aligned subject representation.

Paper: arxiv.org/abs/2305.14720
Project: dxli94.github.io/BLIP-Diffusion…

(1/n)
BLIP-Diffusion learns a pretrained subject representation that unlocks a range of zero-shot and few-step-tuned image generation and editing capabilities, e.g., subject-driven generation, zero-shot subject-driven image manipulation, controllable subject-driven image editing, etc.

(2/n)
Two-stage pretraining strategy: 1) multimodal representation learning with BLIP-2 produces text-aligned visual features for an input image; 2) subject representation learning trains the diffusion model to use those BLIP-2 features to generate novel subject renditions.

(3/n)
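For intuition, a toy PyTorch sketch of the stage-2 objective, with stand-in modules in place of the real BLIP-2 subject encoder, text encoder, and denoising U-Net - all names, shapes, and the noising step below are simplified placeholders, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_queries, num_timesteps = 768, 16, 1000

# Stage 1 stand-in: a BLIP-2-style module that maps a subject image to
# text-aligned subject embeddings (frozen toy module here).
subject_encoder = nn.Linear(3 * 64 * 64, num_queries * dim)

# Stage 2 stand-ins: the diffusion model's text encoder and denoiser.
text_encoder = nn.Embedding(1000, dim)   # toy prompt-token embedder
denoiser = nn.Linear(dim, dim)           # toy stand-in for the U-Net

def stage2_step(prompt_ids, subject_image, target_latents):
    """One training step of subject representation learning (stage 2):
    condition the denoiser on prompt tokens plus subject embeddings and
    minimize a noise-prediction loss."""
    b = subject_image.size(0)
    subj = subject_encoder(subject_image.flatten(1)).view(b, num_queries, dim)
    prompt = text_encoder(prompt_ids)                          # (B, T, dim)
    cond = torch.cat([prompt, subj], dim=1)                    # inject the subject

    noise = torch.randn_like(target_latents)
    t = torch.randint(0, num_timesteps, (b,))
    noisy = target_latents + noise * (t.float().view(-1, 1, 1) / num_timesteps)  # toy noising
    pred = denoiser(noisy + cond.mean(dim=1, keepdim=True))    # toy conditioning
    return F.mse_loss(pred, noise)

loss = stage2_step(torch.randint(0, 1000, (2, 8)),
                   torch.randn(2, 3, 64, 64),
                   torch.randn(2, 4, dim))
loss.backward()
```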
May 16, 2023
Introducing 🔥CodeT5+🔥, a new family of open-source code LLMs for both code understanding and generation, achieving new SoTA code generation performance on HumanEval and surpassing all open-source code LLMs.

Paper: arxiv.org/pdf/2305.07922…
Code: github.com/salesforce/Cod…

(1/n)
CodeT5+ adopts a flexible encoder-decoder architecture trained with a mixture of pretraining tasks, which lets it operate in different modes (i.e., encoder-only, decoder-only, and encoder-decoder) for a wide range of code understanding and generation tasks.

(2/n)
The family of CodeT5+ models is trained on permissively licensed code, with sizes ranging from 220M to 16B parameters, and can be initialized from frozen off-the-shelf LLMs (e.g., CodeGen or other GPT-style models) to train large models efficiently at much lower compute cost.

(3/n)
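For reference, the smaller CodeT5+ checkpoints can be loaded with standard Hugging Face tooling; a minimal generation sketch (the checkpoint name and loading details are assumptions, not taken from this thread, and the larger checkpoints may need their own loading code):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint name for the smallest (220M) CodeT5+ model.
checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

prompt = "def print_hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```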