Introducing 🔥InstructBLIP🔥 - our new multimodal foundation models, built by instruction-tuning BLIP-2, achieving new SOTA results on a wide range of vision-language benchmarks and offering several advantages over GPT-4.
InstructBLIP unlocks a diverse range of multimodal capabilities for building next-generation AI agents, including complex visual scene understanding and reasoning, knowledge-grounded image description, multi-turn visual conversation, etc.
(2/n)
Built on the success of #BLIP2, InstructBLIP proposes a general instruction-tuning framework in which the Q-Former extracts instruction-aware visual features from the output embeddings of a frozen image encoder and feeds them as soft-prompt input to a frozen LLM (sketch below).
(3/n)
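A minimal conceptual sketch of that instruction-aware soft-prompting step, assuming placeholder module names, dimensions, and a simplified attention layout - this is illustrative only, not the LAVIS implementation:

```python
# Conceptual sketch of InstructBLIP's forward pass (illustrative assumptions,
# not the released code): queries + instruction tokens attend to frozen image
# features, and the result is projected into the frozen LLM's embedding space.
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Stand-in for the Q-Former: learnable queries extract visual features
    conditioned on the tokenized instruction."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_embeds, instruction_embeds):
        b = image_embeds.size(0)
        q = self.queries.expand(b, -1, -1)
        # Instruction tokens interact with the queries via self-attention,
        # which makes the extracted visual features instruction-aware.
        qi = torch.cat([q, instruction_embeds], dim=1)
        qi, _ = self.self_attn(qi, qi, qi)
        q = qi[:, : q.size(1)]
        # Queries cross-attend to the frozen image encoder's output embeddings.
        q, _ = self.cross_attn(q, image_embeds, image_embeds)
        return q  # instruction-aware visual features

# Frozen image encoder output (e.g., ViT patch embeddings) and instruction tokens.
image_embeds = torch.randn(2, 257, 768)       # [batch, patches, dim]
instruction_embeds = torch.randn(2, 16, 768)  # [batch, instruction tokens, dim]

qformer = InstructionAwareQFormer()
proj_to_llm = nn.Linear(768, 4096)            # project to the frozen LLM's hidden size

visual_soft_prompt = proj_to_llm(qformer(image_embeds, instruction_embeds))
# visual_soft_prompt ([2, 32, 4096]) is prepended to the LLM's input embeddings
# of the instruction; only the Q-Former and projection are trained.
print(visual_soft_prompt.shape)
```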
#InstructBLIP consistently improves over our prior #BLIP2 models and significantly outperforms DeepMind's much larger #Flamingo-80B on a variety of zero-shot vision-language benchmarks.
(4/n)
InstructBLIP addresses the fundamental challenges of vision-language instruction tuning and conducts a systematic study across a comprehensive set of datasets and tasks to improve the models' generalization to unseen data and tasks.
(5/n)
Find out more from our paper: arxiv.org/abs/2305.06500
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
📢Introducing 🔥#CodeTF🔥, a one-stop Transformer library for Code Large Language Models (CodeLLMs), with a unified interface for training & inference on code tasks (code generation, summarization, translation, etc.)
The CodeTF library supports both the development and deployment of Code LLMs for code intelligence tasks: it provides training and serving of Code LLMs, code utilities for processing code data, and popular research benchmarks for evaluating model performance.
(2/n)
CodeTF is designed to provide a user-friendly platform for code intelligence tasks. It follows a modular architecture that enhances extensibility, allowing seamless integration of additional programming languages, models & utilities.
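A hypothetical usage sketch of what a unified CodeLLM interface in the spirit of CodeTF looks like - the entry point, parameters, and model identifiers below are assumptions for illustration, not a confirmed API; please check the CodeTF README for the actual usage:

```python
# Hypothetical sketch (names and parameters are assumptions, not the verified
# CodeTF API): one interface to load and run a code LLM across code tasks.
from codetf.models import load_model_pipeline  # assumed entry point

# Load a pretrained code LLM behind one unified interface.
model = load_model_pipeline(
    model_name="codet5",      # assumed model identifier
    model_type="plus-220M",   # assumed size/variant tag
    task="pretrained",
    is_eval=True,
)

# The same pipeline object serves inference across code tasks
# (generation, summarization, translation, ...).
result = model.predict(["def bubble_sort(arr):"])
print(result)
```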
BLIP-Diffusion learns a pre-trained subject representation that unlocks a range of zero-shot and few-step-tuned image generation and editing capabilities, e.g., subject-driven generation, zero-shot subject-driven image manipulation, controllable subject-driven image editing, etc.
(2/n)
Two-stage pretraining strategy: 1) multimodal representation learning with BLIP-2 produces text-aligned visual features for an input image; 2) subject representation learning trains the diffusion model to use the BLIP-2 features to generate novel subject renditions.
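A minimal conceptual sketch of how the two stages connect, assuming stand-in modules and shapes (not the released BLIP-Diffusion code): the subject features from stage 1 are appended to the prompt embeddings that condition the diffusion denoiser in stage 2.

```python
# Conceptual sketch of BLIP-Diffusion's two-stage conditioning (illustrative
# only; module names and shapes are assumptions, not the released code).
import torch
import torch.nn as nn

dim_txt, dim_vis = 768, 768

# Stage 1 (frozen after pretraining): BLIP-2 produces text-aligned visual
# features for the subject image -- stand-in modules below.
blip2_subject_encoder = nn.Linear(dim_vis, dim_txt)   # subject image -> text-aligned features
text_encoder = nn.Embedding(50000, dim_txt)           # prompt tokens -> text embeddings

# Stage 2: the diffusion model's text conditioning is the prompt embeddings
# concatenated with the subject features, so the denoiser learns to render
# the specific subject they describe.
subject_feats = blip2_subject_encoder(torch.randn(1, 32, dim_vis))  # [1, 32, 768]
prompt_embeds = text_encoder(torch.randint(0, 50000, (1, 20)))      # [1, 20, 768]
conditioning = torch.cat([prompt_embeds, subject_feats], dim=1)     # [1, 52, 768]

# A denoising U-Net would be trained on `conditioning` with the usual diffusion
# loss; at inference, swapping the prompt while keeping subject_feats gives
# zero-shot subject-driven generation.
print(conditioning.shape)
```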
Introducing 🔥CodeT5+🔥, a new family of open-source code LLMs for both code understanding and generation, achieving new SoTA code generation performance on HumanEval and surpassing all open-source code LLMs.
CodeT5+ proposes a flexible encoder-decoder architecture trained with a mixture of pretraining tasks, allowing it to operate in different modes (i.e., encoder-only, decoder-only, and encoder-decoder) for a wide range of code understanding and generation tasks.
(2/n)
The CodeT5+ family is trained on permissively licensed code, with model sizes ranging from 220M to 16B parameters, and can be initialized from frozen off-the-shelf LLMs (e.g., CodeGen or any GPT-style model) to train large models efficiently and save substantial compute (inference sketch below).
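A short sketch of running a small CodeT5+ checkpoint for code generation via Hugging Face transformers - the checkpoint name Salesforce/codet5p-220m and the sentinel-style prompt are assumptions; see the CodeT5+ repo and model cards for the recommended usage:

```python
# Minimal sketch (checkpoint name and prompt format are assumptions): load a
# CodeT5+ checkpoint as a seq2seq model and generate a completion.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-220m"   # smallest member of the family
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Span-denoising-style prompt: ask the model to fill in the function body.
inputs = tokenizer("def hello_world():<extra_id_0>", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```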