Jack Merullo
NLP PhD Student @BrownUniversity
Oct 3, 2022
Are language models (LMs) good models of the visual world? We show that without explicit grounding, LMs can directly use linear projections of image representations as soft prompts for vision-language (VL) tasks. This can be done without tuning the LM or the image encoder! But we also find that performance on VL tasks depends on how much linguistic supervision the image encoder received during pretraining: e.g., CLIP is pretrained with full language descriptions, NF-ResNet is trained with lexical category information (ImageNet labels), and BEiT is vision-only.
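A minimal sketch of the idea, not the paper's exact code: a single trained linear layer maps a frozen image encoder's features into the frozen LM's embedding space, and the result is prepended as soft prompt tokens. Names, shapes, and the prompt length below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearImagePrompt(nn.Module):
    """Sketch: project frozen image features into the LM's embedding space
    and treat them as soft prompt tokens. Only this linear map is trained;
    the image encoder and the LM stay frozen. (Illustrative, not the
    authors' released implementation.)"""

    def __init__(self, image_dim: int, lm_dim: int, num_prompt_tokens: int = 4):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        # One linear map: pooled image feature -> k soft-prompt embeddings
        self.proj = nn.Linear(image_dim, lm_dim * num_prompt_tokens)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, image_dim), e.g. pooled CLIP / NF-ResNet / BEiT output
        batch = image_features.shape[0]
        prompts = self.proj(image_features)              # (batch, lm_dim * k)
        return prompts.view(batch, self.num_prompt_tokens, -1)  # (batch, k, lm_dim)

# Usage sketch with a causal LM that accepts input embeddings (assumed API):
# prompt_embeds = LinearImagePrompt(image_dim=1024, lm_dim=lm_hidden_size)(img_feats)
# text_embeds = lm.get_input_embeddings()(text_ids)
# inputs_embeds = torch.cat([prompt_embeds, text_embeds], dim=1)
# out = lm(inputs_embeds=inputs_embeds)
```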