Jack Merullo
NLP PhD Student @BrownUniversity
Oct 3, 2022
Are language models (LMs) good models of the visual world? We show that without explicit grounding, LMs can directly use linear projections of image representations as soft prompts for vision-language (VL) tasks. This can be done without tuning the LM or the image encoder! But we also find that performance on VL tasks depends on how much linguistic supervision the image encoder received during pretraining: e.g., CLIP is pretrained with full language descriptions, NF-ResNet is trained with lexical category information (ImageNet labels), and BEiT is vision-only.
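A minimal sketch of the idea, not the paper's exact code: a single trained linear layer maps a frozen image encoder's features into the frozen LM's embedding space, and the result is prepended as soft prompt tokens. Names, shapes, and the prompt length below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearImagePrompt(nn.Module):
    """Sketch: project frozen image features into the LM's embedding space
    and treat them as soft prompt tokens. Only this linear map is trained;
    the image encoder and the LM stay frozen. (Illustrative, not the
    authors' released implementation.)"""

    def __init__(self, image_dim: int, lm_dim: int, num_prompt_tokens: int = 4):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        # One linear map: pooled image feature -> k soft-prompt embeddings
        self.proj = nn.Linear(image_dim, lm_dim * num_prompt_tokens)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, image_dim), e.g. pooled CLIP / NF-ResNet / BEiT output
        batch = image_features.shape[0]
        prompts = self.proj(image_features)              # (batch, lm_dim * k)
        return prompts.view(batch, self.num_prompt_tokens, -1)  # (batch, k, lm_dim)

# Usage sketch with a causal LM that accepts input embeddings (assumed API):
# prompt_embeds = LinearImagePrompt(image_dim=1024, lm_dim=lm_hidden_size)(img_feats)
# text_embeds = lm.get_input_embeddings()(text_ids)
# inputs_embeds = torch.cat([prompt_embeds, text_embeds], dim=1)
# out = lm(inputs_embeds=inputs_embeds)
```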