. @OpenAI's ImageGPT is one of the first transformer architectures applied to computer vision. 1/4
In language, unsupervised learning algorithms that rely on word prediction (like GPT-2 and BERT) are extremely successful.
One possible reason for this success is that instances of downstream language tasks appear naturally in the text.
2/4
In contrast, sequences of pixels do not clearly contain labels for the images they belong to.
However, OpenAI believes that sufficiently large transformer models:
- can be applied to 2D image analysis
- can learn strong representations of a dataset
3/4
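The core trick behind such models is framing images like text: quantize pixels into discrete tokens, flatten the 2D grid into a 1D sequence, and train on next-token prediction. A toy sketch (function names, bin count, and image size here are illustrative, not ImageGPT's actual code):

```python
import numpy as np

def image_to_sequence(image, n_bins=16):
    """Quantize pixel intensities into discrete tokens and flatten
    the 2D grid into a 1D sequence in raster-scan order.
    (Illustrative only; real models use learned palettes and
    far larger images.)"""
    tokens = np.clip((image * n_bins).astype(int), 0, n_bins - 1)
    return tokens.flatten()

def next_pixel_pairs(seq):
    """Build (context, target) training pairs: predict each token
    from all tokens before it, exactly like next-word prediction."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

img = np.array([[0.0, 0.5],
                [0.5, 1.0]])          # tiny 2x2 toy "image"
seq = image_to_sequence(img)          # e.g. [0, 8, 8, 15]
pairs = next_pixel_pairs(seq)         # 3 (context, target) pairs
```

A transformer trained on such pairs never sees a label, yet must model the image's structure to predict the next pixel, which is where the learned representations come from.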
Examples:
1) DINO discovers and segments objects in an image or a video with absolutely no supervision
2) HuBERT learns both acoustic and language models from continuous inputs
4/4