What do Vision Transformers learn? How do they encode anything useful for image recognition? In our latest work, we reimplement a number of existing works in this area & investigate various ViT model families (DeiT, DINO, the original ViT, etc.).
We’ve used the following methods for our analysis:
* Attention rollout
* Classic heatmap of the attention weights
* Mean attention distance
* Viz of the positional embeddings & linear projections
We hope our work turns out to be a useful resource for those studying ViTs.
3/
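For readers new to attention rollout, here is a minimal NumPy sketch of the idea: average the heads in each layer, add the identity to account for residual connections, renormalize, and multiply the layers together. It is an illustration only, not the code from our experiments. The function name, the head-averaging choice, and the toy random attention maps are all assumptions.

```python
import numpy as np

def attention_rollout(attentions):
    """Hypothetical helper. `attentions` is a list with one array per layer,
    each of shape (num_heads, num_tokens, num_tokens) with rows summing to 1."""
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attention in attentions:
        # Fuse heads by averaging (an assumption; max/min fusion is also used).
        fused = layer_attention.mean(axis=0)
        # Add the identity for the residual connection, then renormalize rows.
        fused = fused + np.eye(num_tokens)
        fused = fused / fused.sum(axis=-1, keepdims=True)
        # Compose with the attention accumulated from earlier layers.
        rollout = fused @ rollout
    return rollout

# Toy input: 3 layers, 2 heads, 5 tokens (token 0 plays the role of [CLS]).
rng = np.random.default_rng(0)
toy_attentions = []
for _ in range(3):
    logits = rng.normal(size=(2, 5, 5))
    weights = np.exp(logits)
    toy_attentions.append(weights / weights.sum(axis=-1, keepdims=True))

rollout = attention_rollout(toy_attentions)
cls_to_patches = rollout[0, 1:]  # how much [CLS] draws on each patch token
```

For a real ViT, `cls_to_patches` would be reshaped to the patch grid and upsampled to the image size to get a heatmap.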
We’ve also built a @huggingface organization around our experiments. The organization holds the Keras pre-trained models & spaces where you can try the visualization on your own images.
We thank @fchollet for his helpful guidance on the tutorial. We thank @jarvislabsai & @GoogleDevExpert for providing us with credit support that allowed the experiments.
Thanks to @ritwik_raha for helping us with this amazing visual.
5/
Includes a total of 15 ImageNet-1k and ImageNet-21k ConvNeXt models + conversion scripts, off-the-shelf inference, and fine-tuning code.
1/
These models ARE NOT opaque. You can load them like so and inspect whatever you need to:
2/
Here's the full disclosure on the accuracy scores. The differences are mainly due to library implementation differences. But happy to stand corrected if someone has other suggestions.
Q: If I implement a paper, there are likely lots of implementations of that already existing. How do I make it worthwhile?
⬇️
1> I think the learning aspect should precede any other aspect in this regard. Whether or not it's gonna be worthwhile shouldn't matter if you are up for the learning challenge.
2> But if you want to make the implementation a part of your project portfolio, the following things could be helpful.
2.1> You could pick up papers that are a bit off the grid from the conventional ones while still being in your territory. You should also enjoy working on it.
Implementing a paper is helpful in so many ways. You get to:
* Know the work inside out including the implementation details.
* Study amazing resources to further your understanding.
* Read a lot of code for references. Sometimes, the official codebases are amazing.
1/
Oftentimes, an idea seems fairly simple but when it comes to implementation details, things start to get messier. This is the learning, folks!
If the original impl. is messy, you might be able to make it elegant, simpler, and in turn, better.
2/
For me, implementing existing works has helped me become a better practitioner and also a better believer. It's almost never easy but that's the real fun. It boosts your confidence and also your knowledge.
3/
We provide standalone scripts and also notebooks for training and testing our models. We open-source all the experimental results and pre-trained models:
Recipes that I find to be beneficial when working in low-data/imbalance regimes (vision):
* Use a weighted loss function &/or focal loss.
* Either use simpler/shallower models or use models that are known to work well in these cases. Ex: SimCLRv2, Big Transfer, DINO, etc.
1/n
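To make the focal loss idea concrete, here is a small pure-Python sketch of the binary case. It down-weights examples the model already gets right so the hard (often minority-class) ones dominate the loss. The function name and the defaults `gamma=2.0` and `alpha=0.25` are assumptions for illustration; tune them for your data.

```python
import math

def binary_focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Hypothetical helper: focal loss for one example.
    prob_positive is the predicted probability of class 1; label is 0 or 1."""
    # p_t is the probability assigned to the true class.
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    # alpha_t is the class weight (this is the "weighted loss" part).
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clip to avoid log(0).
    p_t = min(max(p_t, 1e-7), 1.0)
    # (1 - p_t)^gamma shrinks the loss for confident, correct predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_example = binary_focal_loss(0.9, 1)  # confident and correct
hard_example = binary_focal_loss(0.1, 1)  # confident and wrong
```

With `gamma=0.0` this reduces to plain class-weighted cross-entropy.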
* Use MixUp or CutMix in the augmentation pipeline to relax the space of marginals.
* Ensure a certain percentage of minority class data is always present during each mini-batch. In @TensorFlow, this can be done using `tf.data.experimental.rejection_resample`.
* Use semi-supervised learning recipes that combine the benefits of self-supervision and few-shot learning. Ex: PAWS by @facebookai.
* Use SWA. It is generally advised for better generalization, and it is particularly useful in these regimes.
3/n
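As an illustration of the MixUp recipe mentioned in this thread, here is a minimal NumPy sketch that blends each image (and its one-hot label) with a randomly chosen partner from the same batch. The function name, the `alpha=0.2` default, and the toy batch are assumptions, not a reference implementation.

```python
import numpy as np

def mixup_batch(images, one_hot_labels, alpha=0.2, rng=None):
    """Hypothetical helper: apply MixUp to one batch."""
    rng = np.random.default_rng() if rng is None else rng
    # One mixing coefficient per batch, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)
    # Pair every example with another example from the same batch.
    permutation = rng.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[permutation]
    mixed_labels = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[permutation]
    return mixed_images, mixed_labels

# Toy batch: 4 RGB images of size 8x8, 3 classes.
rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))
labels = np.eye(3)[[0, 1, 2, 0]]
mixed_images, mixed_labels = mixup_batch(images, labels, rng=rng)
```

The labels become soft, so train with a loss that accepts probability targets (e.g. categorical cross-entropy on one-hot-style labels).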
New #Keras example is up on *consistency regularization*, an important recipe for semi-supervised learning and for tackling distribution shifts, as shown in *Noisy Student Training*.
This example provides a template for performing semi-supervised / weakly supervised learning. A few things one can plug right in:
* Incorporate more data while training the student.
* Filter the high-confidence predictions while training the student.
2/n
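Filtering high-confidence predictions could look roughly like the NumPy sketch below: keep only the unlabeled samples where the teacher's top probability clears a threshold, and use the argmax as the pseudo-label. The function name, the `0.9` threshold, and the toy probabilities are assumptions; this step is not part of the Keras example itself.

```python
import numpy as np

def filter_confident(teacher_probs, threshold=0.9):
    """Hypothetical helper: return indices and pseudo-labels of the
    samples whose top teacher probability is at least `threshold`."""
    confidence = teacher_probs.max(axis=-1)
    keep = confidence >= threshold
    pseudo_labels = teacher_probs.argmax(axis=-1)[keep]
    return np.flatnonzero(keep), pseudo_labels

# Toy teacher outputs for 3 unlabeled samples over 3 classes.
teacher_probs = np.array([
    [0.95, 0.03, 0.02],  # confident: kept
    [0.40, 0.35, 0.25],  # uncertain: dropped
    [0.05, 0.92, 0.03],  # confident: kept
])
kept_indices, pseudo_labels = filter_confident(teacher_probs)
```

The student would then train on the kept samples only (or on them plus the labeled set).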
The example uses Stochastic Weight Averaging while training the teacher to induce geometric ensembling. With elements like Stochastic Dropout, the performance might be even better.
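At its core, Stochastic Weight Averaging keeps an equal-weight running average of the model weights collected at several points during training. Here is a minimal NumPy sketch of that update. The helper name and the toy "checkpoints" are assumptions, and a real run would pair this with a suitable learning-rate schedule.

```python
import numpy as np

def update_swa(swa_weights, new_weights, num_averaged):
    """Hypothetical helper: fold one more set of weights into the running
    average. `num_averaged` is how many sets are already in `swa_weights`."""
    return [
        (avg * num_averaged + new) / (num_averaged + 1)
        for avg, new in zip(swa_weights, new_weights)
    ]

# Toy "checkpoints": each is a list of weight arrays (one per layer).
checkpoints = [[np.full((2, 2), v), np.full((2,), v)] for v in (1.0, 2.0, 3.0)]

swa_weights = [w.copy() for w in checkpoints[0]]
for num_averaged, checkpoint in enumerate(checkpoints[1:], start=1):
    swa_weights = update_swa(swa_weights, checkpoint, num_averaged)
```

After the loop, `swa_weights` holds the mean of the three checkpoints, which is what would be loaded into the teacher for evaluation.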