Rewatched @Tesla's AI day recently, and when @karpathy introduced the Transformer used in AutoPilot, it immediately reminded me of @DeepMind's #PerceiverIO which I recently contributed to @huggingface. Wonder whether Tesla's approach was inspired by it...
... or whether they were already using this (long) before the paper's introduction. Especially the sentence "you initialize a raster the size of the output space that you'd like and tile it with position encodings" => this is exactly what Perceiver IO does as well! @drew_jaegle
This idea is brilliant actually: the features of the 8 cameras serve as keys (K) and values (V), while the individual pixels of the output (vector) space (bird's eye view), tiled with sin/cos position encodings, provide the queries (Q) for multi-head attention (rough sketch below).
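To make that concrete, here's a minimal PyTorch sketch of the idea: cross-attention from a positionally-encoded bird's-eye-view query raster onto the flattened camera features. All shapes, dimensions and names (BEVCrossAttention, sincos_2d_positions, the 64x64 raster) are my own illustrative assumptions, not Tesla's or Perceiver IO's actual implementation:

```python
import math
import torch
import torch.nn as nn


def sincos_2d_positions(height, width, dim):
    """Tile a (height, width) raster with fixed sin/cos position encodings."""
    assert dim % 4 == 0
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    freqs = torch.exp(torch.arange(0, dim // 4) * (-math.log(10000.0) / (dim // 4)))
    pos = []
    for coord in (ys.flatten().float(), xs.flatten().float()):
        angles = coord[:, None] * freqs[None, :]
        pos += [angles.sin(), angles.cos()]
    return torch.cat(pos, dim=-1)  # (height * width, dim)


class BEVCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8, bev_h=64, bev_w=64):
        super().__init__()
        # Queries: one vector per output ("bird's eye view") pixel,
        # initialized from the sin/cos position encodings of the raster.
        self.register_buffer("bev_queries", sincos_2d_positions(bev_h, bev_w, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, camera_features):
        # camera_features: (batch, num_camera_tokens, dim),
        # i.e. the flattened features of all 8 cameras acting as keys/values.
        batch = camera_features.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(batch, -1, -1)
        bev, _ = self.attn(q, camera_features, camera_features)
        return bev  # (batch, bev_h * bev_w, dim) -> reshape into the BEV raster


# Example: 8 cameras, each contributing 300 tokens of dimension 256.
features = torch.randn(2, 8 * 300, 256)
print(BEVCrossAttention()(features).shape)  # torch.Size([2, 4096, 256])
```

The nice property is that the output resolution is decoupled from the inputs: the BEV raster defines how many queries there are, while the camera features can be any length, which is the same decoding trick Perceiver IO uses.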
This allows the car to operate in "vector" space (a top-down view), rather than in image space (each camera individually), which results in much better performance. The relevant part starts at 56:00:
Happy to share my first @Gradio demo hosted as a @huggingface Space! It showcases @facebookai's new DINO self-supervised method, which allows Vision Transformers to segment objects within an image without ever being trained to do so! Try it yourself!
I've also converted all ViTs trained with DINO from the official repository and uploaded them to the hub: huggingface.co/models?other=d…. Just load them into a ViTModel or ViTForImageClassification ;)
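For example, a minimal sketch (assuming "facebook/dino-vitb16" is one of the converted checkpoints; the attention-map indexing follows how the DINO paper visualises its segmentations):

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("facebook/dino-vitb16")
model = ViTModel.from_pretrained("facebook/dino-vitb16")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# attentions: one tensor per layer of shape (batch, heads, seq_len, seq_len);
# the CLS-token row of the last layer tends to highlight the object.
cls_attention = outputs.attentions[-1][0, :, 0, 1:]
print(cls_attention.shape)  # (num_heads, num_patches), e.g. (12, 196)
```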
Also, amazed at how ridiculously easy @Gradio and @huggingface Spaces are; I got everything set up in 10 minutes
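For context, a @Gradio app is essentially this small (an illustrative placeholder, not the actual Space's code):

```python
import gradio as gr


def predict(image):
    # placeholder: run the DINO model here and return its attention overlay
    return image


demo = gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs=gr.Image())
demo.launch()
```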