Excited to share our work on Conditional Object-Centric Learning from Video!

We introduce SAVi, a slot-based model that can discover and represent visual entities in videos using simple location cues and object motion (...or entirely unsupervised)

🖥️ slot-attention-video.github.io

When trained entirely unsupervised (by simply reconstructing the input video), SAVi learns to decompose videos into meaningful entities, such as objects or parts that move independently.

While this works on (simple) real data, such as in this robotic grasping environment...

...bridging the gap to visually more complex scenes with diverse textures is a challenge, especially since the notion of an object is often ambiguous.

Simple cues, such as points on objects in the first frame, combined with predicting object motion (optical flow) as a training target, can break this ambiguity...

...and allow SAVi to decompose, segment, and track moving objects in visually far more complicated environments, using real-world backgrounds and realistic household objects -- without receiving explicit supervision for this task.
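A rough sketch of this setup, in NumPy. The function names, the random projection, and the loss below are illustrative stand-ins (SAVi uses small learned networks for both), so treat this as a sketch of the idea, not the actual implementation:

```python
import numpy as np

def init_slots_from_points(points, dim=64, seed=0):
    """Map first-frame point cues (x, y) to initial slot states.

    In SAVi this mapping is a small learned encoder; a fixed random
    projection stands in for it here -- this is only a sketch.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(2, dim)) / np.sqrt(2)  # stand-in for learned encoder
    return points @ W  # one slot per conditioned object: (num_objects, dim)

def flow_loss(pred_flow, target_flow):
    """L2 loss on predicted optical flow. Predicting motion (rather than
    RGB appearance) is what helps disambiguate objects in textured scenes."""
    return float(np.mean((pred_flow - target_flow) ** 2))
```

At test time, swapping in different point cues changes which entities the slots bind to, without retraining.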

Caveat: this only works for moving objects.

Conditioning the slots of SAVi on external context/cues gives us an interface to the model at test time:

This allows SAVi to decompose scenes at different hierarchy levels (e.g. objects/parts), depending on which context (in the form of conditioning signals) is provided.

Check out our paper to learn more about the model & find many more experiments/results!

Paper: arxiv.org/abs/2111.12594
Project page: slot-attention-video.github.io

Scaling slot-based NNs to diverse real-world data with minimal supervision is an exciting challenge for future work.

Joint work w/ amazing collaborators in the Brain Team at Google Research & Robotics at Google: @gamaleldinfe, Aravindh Mahendran, Austin Stone, @sabour_sara, Georg Heigold, @rico_jski, Alexey Dosovitskiy & Klaus Greff


More from @thomaskipf

29 Jun 20
Excited to share our work @GoogleAI on Object-centric Learning with Slot Attention!

Slot Attention is a simple module for structure discovery and set prediction: it uses iterative attention to group perceptual inputs into a set of slots.

Paper: arxiv.org/abs/2006.15055

Slot Attention is related to self-attention, with some crucial differences that effectively turn it into a meta-learned clustering algorithm.

Slots are randomly initialized for each example and then iteratively refined. Everything is symmetric under permutation.
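The core loop can be sketched in a few lines of NumPy. This is a simplified sketch: fixed random matrices stand in for the learned q/k/v projections, and the paper's GRU slot update is replaced by a plain weighted mean, but it shows the key mechanics (random slot initialization, iterative refinement, and a softmax over slots so that slots compete for inputs):

```python
import numpy as np

def slot_attention(inputs, num_slots=4, num_iters=3, dim=64, seed=0):
    """Minimal Slot Attention sketch: iteratively group n input features
    into `num_slots` slot vectors via attention."""
    rng = np.random.default_rng(seed)
    n, d_in = inputs.shape
    # Stand-ins for learned linear projections (q for slots, k/v for inputs).
    W_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    W_k = rng.normal(size=(d_in, dim)) / np.sqrt(d_in)
    W_v = rng.normal(size=(d_in, dim)) / np.sqrt(d_in)

    # Slots are sampled randomly per example: no per-slot parameters,
    # so everything is symmetric under permutation of the slots.
    slots = rng.normal(size=(num_slots, dim))
    k, v = inputs @ W_k, inputs @ W_v

    for _ in range(num_iters):  # iterative refinement
        q = slots @ W_q
        logits = k @ q.T / np.sqrt(dim)  # (n, num_slots)
        # Crucial difference to self-attention: normalize over *slots*,
        # so slots compete to explain each input feature.
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn = attn / attn.sum(axis=1, keepdims=True)
        # Update each slot as a weighted mean of the inputs assigned to it
        # (the paper uses a GRU here instead).
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = w.T @ v  # (num_slots, dim)
    return slots, attn
```

The softmax-over-slots normalization is what turns this into a soft, meta-learned clustering of the inputs.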

Slot Attention can be used in a simple auto-encoder architecture that learns to decompose scenes into objects.

Compared to prior slot-based approaches (IODINE/MONet), no intermediate decoding is needed, which significantly improves efficiency.

