Here's a little summary of the different parts for those curious: 1/5

The model in this paper learns to associate one or more objects with the effects they have on their environment (shadows, reflections, etc.), given a video and rough segmentation masks of each object. This enables video effects like "background replacement" 2/5

and "color pop" and a "stroboscopic" effect (in the next tweet): 3/5

It's trained using self-supervision with a really interesting loss function that I go through in the post.
The main objective is to reconstruct the original video as a composition of each layer (and the background), but there are lots of other aspects that improve the model. 4/5

Hope you enjoy the blog post and check out the original work.
The Dataset has to be passed to the DataLoader. It's where you transform your data and where the inputs and labels are stored.
It's basically one big list of (input, label) tuples: indexing with dataset[i] returns (input[i], label[i]).
2/5
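The "big list of (input, label) tuples" idea can be sketched as a minimal PyTorch Dataset subclass (ToyDataset and the toy tensors here are made up for illustration):

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """A minimal Dataset: one big list of (input, label) tuples."""
    def __init__(self, inputs, labels):
        # any data transforms would typically happen here or in __getitem__
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # dataset[i] -> (input[i], label[i])
        return self.inputs[i], self.labels[i]

dataset = ToyDataset(torch.arange(10.0), torch.arange(10) % 2)
x, y = dataset[3]  # x = tensor(3.), y = tensor(1)
```

A DataLoader then wraps this object and handles batching and shuffling for you.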
The sampler and batch sampler are used to choose which inputs & labels go into the next batch.
Artistic license warning 👨🎨⚠️: They don't actually grab the inputs and labels, and the dataset doesn't actually deplete. They tell the DataLoader which indices to grab.
3/5
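That index-only behavior is easy to see with the samplers PyTorch ships with — they never touch the data itself, they just produce indices:

```python
from torch.utils.data import SequentialSampler, BatchSampler

data = list(range(7))  # stand-in dataset; the sampler only needs its length

sampler = SequentialSampler(data)                       # yields 0, 1, ..., 6
batches = BatchSampler(sampler, batch_size=3, drop_last=False)

print(list(batches))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The DataLoader takes each list of indices and calls dataset[i] for each one — nothing "depletes", and no inputs or labels pass through the sampler.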