I'm glad that I can finally tell you about our paper MTTR (arxiv.org/abs/2111.14821), which got accepted to #CVPR2022, led by Adam Botach and supervised by @ChaimBaskin (1/n)
In this work we tackle the complex multi-modal problem of referring video segmentation -- segmenting an object in a video given its textual description. (2/n)
We propose a very simple (even if it may not look so) end-to-end trainable pipeline, consisting of a single multimodal Transformer model. It is free of text-related inductive-bias components and requires no additional mask-refinement post-processing steps. (3/n) [Image: a detailed overview of MTTR]
The model consists of a feature extractor (Video Swin for video, frozen RoBERTa for text), a multimodal transformer inspired by DETR, and a prediction head. The multimodal transformer receives frame features along with textual features. (4/n)
It stores the extracted info in object queries, which allows natural tracking (the same query in different frames corresponds to the same object). For each query we generate a mask and a referring score -- the probability that the query corresponds to the referred object. For more details, check the paper (5/n)
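To make that concrete, here is a minimal, hypothetical PyTorch sketch of the forward pass. The linear projections stand in for the real backbones, the mask head is reduced to a single layer (the actual model combines mask embeddings with spatially decoded frame features), and all module names and shapes are illustrative assumptions -- see github.com/mttr2021/MTTR for the real implementation.

```python
import torch
import torch.nn as nn

class MTTRSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=50, mask_dim=32):
        super().__init__()
        # Stand-ins for the real backbones (Video Swin for frames, frozen
        # RoBERTa for text) -- here just projections of precomputed features.
        self.video_proj = nn.Linear(1024, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # DETR-style encoder-decoder over concatenated video + text tokens.
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        # One set of object queries shared across frames, so the same query
        # index follows the same object over time (natural tracking).
        self.queries = nn.Embedding(num_queries, d_model)
        self.mask_head = nn.Linear(d_model, mask_dim)  # mask embeddings (simplified)
        self.ref_head = nn.Linear(d_model, 1)          # referring score per query

    def forward(self, video_feats, text_feats):
        # video_feats: (T, H*W, 1024) per-frame patch features
        # text_feats:  (L, 768) token features of the referring expression
        T = video_feats.shape[0]
        memory = torch.cat(
            [self.video_proj(video_feats),
             self.text_proj(text_feats).unsqueeze(0).expand(T, -1, -1)],
            dim=1)
        queries = self.queries.weight.unsqueeze(0).expand(T, -1, -1)
        hs = self.transformer(memory, queries)          # (T, num_queries, d_model)
        return self.mask_head(hs), self.ref_head(hs).squeeze(-1)

# e.g.: mask_emb, ref_logits = MTTRSketch()(torch.randn(8, 196, 1024), torch.randn(12, 768))
```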
For training, we match the object queries with the GT objects using bipartite matching (the cost combines the referring score and the dice coefficient of the masks). The loss is the sum of a mask loss (dice + focal) and a referring loss (CE). We use both referred and unreferred objects for training. (6/n)
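A hedged sketch of that training step, continuing the toy names and shapes above (plain BCE stands in for the focal term, the referring loss is written as binary CE, and the real code differs in details such as how matching is done across frames):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_matrix(pred_probs, gt_masks, eps=1e-6):
    # pred_probs: (Q, N) mask probabilities, gt_masks: (G, N) binary masks, N = flattened pixels
    inter = pred_probs @ gt_masks.t()                             # (Q, G)
    union = pred_probs.sum(-1, keepdim=True) + gt_masks.sum(-1)   # (Q, G) via broadcasting
    return (2 * inter + eps) / (union + eps)

def match_and_loss(mask_logits, ref_logits, gt_masks, gt_is_referred):
    # mask_logits: (Q, N), ref_logits: (Q,), gt_masks: (G, N), gt_is_referred: (G,) floats in {0, 1}
    dice = dice_matrix(mask_logits.sigmoid(), gt_masks)
    # Matching cost: reward high mask dice and a referring score that agrees
    # with the GT "is referred" flag.
    cost = -dice - ref_logits.sigmoid()[:, None] * gt_is_referred[None, :]
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    q_idx, g_idx = torch.as_tensor(q_idx), torch.as_tensor(g_idx)

    # Mask loss on matched pairs, for referred and unreferred objects alike.
    dice_loss = 1.0 - dice[q_idx, g_idx].mean()
    bce_loss = F.binary_cross_entropy_with_logits(mask_logits[q_idx], gt_masks[g_idx])

    # Referring loss: matched queries take their GT flag, unmatched ones are "not referred".
    ref_target = torch.zeros_like(ref_logits)
    ref_target[q_idx] = gt_is_referred[g_idx]
    ref_loss = F.binary_cross_entropy_with_logits(ref_logits, ref_target)
    return dice_loss + bce_loss + ref_loss
```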
During inference we simply return the object with the highest referring score. (7/n)
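In code, with the same hypothetical per-frame shapes (averaging the referring scores over frames before picking the winner is an assumption of this sketch):

```python
def pick_referred_object(mask_logits, ref_logits):
    # mask_logits: (T, Q, N) per-frame mask logits, ref_logits: (T, Q) per-frame referring logits
    best_query = ref_logits.sigmoid().mean(dim=0).argmax()   # temporally averaged referring score
    return mask_logits[:, best_query].sigmoid() > 0.5        # (T, N) binary masks of the winning query
```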
The main benchmarks for the task are A2D-Sentences and JHMDB-Sentences. Only A2D is used for training, while JHMDB is used for validation only. Sadly, the labeling in those datasets is far from perfect: in A2D only 3–5 frames per sample are annotated, and JHMDB uses puppet masks. (8/n)
There was a stagnation in the benchmarks: the SOTA in July 2020 was 38.8% mAP on A2D, and it was 39.9% at the time of MTTR's publication (paperswithcode.com/sota/referring…). Moreover, we found that some papers calculated some metrics incorrectly (e.g., computing mAP as the mean of precisions). (9/n) [Image: Papers with Code SOTA plot]
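For reference, a generic sketch of how detection-style mAP is meant to be computed: AP is the area under the precision-recall curve at each IoU threshold, and mAP averages AP over thresholds 0.50:0.05:0.95. This is an illustration, not the benchmark's exact evaluation script.

```python
import numpy as np

def average_precision(scores, ious, iou_thresh, n_gt):
    # scores: confidence per prediction, ious: IoU of each prediction with its GT,
    # n_gt: number of ground-truth objects. All-point (uninterpolated) AP.
    order = np.argsort(-np.asarray(scores))
    tp = (np.asarray(ious)[order] >= iou_thresh).astype(float)
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(n_gt, 1)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    return float(np.trapz(precision, recall))   # area under the PR curve

def mean_average_precision(scores, ious, n_gt):
    thresholds = np.linspace(0.5, 0.95, 10)     # 0.50:0.05:0.95
    return float(np.mean([average_precision(scores, ious, t, n_gt) for t in thresholds]))

# The buggy variant the thread refers to instead averages raw precision values
# directly, which is a different quantity altogether.
```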
Our best model reaches 46.1 (+5.7) and 39.2 (+5.0) mAP on A2D-Sentences and JHMDB-Sentences respectively, while processing 76 frames per second on a single RTX 3090 GPU. (10/n) [Image: tables with results]
We also evaluate on Refer-YouTube-VOS (youtube-vos.org/dataset/rvos/), which uses denser annotation (masks every five frames) but mostly focuses on appearance rather than actions. The dataset is more recent and less popular, and the split was changed after the introduction of the competition. (11/n)
The competition entries often use pretraining on other datasets and ensembling, which significantly boosts performance (VisTR claims an 8–10 mAP boost from COCO pretraining, github.com/Epiphqny/VisTR…). Yet we were able to outperform the second-place entry in last year's competition. (12/n)
Finally, we made the code and the models available, including a Colab notebook and a Spaces demo that you can try yourself at github.com/mttr2021/MTTR. That, in particular, allowed us to test the model in various scenarios beyond the numbers on a single dataset. (13/n)
Ability to detect an object in a crowd of similar objects (this is admittedly cherry-picked, but still impressive to me) (14/n)
Color understanding (15/n)
Generalization to unseen instances (no wolves in the training set) (16/n)
Generalization to unseen instances 2 (no capybaras either) (17/n)
Generalization across domains (18/n)
Understanding of directions (19/n)
Actually, directions appear to be the strongest prior, to the point that the rest of the query is sometimes ignored (I got this one by mistake) (20/n)
I'll conclude with a list of limitations which current benchmarks seem to miss: longer videos (minutes or hours), detecting the absence of an object, larger datasets, a measure of temporal consistency, and generalization to unseen instances. I hope new datasets will address those. (21/n, n=21)
