In our #CVPR2022 Oral, we introduce the atemporal probe (ATP) to analyze *atemporal* (single-frame) bias in video-language benchmarks and models, with surprising results! (see 🧵)
Led by Shyamal Buch with @CristbalEyzagu2, @adnothing, @jiajunwu_cs, @drfeifei
atp-video-language.stanford.edu
1/10
The promise of videos is the potential to go *beyond* image-centric understanding (people, objects, scenes, etc.) towards event temporality, causality, and dynamics. Ideally, we want video-language benchmarks and models to realize this promise.
2/10
Our paper focuses on a fundamental question in video research: to what extent can "image-centric" understanding address "video" understanding?
Consider the example below: can we answer the question with only a single frame?
3/10
Standard techniques for measuring "image-centric" understanding: sampling a random frame, mean-pooling frame features...
But videos can be naturally noisy! (frames w/ camera blur, odd angles, uninformative content)
Maybe standard methods under-represent the real boundary of image vs. video?
4/10
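For concreteness, here is a minimal sketch of those two standard baselines over frozen frame embeddings (the `frame_feats` input and function names are illustrative, not the paper's code):

```python
import torch

def random_frame_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) frozen image-language (e.g. CLIP) embeddings.
    # One arbitrary frame stands in for the whole video.
    idx = torch.randint(len(frame_feats), (1,)).item()
    return frame_feats[idx]

def mean_pool_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    # Average all frame embeddings; blurry/uninformative frames dilute the signal.
    return frame_feats.mean(dim=0)
```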
What if we could instead select a "good" frame? This is the core idea of our atemporal probe (ATP) model: encode a few randomly sampled frames with a *frozen* image-language encoder (CLIP), then select one "good" encoding (without any temporal information) to pass on to the final task.
5/10
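A minimal sketch of the selection idea, assuming a simple linear scorer over frozen frame embeddings and Gumbel-softmax for differentiable hard selection; the paper's actual probe is richer (a lightweight transformer), so treat this as an illustration, not the exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbeSketch(nn.Module):
    """Sketch: score frozen frame encodings and select exactly one.
    No positional/temporal information is used, so the output is atemporal."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # lightweight; the image encoder stays frozen

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim); frame order is irrelevant to the scorer
        logits = self.scorer(frame_feats).squeeze(-1)          # (num_frames,)
        if self.training:
            # differentiable "hard" one-frame selection (straight-through)
            weights = F.gumbel_softmax(logits, hard=True)
        else:
            weights = F.one_hot(logits.argmax(), len(frame_feats)).float()
        return weights @ frame_feats  # the single selected frame encoding
```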
Surprising: ATP gives a strong "image-centric" bound for video-language tasks, even outperforming some state-of-the-art video models!
Takeaways: (1) these datasets may be well-addressed with a single frame, (2) video-language models may be held back by noise in the frames they process.
6/10
These takeaways hold even for datasets explicitly designed for temporal/causal video-language understanding!
7/10
We take steps to "close the loop" on both takeaways: First, we show ATP can help identify temporally challenging (multi-frame) data, to better measure progress on video model design components (temporal modeling, motion, etc.). ATP is promising for future in-the-loop dataset design!
8/10
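One way to operationalize this filtering (a sketch of the idea, not the paper's exact protocol; `atp_predict` and the example fields are hypothetical):

```python
def temporally_challenging_subset(dataset, atp_predict):
    # If a strong single-frame (ATP) model already answers correctly, the example
    # likely doesn't require temporal reasoning; keep the examples it misses.
    return [ex for ex in dataset
            if atp_predict(ex.frames, ex.question) != ex.answer]
```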
Second, we show how ATP can plug into the input of a multi-frame temporal model, improving both accuracy (less noise) and efficiency (fewer frames).
9/10
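A sketch of that pipeline, assuming hypothetical `clip_encode`, `atp_scorer`, and `temporal_model` callables (the wiring, not the paper's exact interface):

```python
def atp_frontend(frames, clip_encode, atp_scorer, temporal_model, k: int = 4):
    # Score every frame with the cheap ATP-style selector, keep only the
    # top-k most informative frames, then run the expensive temporal model.
    feats = clip_encode(frames)                   # (num_frames, dim), frozen encoder
    scores = atp_scorer(feats).squeeze(-1)        # (num_frames,)
    keep = scores.topk(k).indices.sort().values   # restore temporal order for the model
    return temporal_model(feats[keep])
```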
For more, please visit our oral talk (Session 1.2.3; this afternoon) and poster (Session 1.2).
Project: atp-video-language.stanford.edu
Paper: arxiv.org/abs/2206.01720
Video:
10/10