In our #CVPR2022 Oral, we introduce the atemporal probe (ATP) to analyze *atemporal* (single-frame) bias in video-language benchmarks and models, with surprising results! (see 🧵)
Led by Shyamal Buch with @CristbalEyzagu2, @adnothing, @jiajunwu_cs, @drfeifei
atp-video-language.stanford.edu
1/10
The promise of videos is the potential to go *beyond* image-centric understanding (people, objects, scenes, etc.) towards event temporality, causality, and dynamics. Ideally, we want video-language benchmarks and models to realize this promise.
2/10
Our paper focuses on a fundamental question in video research: to what extent can "image-centric" understanding address "video" understanding?
Consider the example below: can we answer the question with only a single frame?
3/10
Standard techniques for measuring "image-centric" understanding: sampling a random frame, mean-pooling frame features...
But videos can be naturally noisy! (frames w/ camera blur, odd angles, uninformative content)
Maybe standard methods under-represent the real boundary of image vs. video?
4/10
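For concreteness, here is a minimal sketch of those two standard baselines over frozen frame embeddings (the `frame_feats` input and function names are illustrative, not the paper's code):

```python
import torch

def random_frame_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) frozen image-language (e.g. CLIP) embeddings.
    # One arbitrary frame stands in for the whole video.
    idx = torch.randint(len(frame_feats), (1,)).item()
    return frame_feats[idx]

def mean_pool_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    # Average all frame embeddings; blurry/uninformative frames dilute the signal.
    return frame_feats.mean(dim=0)
```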
What if we could instead select a "good" frame? This is the core idea of our atemporal probe (ATP) model: encode a few randomly sampled frames with a *frozen* image-language encoder (CLIP), then select one "good" encoding (without any temporal information) to pass on to the final task.
5/10
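A minimal sketch of the selection idea, assuming a simple linear scorer over frozen frame embeddings and Gumbel-softmax for differentiable hard selection; the paper's actual probe is richer (a lightweight transformer), so treat this as an illustration, not the exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbeSketch(nn.Module):
    """Sketch: score frozen frame encodings and select exactly one.
    No positional/temporal information is used, so the output is atemporal."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # lightweight; the image encoder stays frozen

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim); frame order is irrelevant to the scorer
        logits = self.scorer(frame_feats).squeeze(-1)          # (num_frames,)
        if self.training:
            # differentiable "hard" one-frame selection (straight-through)
            weights = F.gumbel_softmax(logits, hard=True)
        else:
            weights = F.one_hot(logits.argmax(), len(frame_feats)).float()
        return weights @ frame_feats  # the single selected frame encoding
```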
Surprising: ATP gives a strong "image-centric" bound for video-language tasks, even outperforming some state-of-the-art video models!
Takeaways: (1) these datasets may be well-addressed with a single frame, (2) video-language models may be held back by noise in the frames they process.
6/10
These takeaways hold even for datasets explicitly designed for temporal/causal video-language understanding!
7/10
We take steps to "close the loop" on both takeaways: First, we show ATP can help identify temporally challenging (multi-frame) data, to better measure progress on video model design components (temporal modeling, motion, etc.). ATP is promising for future in-the-loop dataset design!
8/10
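One way to operationalize this filtering (a sketch of the idea, not the paper's exact protocol; `atp_predict` and the example fields are hypothetical):

```python
def temporally_challenging_subset(dataset, atp_predict):
    # If a strong single-frame (ATP) model already answers correctly, the example
    # likely doesn't require temporal reasoning; keep the examples it misses.
    return [ex for ex in dataset
            if atp_predict(ex.frames, ex.question) != ex.answer]
```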
Second, we show how ATP can plug into the input of a multi-frame temporal model, improving both accuracy (less noise) and efficiency (fewer frames).
9/10
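A sketch of that pipeline, assuming hypothetical `clip_encode`, `atp_scorer`, and `temporal_model` callables (the wiring, not the paper's exact interface):

```python
def atp_frontend(frames, clip_encode, atp_scorer, temporal_model, k: int = 4):
    # Score every frame with the cheap ATP-style selector, keep only the
    # top-k most informative frames, then run the expensive temporal model.
    feats = clip_encode(frames)                   # (num_frames, dim), frozen encoder
    scores = atp_scorer(feats).squeeze(-1)        # (num_frames,)
    keep = scores.topk(k).indices.sort().values   # restore temporal order for the model
    return temporal_model(feats[keep])
```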
For more, please visit our oral talk (Session 1.2.3; this afternoon) and poster (Session 1.2).
Project: atp-video-language.stanford.edu
Paper: arxiv.org/abs/2206.01720
Video:
10/10