In our #CVPR2022 Oral, we introduce the atemporal probe (ATP) to analyze *atemporal* (single frame) bias in video-language, with surprising results! (see 🧵)

Led by Shyamal Buch with @CristbalEyzagu2, @adnothing, @jiajunwu_cs, @drfeifei

atp-video-language.stanford.edu

1/10
The promise of videos is the potential to go *beyond* image-centric understanding (people, objects, scenes, etc.) towards event temporality, causality, and dynamics. Ideally, we want video-language benchmarks and models to realize this promise.

2/10
Our paper focuses on a fundamental question in video research: to what extent can "image-centric" understanding address "video" understanding?

Consider the example below: can we answer the question with only a single frame?

3/10
Standard techniques for measuring "image-centric" understanding: sampling a random frame, or mean pooling features across frames... (sketched below)

But - videos can be naturally noisy! (frames w/ camera blur, weird angles, uninformative content)
Maybe standard methods under-represent the real boundary of image vs. video?

4/10
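
To make these baselines concrete, here is a minimal sketch (our illustration, not the paper's code). `encode_frame` stands in for any frozen image encoder (e.g. CLIP); its name and signature are assumptions.

```python
# Minimal sketch of the two standard "image-centric" baselines.
# Assumption: encode_frame(frame) -> Tensor[d] is a frozen image encoder (e.g. CLIP).
import random
import torch

def random_frame_baseline(frames, encode_frame):
    # Represent the whole video by one uniformly sampled frame.
    return encode_frame(random.choice(frames))

def mean_pool_baseline(frames, encode_frame):
    # Average all per-frame encodings; one blurry or uninformative frame
    # gets mixed into (and can dilute) the pooled representation.
    feats = torch.stack([encode_frame(f) for f in frames])  # [T, d]
    return feats.mean(dim=0)                                # [d]
```

Note the failure modes: random sampling can land on a bad frame, and mean pooling averages bad frames in. Both can understate what a single *well-chosen* frame can do.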
What if we were to select a "good frame" though? This is the core idea of our atemporal probe (ATP) model: use a *frozen* image-language encoder (CLIP) on a few random frames, then select one "good" encoding (without any temporal information) to pass on to the final task.

5/10
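
A minimal sketch of the selection idea, assuming frozen CLIP features are computed upstream (our paraphrase, not the released implementation; layer sizes and the hard argmax are our choices — end-to-end training would need a differentiable relaxation such as Gumbel-softmax):

```python
# Sketch of an ATP-style selector over a *set* of frozen frame encodings.
# No positional encodings anywhere, so the selector cannot exploit temporal order.
import torch
import torch.nn as nn

class AtemporalProbe(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.selector = nn.TransformerEncoder(layer, num_layers=1)  # low capacity
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):            # [B, K, d] frozen CLIP features
        h = self.selector(frame_feats)         # contextualize the candidate set
        logits = self.score(h).squeeze(-1)     # [B, K] per-frame scores
        idx = logits.argmax(dim=-1)            # hard-select one "good" frame
        batch = torch.arange(frame_feats.size(0), device=frame_feats.device)
        return frame_feats[batch, idx]         # [B, d] single-frame encoding
```

The key design choice: the downstream task head only ever sees one frame's (frozen) encoding, so any accuracy it achieves is atemporal by construction.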
Surprising: ATP gives a strong "image-centric" bound for videos and language, even outperforming some state-of-the-art video models!

Takeaways: (1) these datasets may be well-addressed with a "single frame", (2) video-language models may be held back by processing noisy frames.

6/10
These takeaways hold even for datasets explicitly designed for temporal/causal video-language understanding!

7/10
We take steps to "close the loop" on both takeaways. First, we show ATP can help identify temporally challenging (multi-frame) data, to better measure progress on video design components (temporal, motion, etc.). ATP is promising for future in-the-loop dataset design! (sketch below)

8/10
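
For instance, a hypothetical ATP-in-the-loop filter (function names are ours): questions a trained ATP already answers from one frame are atemporally solvable, and the rest form a harder, more genuinely temporal subset.

```python
# Hypothetical sketch of ATP-guided dataset analysis (names/signatures are ours).
# Assumption: atp_predict(video, question) -> answer, using a single selected frame.
def split_by_atp(dataset, atp_predict):
    easy, hard = [], []
    for video, question, answer in dataset:
        if atp_predict(video, question) == answer:
            easy.append((video, question, answer))   # solvable from one frame
        else:
            hard.append((video, question, answer))   # likely needs multi-frame reasoning
    return easy, hard
```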
Second, we show how ATP can plug into the input of a multi-frame temporal model, improving both accuracy (less noise) and efficiency (fewer frames). (sketch below)

9/10
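
A sketch of the plug-in idea (our illustration; `temporal_model`, `k`, and the top-k rule are assumptions, not the paper's exact pipeline):

```python
# Sketch: ATP scores as a front-end frame filter for a multi-frame temporal model.
import torch

def atp_filtered_inference(frame_feats, atp_scores, temporal_model, k=4):
    # frame_feats: [T, d] frozen per-frame encodings
    # atp_scores:  [T]    per-frame scores from the (frozen) ATP selector
    top = torch.topk(atp_scores, k).indices.sort().values  # keep temporal order
    selected = frame_feats[top]            # [k, d]: drop low-scoring, noisy frames
    return temporal_model(selected)        # cheaper too: k << T frames processed
```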
For more, please visit our oral talk (Session 1.2.3; this afternoon) and poster (Session 1.2).
Project: atp-video-language.stanford.edu
Paper: arxiv.org/abs/2206.01720
Video:

10/10
