In our #CVPR2022 Oral, we introduce the atemporal probe (ATP) to analyze *atemporal* (single frame) bias in video-language, with surprising results! (see 🧵)
The promise of videos is the potential to go *beyond* image-centric understanding (people, objects, scenes, etc.) towards event temporality, causality, and dynamics. Ideally, we want video-language benchmarks and models to realize this promise.
2/10
Our paper focuses on a fundamental question in video research: to what extent can "image-centric" understanding address "video" understanding?
Consider the example below: can we answer the question with only a single frame?
3/10
Standard techniques for measuring "image-centric" understanding: sampling a random frame, mean pooling frame features... (sketch below)
But videos can be naturally noisy! (frames w/ camera blur, weird angles, uninformative content)
Maybe these standard methods under-represent the real boundary between image and video understanding?
4/10
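A minimal sketch of these two baselines (illustrative only, assuming precomputed per-frame CLIP features; names like `frame_feats` are placeholders, not from our code):
```python
# Two common "image-centric" baselines over frozen per-frame features.
import torch

def random_frame_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    """Pick one frame at random and use its embedding for the task head."""
    idx = torch.randint(0, frame_feats.shape[0], (1,))
    return frame_feats[idx].squeeze(0)

def mean_pool_baseline(frame_feats: torch.Tensor) -> torch.Tensor:
    """Average all frame embeddings; noisy frames dilute the signal."""
    return frame_feats.mean(dim=0)
```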
What if we were to select a "good" frame instead? This is the core idea of our atemporal probe (ATP) model: use a *frozen* image-language encoder (CLIP) on a few randomly sampled frames, then select a single "good" encoding (without any temporal information) to pass on to the final task (sketch below).
5/10
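A rough sketch of this selection idea (the MLP scorer and Gumbel-softmax selection here are illustrative assumptions, not the exact design in the paper):
```python
# Score frozen, unordered frame encodings with a lightweight selector
# (no temporal position info) and pass a single selected encoding on
# to the task head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbe(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Small scorer over individual frame encodings (illustrative choice).
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_candidate_frames, dim) from a *frozen* CLIP image
        # encoder, with no ordering/temporal information attached.
        logits = self.scorer(frame_feats).squeeze(-1)           # (num_frames,)
        if self.training:
            # Differentiable (straight-through) selection during training.
            weights = F.gumbel_softmax(logits, tau=1.0, hard=True)
            return weights @ frame_feats                        # one "good" encoding
        return frame_feats[logits.argmax()]                     # hard selection at test time
```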
Surprising: ATP gives a strong "image-centric" bound for video-language tasks, even outperforming some state-of-the-art video models!
Takeaways: (1) many datasets may be well-addressed with a single (well-chosen) frame, (2) video-language models may be held back by having to process noisy frames.
6/10
These takeaways hold even for datasets explicitly designed for temporal/causal video-language understanding!
7/10
We take steps to "close the loop" on both takeaways: First, we show ATP can help identify temporally challenging (truly multi-frame) data, to better measure progress on video design components (temporal modeling, motion, etc.). ATP is promising for future in-the-loop dataset design! (sketch below)
8/10
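One simple way to operationalize this with an ATP-style probe (an illustrative recipe, not the paper's exact protocol):
```python
# Flag examples that the single-frame probe gets wrong as candidates for
# a harder, more "temporal" evaluation subset.
def split_by_atp(examples, atp_predict):
    """atp_predict(example) -> predicted answer, using a single selected frame."""
    single_frame_solvable, temporally_challenging = [], []
    for ex in examples:
        if atp_predict(ex) == ex["answer"]:
            single_frame_solvable.append(ex)
        else:
            temporally_challenging.append(ex)
    return single_frame_solvable, temporally_challenging
```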
Second, we show how ATP can plug into the input of a multi-frame temporal model, improving both accuracy (less noise) and efficiency (fewer frames to process). (sketch below)
9/10
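A sketch of this frame-selection front-end (the top-k strategy is an illustrative assumption; `frame_scores` would come from an ATP-style selector as above):
```python
# Keep only the k highest-scoring frames, restored to temporal order,
# and feed just those to the downstream multi-frame temporal model.
import torch

def select_frames_for_temporal_model(frame_feats: torch.Tensor,
                                     frame_scores: torch.Tensor,
                                     k: int = 4) -> torch.Tensor:
    # frame_feats: (num_frames, dim) frozen per-frame encodings
    # frame_scores: (num_frames,) scores from an ATP-style selector
    topk = torch.topk(frame_scores, k=min(k, frame_feats.shape[0])).indices
    keep = torch.sort(topk).values   # restore temporal order for the video model
    return frame_feats[keep]         # fewer, cleaner frames downstream
```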
Excited to introduce a new generation of privacy-protecting smart cameras in our #ICCV2021 Oral paper by @CarlosH_93, @henarfu and myself 🇺🇸🇨🇴. See 🧵 below for details!
Our cameras introduce optical distortions at acquisition time, capturing images that protect the identity of people in the scene while enabling vision algorithms to perform inference accurately.
We achieve this by back-propagating all the way down to the camera lens. To make the lens shape differentiable, we parametrize it using Zernike polynomials.
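A minimal sketch of a learnable Zernike lens parametrization (only a few standard low-order modes, as an illustration; the full differentiable camera pipeline in the paper is more involved and this is not its exact code):
```python
# Parametrize a lens surface/phase map with Zernike coefficients so the
# lens shape can be optimized end-to-end by backprop.
import torch
import torch.nn as nn

class ZernikeLens(nn.Module):
    def __init__(self, resolution: int = 128):
        super().__init__()
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, resolution),
            torch.linspace(-1, 1, resolution),
            indexing="ij",
        )
        rho, theta = torch.sqrt(xs**2 + ys**2), torch.atan2(ys, xs)
        # A small set of standard (unnormalized) Zernike modes on the unit pupil.
        basis = torch.stack([
            2 * rho**2 - 1,                             # defocus
            rho**2 * torch.cos(2 * theta),              # astigmatism (0 deg)
            rho**2 * torch.sin(2 * theta),              # astigmatism (45 deg)
            (3 * rho**3 - 2 * rho) * torch.cos(theta),  # coma (x)
            (3 * rho**3 - 2 * rho) * torch.sin(theta),  # coma (y)
            6 * rho**4 - 6 * rho**2 + 1,                # spherical aberration
        ])
        self.register_buffer("basis", basis * (rho <= 1).float())  # mask to unit disk
        self.coeffs = nn.Parameter(torch.zeros(basis.shape[0]))    # learnable lens shape

    def forward(self) -> torch.Tensor:
        # Lens map = weighted sum of Zernike modes; gradients from the
        # downstream vision + privacy losses flow into self.coeffs.
        return torch.einsum("k,khw->hw", self.coeffs, self.basis)
```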