WikiVideo is a challenging task that VideoLLMs can’t do!
It requires inference across multiple videos (avg. 8 per topic) and requires models to recognize low-level semantic features, like entities, and draw higher-level inferences about the unfolding event.
To tackle this challenge, we present a collaborative, test-time scalable method: Collaborative Article Generation (CAG). CAG involves collaboration between a VideoLLM and a reasoning model to iterate through video content and synthesize it into an article.
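To make the iterate-and-synthesize idea concrete, here's a minimal toy sketch of a CAG-style loop. All function names, stubs, and the simple one-pass refinement strategy are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Toy sketch of a Collaborative Article Generation (CAG)-style loop.
# The two model calls below are stand-in stubs (assumptions), not real models.

def caption_video(video_id: str) -> str:
    """Stand-in for the VideoLLM: extracts low-level content (entities,
    actions) from one video."""
    return f"entities and actions observed in {video_id}"

def refine_article(draft: str, new_evidence: str) -> str:
    """Stand-in for the reasoning model: folds new per-video evidence
    into the running article draft."""
    if not draft:
        return new_evidence
    return draft + " " + new_evidence

def generate_article(video_ids: list[str]) -> str:
    """Iterate over the topic's videos (avg. 8 in WikiVideo), refining
    the article with each video's content."""
    article = ""
    for vid in video_ids:
        evidence = caption_video(vid)                 # VideoLLM pass
        article = refine_article(article, evidence)   # reasoning-model pass
    return article

print(generate_article(["vid_1", "vid_2"]))
```

The key design point is that no single VideoLLM call sees all videos at once; the reasoning model accumulates cross-video structure over multiple passes, which is also what makes the method test-time scalable.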
We find that CAG performs better than existing methods across all metrics, but still has a long way to go! There is plenty of future work in efficient and multi-video inference, high-level understanding, and improving video retrieval performance!
If you’re interested in article generation from videos and other tasks that require understanding events in videos, check out our ACL Workshop MAGMAR and our related work!
This work was done in collaboration w/ colleagues at Johns Hopkins University: Reno Kriz, William Walden, @kesnet50, Hannah Recknor, @EYangTW, Francis Ferraro, and @ben_vandurme