🧵/5
Most video captioning systems can only describe a single event in short videos. But natural videos may contain numerous events. So we focus on the dense video captioning task, which requires temporally localizing and captioning all events in untrimmed minutes-long videos 🎞️.
2/5