Happy to release GOAL ⚽️, a multimodal dataset based on football highlights that includes 1) videos, 2) human transcriptions, and 3) a Wikidata-based KB with statistics about players and teams for every match.
[1/7]: Previous video benchmarks rely on movies or TV series, which typically involve scripted interaction between characters rather than visually grounded language. In GOAL, we focus instead on football commentaries because they involve visually grounded language.
[2/7]: GOAL pushes the boundaries of current multimodal models because it requires the encoding of 1) videos; 2) commentary; 3) KB information. All these elements are essential when generating a sound and coherent commentary for a football video.
[3/7]: GOAL contains highlights from several football leagues (e.g., Serie A and the Premier League), professional human transcriptions, and information about each match derived from Wikidata. Just like real commentators, models should use all these modalities.
[4/7]: We set up a multi-faceted evaluation based on several tasks: 1) commentary retrieval, 2) frame reordering, 3) moment retrieval, and 4) commentary generation. We use the HERO model for our evaluation and show that the commentary generation task is particularly challenging.
[5/7]: We analysed both HERO and BART predictions when considering just the language or the multimodal context as a reference. This highlighted the poor visual grounding ability of this model. We also show that it’s important to incorporate the KB information to boost performance.
[6/7]: GOAL can also serve as a resource for visual context-aware speech recognition, multimodal fact-checking, and multimodal activity recognition. Additionally, GOAL represents an interesting benchmark for models that directly solve the task using audio information.