Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization

Hao Liu1,
Lirong Jie1,
Zixu Li1,
1Shandong University
2Harbin Institute of Technology (Shenzhen)

The usage instructions and model are currently being organized and will be available soon.

Framework

Abstract

Video moment localization (VML) aims to identify the temporal boundary semantically matching the given query. Compared to other paradigms, point-supervised VML balances localization accuracy and annotation cost. However, it is still in its infancy due to explicit granularity alignment and implicit scale perception. To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework aimed at leveraging limited single-frame annotation for correspondence learning. It explicitly models semantic relations of different feature granularities and adaptively mines the implicit semantic scale, thereby enhancing feature representations of varying granularities and scales. SG-SCI employs a granularity correspondence alignment module to align semantic information by leveraging latent prior knowledge. Then we develop a scale correspondence learning strategy to identify and address semantic scale differences. Extensive experiments on benchmark datasets have demonstrated the promising performance of our model over several state-of-the-art competitors.

BibTeX

@inproceedings{wang2024explicit,
  title={Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization},
  author={Wang, Kun and Liu, Hao and Jie, Lirong and Li, Zixu and Hu, Yupeng and Nie, Liqiang},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9214--9223},
  year={2024}
}

Try not to become a man of success but rather try to become a man of value.

Created by Gatsby & React & Tailwind CSS & Emotion

Co-development with my bro Russellwzr .