TY - GEN
T1 - Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
AU - Qian, Rui
AU - Li, Yuxi
AU - Liu, Huabin
AU - See, John
AU - Ding, Shuangrui
AU - Liu, Xian
AU - Li, Dian
AU - Lin, Weiyao
N1 - Funding Information:
In this work, we propose a multi-level feature optimization framework for unsupervised video representation learning. We perform instance- and semantic-wise discrimination on high-level features, thereby employing reliable self-supervisory cues to optimize lower-level representations for improved generalization. Meanwhile, we also leverage multi-level features of various temporal spans for robust temporal modeling. Extensive experiments demonstrate that our learned representations achieve superior performance on a series of downstream tasks. Acknowledgement: The paper is supported in part by the following grants: National Key Research and Development Program of China Grant (No. 2018AAA0100400), National Natural Science Foundation of China (No. 61971277), and our corporate sponsors.
Publisher Copyright:
© 2021 IEEE
PY - 2022/2/28
Y1 - 2022/2/28
N2 - The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have focused mainly on high-level semantics and neglected lower-level representations and their temporal relationships, which are crucial for general video understanding. To address these challenges, this paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are used to build distribution graphs that guide low-level and mid-level feature learning. We also devise a simple temporal modeling module from multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve representation ability in video understanding. Code is available.
UR - http://www.scopus.com/inward/record.url?scp=85123338990&partnerID=8YFLogxK
DO - 10.1109/ICCV48922.2021.00789
M3 - Conference contribution
AN - SCOPUS:85123338990
SP - 7970
EP - 7981
BT - 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
PB - IEEE
T2 - 18th IEEE/CVF International Conference on Computer Vision 2021
Y2 - 11 October 2021 through 17 October 2021
ER -