Deep CNN object features for improved action recognition in low quality videos

Saimunur Rahman, John See*, Chiung Ching Ho

*Corresponding author for this work

Human action recognition from low quality video remains a challenging task for the action recognition community. Recent state-of-the-art methods such as space-time interest points (STIP) use shape and motion features to characterize actions. However, STIP features are over-reliant on video quality and lack robust object semantics. This paper harnesses the robustness of deeply learned object features from off-the-shelf convolutional neural network (CNN) models to improve action recognition under low quality conditions. A two-channel framework is proposed that aggregates shape and motion features extracted using a STIP detector with frame-level object features obtained from the final few layers (i.e., FC6, FC7, and the softmax layer) of a state-of-the-art image-trained CNN model. Experimental results on low quality versions of two publicly available datasets, UCF-11 and HMDB51, show that using CNN object features together with conventional shape and motion features can greatly improve the performance of action recognition in low quality videos.
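As an illustration of the two-channel idea, the following minimal sketch fuses a video-level STIP descriptor with frame-level CNN activations. It is not the authors' implementation: the abstract does not specify how the channels are aggregated, so average pooling over frames, L2 normalization, and concatenation are assumptions, as are the function names and the toy feature dimensions.

```python
import numpy as np


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    vector = np.asarray(vector, dtype=np.float64)
    return vector / (np.linalg.norm(vector) + eps)


def pool_frame_features(frame_features):
    """Average frame-level CNN activations of shape (n_frames, dim) into one vector.

    Average pooling is an assumption; the abstract does not state the pooling used.
    """
    return np.asarray(frame_features, dtype=np.float64).mean(axis=0)


def fuse_channels(stip_descriptor, frame_features):
    """Concatenate the normalized shape/motion and object channels.

    Concatenation is a hypothetical stand-in for the aggregation step.
    """
    object_descriptor = pool_frame_features(frame_features)
    return np.concatenate(
        [l2_normalize(stip_descriptor), l2_normalize(object_descriptor)]
    )


# Toy data: a 4000-dim encoded STIP descriptor and 30 frames of 4096-dim
# activations (dimensions chosen for illustration only).
rng = np.random.default_rng(0)
stip_descriptor = rng.random(4000)
frame_features = rng.random((30, 4096))
video_vector = fuse_channels(stip_descriptor, frame_features)
```

The fused `video_vector` could then be passed to any standard classifier. Normalizing each channel before concatenation keeps either channel from dominating purely because of its scale.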

  • Action recognition
  • CNN
  • Deep learning
  • Feature representation
  • Low quality video
  • STIP

