Fine-Grained Spatio-temporal Feature Learning Network for Video Understanding
DOI: https://doi.org/10.62517/jbdc.202401103
Author(s)
Meng Li, Zongwen Bai*, Meili Zhou
Affiliation(s)
School of Physics and Electronic Information, Yan’an University, Yan’an, Shaanxi, China *Corresponding Author.
Abstract
To improve video action recognition accuracy, we introduce the Fine-Grained Spatio-Temporal Feature Learning Network (FSTFL-Net), an enhanced model built on GSF-ResNet-50 that targets the challenges of fine-grained recognition. The model learns rich, detailed spatio-temporal features at low computational overhead, capturing the local representations shared by similar actions as well as the subtle differences between them. FSTFL-Net introduces a spatio-temporal correlation (STC) module that identifies the spatial neighborhood of each local region and the correlations between adjacent and non-adjacent frames, converting these correlations into relationship values that establish spatial connections within a frame and temporal connections across frames. This strengthens the network's ability to discriminate fine-grained actions and thereby improves recognition performance. In a series of rigorous experiments, the proposed approach yields notable gains in recognition accuracy on four action recognition datasets; on Something-Something V1 in particular, FSTFL-Net improves Top-1 accuracy by 2.65% over the baseline model.
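The abstract does not come with an implementation, but as a rough illustration of the mechanism it describes, the following PyTorch sketch shows one way a spatio-temporal correlation block could compute correlations between each position and a local spatial neighborhood, both within the same frame and across all other (adjacent and non-adjacent) frames, and fold the resulting relationship values back into the features. The module name STCModule, the neighborhood size k, and the residual projection are assumptions of this sketch, not the authors' published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STCModule(nn.Module):
    """Illustrative spatio-temporal correlation block (a sketch, not the
    paper's implementation).

    For each spatial position it computes cosine correlations with a k x k
    spatial neighborhood in every frame of the clip, including its own frame
    (spatial connections) and all other frames, adjacent or not (temporal
    connections), then projects these relationship values back to the channel
    dimension and adds them to the input as a residual.
    """

    def __init__(self, channels: int, num_frames: int, k: int = 3):
        super().__init__()
        self.t = num_frames
        self.k = k
        # One relationship value per neighbor per frame -> project to channels.
        self.proj = nn.Conv2d(num_frames * k * k, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W), the T frames of each clip stored contiguously.
        nt, c, h, w = x.shape
        n = nt // self.t
        feat = F.normalize(x, dim=1)  # unit-norm channels -> cosine similarity
        # Unfold a k x k neighborhood around every position of every frame.
        nbr = F.unfold(feat, self.k, padding=self.k // 2)  # (N*T, C*k*k, H*W)
        nbr = nbr.view(n, self.t, c, self.k * self.k, h * w)
        q = feat.view(n, self.t, c, 1, h * w)
        # Correlate each frame's positions with the neighborhoods of ALL
        # frames (sum over channels c): result is (N, T, T, k*k, H*W).
        rel = torch.einsum('ntcoh,nsckh->ntskh', q, nbr)
        rel = rel.reshape(nt, self.t * self.k * self.k, h, w)
        # Convert relationship values back to C channels, add residually.
        return x + self.norm(self.proj(rel))


# Shape check: an 8-frame clip of 64-channel 14x14 features keeps its shape.
if __name__ == "__main__":
    x = torch.randn(2 * 8, 64, 14, 14)
    out = STCModule(channels=64, num_frames=8)(x)
    print(out.shape)  # torch.Size([16, 64, 14, 14])
```

In a GSF-ResNet-50-style backbone, a block like this would plausibly sit after a residual stage; the neighborhood size k, the correlation measure, and the insertion points are all free design choices in this sketch, and the published FSTFL-Net may differ in any of these details.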
Keywords
Fine-grained Spatio-temporal Feature; Action Recognition; Spatio-temporal Features; Spatio-temporal Correlation; Top-1 Accuracy