Fine-Grained Spatio-temporal Feature Learning Network for Video Understanding
Meng Li, Zongwen Bai*, Meili Zhou
School of Physics and Electronic Information, Yan’an University, Yan’an, Shaanxi, China
*Corresponding Author.
To improve the accuracy of video action recognition, we propose the Fine-Grained Spatio-Temporal Feature Learning Network (FSTFL-Net), an enhanced model built upon GSF-ResNet-50 that addresses the challenges of fine-grained recognition tasks. The model learns rich, detailed spatio-temporal features at low computational overhead, capturing both the local representations shared by similar actions and the subtle differences between them. FSTFL-Net introduces a spatio-temporal correlation (STC) module that captures the correlation between each local region and its spatial neighborhood, as well as the correlations across adjacent and non-adjacent frames, and converts these correlations into relationship values that establish spatial connections within the same frame and temporal connections between different frames. This improves the discriminative ability of FSTFL-Net on fine-grained actions and thereby its recognition performance. In experiments on four action recognition datasets, the proposed approach notably improves recognition accuracy. Specifically, on the Sth-Sth V1 dataset, FSTFL-Net achieves a 2.65% increase in Top-1 accuracy over the baseline model.
Fine-grained Spatio-temporal Feature; Action Recognition; Spatio-temporal Features; Spatio-temporal Correlation; Top-1 Accuracy
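To make the idea of the STC module concrete, the following is a minimal NumPy sketch of one way to turn local spatial and temporal correlations into relationship values. It is not the FSTFL-Net implementation: the function name `local_correlation`, the use of cosine similarity, the square neighborhood of size `2 * radius + 1`, the zero padding at image borders, and the clamping of frame indices at clip boundaries are all assumptions made for illustration.

```python
import numpy as np


def local_correlation(features, t_offset=0, radius=1, eps=1e-8):
    """Illustrative local correlation volume (hypothetical helper).

    features: array of shape (T, C, H, W), one C-channel map per frame.
    t_offset: 0 compares each frame with itself (spatial relations);
              1 pairs adjacent frames, larger values non-adjacent frames.
    Returns an array of shape (T, K*K, H, W) with K = 2*radius + 1, where
    channel dy*K + dx holds the cosine similarity between position (h, w)
    in frame t and position (h + dy - radius, w + dx - radius) in the
    partner frame.
    """
    features = np.asarray(features, dtype=np.float64)
    T, C, H, W = features.shape

    # L2-normalise each channel vector so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / (norms + eps)

    # Partner frame for every t, clamped at the clip boundaries (assumption).
    partner_idx = np.clip(np.arange(T) + t_offset, 0, T - 1)
    partner = unit[partner_idx]

    # Zero-pad spatially so border positions still have a full neighborhood.
    padded = np.pad(
        partner, ((0, 0), (0, 0), (radius, radius), (radius, radius))
    )

    K = 2 * radius + 1
    out = np.empty((T, K * K, H, W), dtype=np.float64)
    for dy in range(K):
        for dx in range(K):
            shifted = padded[:, :, dy:dy + H, dx:dx + W]
            out[:, dy * K + dx] = (unit * shifted).sum(axis=1)
    return out


rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 8, 5, 5))            # 4 frames, 8 channels, 5x5 maps
spatial = local_correlation(clip, t_offset=0)       # relations within each frame
adjacent = local_correlation(clip, t_offset=1)      # relations to the next frame
non_adjacent = local_correlation(clip, t_offset=2)  # relations two frames ahead
print(spatial.shape, adjacent.shape, non_adjacent.shape)
```

In this sketch, stacking the outputs for several values of `t_offset` gives each position a vector of relationship values covering its own frame and other frames. A learned module would presumably process such values further; that step is outside the scope of this example.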
