Improving Dense Video Captioning with a Transformer-based Multimodal Fusion Model
DOI: https://doi.org/10.62517/jiem.202403407
Author(s)
Yixuan Liu, Ziwei Zhou*, Shuyue Hui, Haoyuan Ma, Hongju Li, Zhibo Zhang
Affiliation(s)
School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, China
*Corresponding Author
Abstract
Dense Video Captioning (DVC) plays a pivotal role in advancing video understanding within computer vision and natural language processing. Traditional DVC models have focused predominantly on visual information, often neglecting the auditory component. To address this limitation, we propose a Transformer-based multimodal fusion model that integrates audio and visual cues for comprehensive multimodal input processing. Built on an encoder-decoder architecture, the model processes the audio and visual streams jointly: the feature encoder combines self-attention mechanisms with convolutional neural networks for precise audio feature encoding, while the decoder performs multimodal fusion, using intermodal confidence scores to weight and adaptively integrate the two inputs. A feedforward neural network enhances the representation of previously generated text, and skip connections remove redundant information so that captioning concentrates on key video features. Extensive validation on the benchmark datasets MSR-VTT and MSVD demonstrates that our model outperforms existing methods, achieving BLEU-4, ROUGE, METEOR, and CIDEr scores of 0.427, 0.618, 0.294, and 0.532 on MSR-VTT, and 0.539, 0.741, 0.369, and 0.976 on MSVD. By effectively leveraging the complementary strengths of audio and visual data, our model offers precise and comprehensive video content interpretation and sets a new benchmark for DVC.
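As a concrete illustration of the confidence-weighted fusion described above, the following is a minimal PyTorch sketch. It assumes standard design choices that the abstract does not specify (hidden size d_model, 8 attention heads, sigmoid gates computed from the text query and each modality's attended context); the paper's actual layer layout may differ.

```python
import torch
import torch.nn as nn


class ConfidenceWeightedFusion(nn.Module):
    """Hedged sketch: gate each modality's attended context by a learned
    confidence score before merging it with the caption (text) state."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per modality (assumed design choice).
        self.attn_vis = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.attn_aud = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Scalar confidence per modality, conditioned on the text query and its context.
        self.conf_vis = nn.Linear(2 * d_model, 1)
        self.conf_aud = nn.Linear(2 * d_model, 1)
        # Feedforward refinement of the fused representation.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_state, vis_feats, aud_feats):
        # Cross-attend the caption history over each modality's encoder output.
        ctx_v, _ = self.attn_vis(text_state, vis_feats, vis_feats)
        ctx_a, _ = self.attn_aud(text_state, aud_feats, aud_feats)
        # Intermodal confidence scores gate each context vector.
        w_v = torch.sigmoid(self.conf_vis(torch.cat([text_state, ctx_v], dim=-1)))
        w_a = torch.sigmoid(self.conf_aud(torch.cat([text_state, ctx_a], dim=-1)))
        fused = w_v * ctx_v + w_a * ctx_a
        # Skip connection keeps the textual history; the FFN refines the fusion.
        return self.norm(text_state + self.ffn(fused))


# Usage with toy tensors: batch of 2, 20 caption tokens, 40 video and 60 audio frames.
fusion = ConfidenceWeightedFusion(d_model=512)
out = fusion(torch.randn(2, 20, 512), torch.randn(2, 40, 512), torch.randn(2, 60, 512))
print(out.shape)  # torch.Size([2, 20, 512])
```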
Keywords
Audio-Visual Integration; Dense Video Captioning; Multimodal Fusion; Transformer Networks