Fine-Grained Sentiment Analysis of Public Opinion Videos Based on Conformer and Multi-Layered Interaction Attention
DOI: https://doi.org/10.62517/jbdc.202401313
Author(s)
Chaolong Liu, Zhengguang Gao, Lihong Zhang*
Affiliation(s)
Research Center for Network Public Opinion Governance, China People’s Police University, Langfang, China
*Corresponding Author.
Abstract
With the rapid development of internet technologies and the widespread adoption of smart devices, social media platforms have become major channels for information dissemination and the expression of public sentiment. New media formats such as short videos, in particular, exert a substantial influence on public opinion guidance and the transmission of emotion, making sentiment analysis of short-video content highly valuable. However, existing research has limitations in modeling modality interactions: it often relies on weighted summation or self-attention mechanisms to fuse extracted features, approaches that fail to fully capture the complex local dependencies and hierarchical structures among modalities. To address these issues, this paper proposes a fine-grained sentiment analysis model for public opinion videos based on a Conformer and multi-layered interaction attention, termed DW-MIACon. The model first utilizes the DeBERTa, CLIP, and Wav2Vec models to extract features from text, images, and audio, respectively. The extracted multimodal features are then fused with a Dynamic Weighted Multi-layered Interaction Attention (DW-MIA) mechanism, yielding rich fused feature representations. Finally, a Conformer deeply integrates the fused features, capturing complex cross-modal interactions and local dependencies. Experimental results demonstrate that the proposed model significantly outperforms existing approaches on multimodal sentiment recognition tasks, notably improving the accuracy of fine-grained sentiment classification and the ability to identify subtle emotional nuances.
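To make the pipeline described in the abstract concrete, the following PyTorch sketch shows one plausible arrangement of the named components. It assumes pre-extracted DeBERTa, CLIP, and Wav2Vec sequence features already projected to a shared dimension; the layer counts, the softmax gating used for "dynamic weighting", and the simplified Conformer block are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One interaction layer: the query modality attends to a context modality."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)  # residual connection


class DWMIAFusion(nn.Module):
    """Assumed form of DW-MIA: stacked cross-modal attention layers whose
    outputs are recombined with input-dependent (dynamic) modality weights."""

    def __init__(self, dim, layers=2):
        super().__init__()
        self.text_layers = nn.ModuleList([CrossModalAttention(dim) for _ in range(layers)])
        self.image_layers = nn.ModuleList([CrossModalAttention(dim) for _ in range(layers)])
        self.audio_layers = nn.ModuleList([CrossModalAttention(dim) for _ in range(layers)])
        self.gate = nn.Linear(3 * dim, 3)  # produces per-sample modality weights

    def forward(self, t, v, a):
        for lt, lv, la in zip(self.text_layers, self.image_layers, self.audio_layers):
            t = lt(t, torch.cat([v, a], dim=1))  # text attends to image + audio
            v = lv(v, torch.cat([t, a], dim=1))  # image attends to text + audio
            a = la(a, torch.cat([t, v], dim=1))  # audio attends to text + image
        pooled = torch.cat([t.mean(1), v.mean(1), a.mean(1)], dim=-1)
        w = torch.softmax(self.gate(pooled), dim=-1)          # (batch, 3)
        return torch.cat([w[:, 0:1, None] * t,
                          w[:, 1:2, None] * v,
                          w[:, 2:3, None] * a], dim=1)        # fused sequence


class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step FFN, self-attention,
    depthwise convolution, half-step FFN (macaron structure)."""

    def __init__(self, dim, heads=4, kernel=15):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),  # depthwise
            nn.BatchNorm1d(dim), nn.SiLU(),
            nn.Conv1d(dim, dim, 1))                                        # pointwise
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)
        q = self.attn_norm(x)
        x = x + self.attn(q, q, q)[0]
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


class DWMIACon(nn.Module):
    """End-to-end sketch: DW-MIA fusion, Conformer blocks, then a classifier."""

    def __init__(self, dim=256, num_classes=7, conformer_blocks=2):
        super().__init__()
        self.fusion = DWMIAFusion(dim)
        self.conformer = nn.Sequential(*[ConformerBlock(dim) for _ in range(conformer_blocks)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feat, image_feat, audio_feat):
        # Inputs are assumed to be pre-extracted sequence features from
        # DeBERTa, CLIP, and Wav2Vec, each already projected to `dim`.
        fused = self.fusion(text_feat, image_feat, audio_feat)
        encoded = self.conformer(fused)
        return self.classifier(encoded.mean(dim=1))  # sequence-pooled logits


if __name__ == "__main__":
    t = torch.randn(2, 20, 256)   # text tokens
    v = torch.randn(2, 16, 256)   # image patches / video frames
    a = torch.randn(2, 50, 256)   # audio frames
    print(DWMIACon()(t, v, a).shape)  # torch.Size([2, 7])
```

The softmax gate here stands in for the dynamic weighting in DW-MIA; the paper may apply weights at a different granularity (per token or per layer), and the number of interaction and Conformer layers would be tuned experimentally.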
Keywords
Public Opinion; Conformer; Multi-layered Interaction Attention; Fine-grained; Sentiment Analysis
References
[1]TANG X. Ethical Reflections on Media Emotional Reporting in the Post-Truth Era. Young Journalist, 2021, (12): 111-112.
[2]WILLIAMS J, KLEINEGESSE S, COMANESCU R, et al. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion // Proceedings of Grand Challenge and Workshop on Human Multimodal Language: Association for Computational Linguistics, 2018: 11-19.
[3]ZADEH A, LIANG P P, MAZUMDER N, et al. Memory Fusion Network for Multi-View Sequential Learning. arXiv preprint arXiv:1802.00927, 2018.
[4]ZADEH A, LIANG P P, PORIA S, et al. Multi-Attention Recurrent Network for Human Communication Comprehension // Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[5]HAN W, CHEN H, GELBUKH A, et al. Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis // Proceedings of the 2021 International Conference on Multimodal Interaction. New York, USA: ACM Press, 2021: 6-15.
[6]MORENCY L P, MIHALCEA R, DOSHI P. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web // Proceedings of the 13th International Conference on Multimodal Interfaces, 2011: 169-176.
[7]YU Y, LIN H, MENG J, et al. Visual and Textual Sentiment Analysis of a Microblog Using Deep Convolutional Neural Networks. Algorithms, 2016, 9(2): 41.
[8]TSAI Y H H, BAI S, LIANG P P, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[9]YANG K, XU H, GAO K. CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis // Proceedings of the 28th ACM International Conference on Multimedia, 2020: 521-528.
[10]FENG C, YANG H, WANG S, et al. Multimodal Sentiment Analysis Based on Top-Down Mask Generation and Stacked Transformers. Computer Engineering and Applications, 1-11 [2024-09-10]. http://kns.cnki.net/kcms/detail/11.2127.TP.20240815.1136.002.html.
[11]WU J J, WANG J Y, ZHU P, et al. Dual-Modal Sentiment Computing Model Based on MLP and Multi-Head Self-Attention Feature Fusion. Computer Applications, 2024, 44(S1): 39-43.
[12]HE P, LIU X, GAO J, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654, 2020.
[13]RADFORD A, KIM J W, HALLACY C, et al. Learning Transferable Visual Models from Natural Language Supervision // International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[14]BAEVSKI A, ZHOU Y, MOHAMED A, et al. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.
[15]CHEN Y, LAI Y B, XIAO A, et al. Multimodal Sentiment Analysis Model Based on CLIP and Cross-Attention. Journal of Zhengzhou University (Engineering Science), 2024, 45(02): 42-50. DOI: 10.13705/j.issn.1671-6833.2024.02.003.
[16]ZHOU J F, YE S R, WANG H. Text Sentiment Classification Based on Deep Convolutional Neural Network Model. Computer Engineering, 2019, 45(03): 300-308. DOI: 10.19678/j.issn.1000-3428.0050043.
[17]ZADEH A, CHEN M, PORIA S. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv preprint arXiv:1707.07250, 2017.
[18]LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors. arXiv preprint arXiv:1806.00064, 2018.
[19]HAN W, CHEN H, PORIA S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. arXiv preprint arXiv:2109.00412, 2021.