Research on Dialogue Interaction Mechanism of Multimodal Speech Agents Driven by Large Language Models

Journal of Big Data and Computing (ISSN: 2959-0590), Vol. 4 No. 2 (JBDC 2026)

DOI: https://doi.org/10.62517/jbdc.202601216

Author(s)

Mi’na Yan

Affiliation(s)

College of Literature, Xizang Minzu University, Xianyang, Shaanxi, China

Abstract

This study addresses three limitations of traditional speech agents: weak cross-modal semantic understanding, poor context-aware dialogue management, and inflexible response generation. We propose an LLM-Driven Multimodal Speech Agent (LMA) dialogue interaction mechanism with four components: (1) a hierarchical cross-modal fusion architecture with dynamic attention mechanisms; (2) a cross-modal semantic alignment method based on LLM-guided contrastive learning; (3) a context-aware dialogue state manager with memory compression and dynamic attention; and (4) a reinforcement learning-based strategy for dynamic multimodal response generation. Experiments are conducted on three public benchmark datasets, namely Fluent Speech Commands (FSC), DailyDialog, and Multimodal-E4, and on a self-collected real-world dataset. LMA achieves an intent recognition accuracy of 92.1% (vs. 91.3% for GPT-4o), a dialogue coherence score of 0.91 (vs. 0.87 for GPT-4o), and a task completion rate of 87.3% (vs. 85.6% for GPT-4o), outperforming four baseline methods and two commercial systems. An ablation study validates the contribution of each module.
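The abstract does not give the alignment objective, so the following is only a minimal sketch of what a contrastive cross-modal alignment loss can look like, assuming a generic CLIP-style symmetric InfoNCE over paired speech and text embeddings. The function names, the temperature of 0.07, and the toy embeddings are illustrative assumptions, not details from the paper, and the LLM-guided part of the method is not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed, not the paper's exact loss).

    speech_emb[i] and text_emb[i] are treated as a matched pair; every
    other item in the batch serves as a negative.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: each row should peak on its diagonal entry.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy 2-D embeddings (hypothetical): correctly paired vs. deliberately mismatched.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_mismatched = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(speech, text_matched)
loss_mismatched = symmetric_contrastive_loss(speech, text_mismatched)
print(loss_matched < loss_mismatched)
```

Under these assumptions the loss is small when each speech embedding is closest to its own text embedding and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.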

Keywords

Large Language Models; Multimodal Speech Agents; Cross-Modal Semantic Alignment; Dialogue State Management; Reinforcement Learning
