Research on Dialogue Interaction Mechanism of Multimodal Speech Agents Driven by Large Language Models
DOI: https://doi.org/10.62517/jbdc.202601216
Author(s)
Mi’na Yan
Affiliation(s)
College of Literature, Xizang Minzu University, Xianyang, Shaanxi, China
Abstract
This study addresses three limitations of traditional speech agents: cross-modal semantic understanding, context-aware dialogue management, and dynamic response generation. We propose an LLM-Driven Multimodal Speech Agent (LMA) dialogue interaction mechanism with four components: (1) a hierarchical cross-modal fusion architecture with dynamic attention mechanisms; (2) a cross-modal semantic alignment method that uses LLM-guided contrastive learning; (3) a context-aware dialogue state manager with memory compression and dynamic attention; and (4) a reinforcement learning-based strategy for dynamic multimodal response generation. Experiments are conducted on three public benchmark datasets, namely Fluent Speech Commands (FSC), DailyDialog, and Multimodal-E4, and on a self-collected real-world dataset. LMA achieves an intent recognition accuracy of 92.1% (vs. 91.3% for GPT-4o), a dialogue coherence score of 0.91 (vs. 0.87 for GPT-4o), and a task completion rate of 87.3% (vs. 85.6% for GPT-4o), significantly outperforming four baseline methods and two commercial systems. An ablation study validates the effectiveness of each module.
Keywords
Large Language Models; Multimodal Speech Agents; Cross-Modal Semantic Alignment; Dialogue State Management; Reinforcement Learning