STEMM Institute Press
Science, Technology, Engineering, Management and Medicine
Research on Multi-modal Data Fusion and Decision Explanation System for Autonomous Driving Based on Large Language Models
DOI: https://doi.org/10.62517/jbdc.202601215
Author(s)
Jinyao Lu
Affiliation(s)
University of Nottingham Ningbo China, Ningbo, Zhejiang, China
Abstract
As autonomous driving technology advances rapidly towards Level 4 (L4, high automation), the question of public trust in driverless systems has become increasingly prominent. Current autonomous driving systems behave as a "decision black box": users find it difficult to understand their decision logic, and this has become a key obstacle to wider adoption of the technology. This paper proposes a multi-modal data fusion and decision explanation system based on large language models, aiming to improve the transparency and interpretability of autonomous driving systems. The system adopts the Transformer architecture, integrates multi-modal data (LiDAR point clouds, camera images, and text instructions), achieves semantic-level understanding through feature-level fusion, and generates natural-language explanations of its decisions. This research offers a new approach to explainable artificial intelligence in autonomous driving systems, with both theoretical and practical value.
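The abstract does not give implementation details, so the following is only a minimal sketch of what feature-level fusion with a Transformer-style attention step might look like. It assumes each modality has already been encoded into a small set of feature vectors. The function names (`project`, `self_attention`, `fuse`), the feature widths, the random projection weights, and the identity query/key/value maps are all hypothetical choices for illustration, not the paper's method.

```python
import math
import random


def project(tokens, weights):
    """Linearly map each token (width d_in) to the shared width d_model."""
    return [[sum(x * w for x, w in zip(tok, row)) for row in weights]
            for tok in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value maps, to keep the sketch short.
    """
    d = len(tokens[0])
    fused, attn = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        attn.append(w)
        fused.append([sum(wi * v[j] for wi, v in zip(w, tokens))
                      for j in range(d)])
    return fused, attn


def fuse(lidar, camera, text, d_model=8, seed=0):
    """Project every modality to d_model, concatenate tokens, then attend."""
    rng = random.Random(seed)
    sequence = []
    for feats in (lidar, camera, text):
        d_in = len(feats[0])
        # Hypothetical projection weights; a real system would learn these.
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)]
                   for _ in range(d_model)]
        sequence.extend(project(feats, weights))
    return self_attention(sequence)


# Toy inputs: 3 LiDAR tokens (width 4), 2 camera tokens (width 6),
# 2 text tokens (width 5). Values are random placeholders, not real data.
rng = random.Random(1)
lidar = [[rng.random() for _ in range(4)] for _ in range(3)]
camera = [[rng.random() for _ in range(6)] for _ in range(2)]
text = [[rng.random() for _ in range(5)] for _ in range(2)]

fused, attn = fuse(lidar, camera, text)
print(len(fused), len(fused[0]))  # 7 tokens, each of width 8
```

Because all tokens share one sequence, each fused token is a weighted mix of LiDAR, camera, and text features, which is the sense of "feature-level" fusion assumed here. In a full system, a language model would then condition on the fused tokens to generate the explanation.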
Keywords
Autonomous Driving; Large Language Models; Multi-Modal Data Fusion; Explainable AI; Decision Explanation