Roadside Perception Fusion Algorithm Based on Transformer and Design of Lightweight Edge Computing Platform
DOI: https://doi.org/10.62517/jike.202604204
Author(s)
Liangdong Zuo1,2,*, Jie Li3, Jia Liu1, Hejia Li1, Mingfei Huang1
Affiliation(s)
1Chongqing College of Architecture and Technology, Chongqing, China
2Chongqing Research Institute of Shanghai Jiao Tong University, Chongqing, China
3Chongqing University of Science and Technology, Chongqing, China
*Corresponding Author
Abstract
Roadside perception systems face three challenges: the limitations of any single sensor, ineffective multi-modal fusion, and heavy models that are incompatible with edge devices. To address them, this paper proposes a Transformer-based fusion algorithm and a lightweight edge computing platform. It designs a Multi-Modal Transformer Fusion (MMTF) network that uses dual encoders and cross-attention for adaptive feature fusion, compresses the network through knowledge distillation and layer pruning, and builds an edge platform based on the NVIDIA Jetson Orin NX. Experiments on the DAIR-V2X and TUMTraf Intersection datasets show that MMTF achieves a 3D object detection mAP of 92.3%, which is 6.8% and 9.2% higher than CNN-based and single-sensor algorithms, respectively. The lightweight model reduces parameters by 65% and latency by 58%, and the platform runs stably at 30 FPS (12 W). This provides a high-performance, low-cost solution for roadside perception in intelligent transportation systems (ITS), with both theoretical and engineering value.
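The cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the MMTF network itself: the abstract gives no architectural details, so the token counts, the feature width `d_model`, the random projection matrices, and the choice of camera tokens as queries over LiDAR tokens are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Queries come from one modality, keys and values from the other,
    so each query token gathers a weighted mix of the other modality.
    """
    q = query_tokens @ w_q        # (n_query, d)
    k = context_tokens @ w_k      # (n_context, d)
    v = context_tokens @ w_v      # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Hypothetical encoder outputs (assumed shapes, random values).
rng = np.random.default_rng(0)
d_model = 16
camera_tokens = rng.normal(size=(6, d_model))   # e.g. image patch features
lidar_tokens = rng.normal(size=(10, d_model))   # e.g. point-cloud pillar features

# Hypothetical projection weights; in a trained model these are learned.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

attended, weights = cross_attention(camera_tokens, lidar_tokens, w_q, w_k, w_v)

# Residual connection: keep the camera features and add what was gathered from LiDAR.
fused = camera_tokens + attended
print(fused.shape, weights.shape)
```

Because the attention weights are recomputed for every input, the contribution of each LiDAR token to each camera token changes with the scene, which is the sense in which such fusion is adaptive. A dual-encoder design would typically apply the same operation in the other direction as well, with LiDAR tokens as queries.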
Keywords
Roadside Perception; Multi-Modal Fusion; Transformer; Cross-Attention; Lightweight Model