Research on Object Detection Algorithms Based on Multimodal Images
DOI: https://doi.org/10.62517/jike.202604214
Author(s)
Quanling Ma
Affiliation(s)
Anhui University of Technology, Ma'anshan, Anhui, China
*Corresponding Author
Abstract
Object detection is a core task in computer vision and is essential to applications such as environmental perception for autonomous driving and intelligent security surveillance. Single-modality RGB detection methods rely solely on visible-light information. In suboptimal scenes such as low light or cluttered backgrounds, they often extract insufficient features and confuse objects with the background, which reduces detection robustness. To address this limitation, two RGB-thermal infrared multimodal fusion detection algorithms are proposed. The first, the Cross-modal Attention Enhancement Network, employs a joint channel-position attention mechanism to extract deep correlated features across modalities and incorporates a recurrent enhancement strategy to optimise the representation of object edge details. The second, the Progressive Feature Fusion Network, employs a symmetric dual-branch architecture that increases the weight given to the thermal infrared modality, and adopts a strategy combining coarse-to-fine perception with fine-grained fusion to achieve efficient hierarchical feature aggregation.
Keywords
Multimodal Fusion; Object Detection; RGB-Thermal Infrared
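The joint channel-position attention named in the abstract can be illustrated with a small toy sketch. This is not the paper's network: the abstract gives no layer details, so everything below is an assumption made for illustration. The sketch uses no learned parameters, represents a feature map as nested lists of shape [channels][height][width], and lets each modality be gated by channel and position weights computed from the other modality. The function names (channel_weights, position_weights, gate, cross_modal_fuse) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_weights(feat):
    """One gate per channel: sigmoid of the global average over H x W."""
    weights = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def position_weights(feat):
    """One gate per spatial position: sigmoid of the mean across channels."""
    num_channels = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [
            sigmoid(sum(feat[c][i][j] for c in range(num_channels)) / num_channels)
            for j in range(width)
        ]
        for i in range(height)
    ]


def gate(feat, ch_w, pos_w):
    """Scale every value by its channel gate and its position gate."""
    return [
        [
            [feat[c][i][j] * ch_w[c] * pos_w[i][j] for j in range(len(feat[c][i]))]
            for i in range(len(feat[c]))
        ]
        for c in range(len(feat))
    ]


def cross_modal_fuse(rgb, thermal):
    """Assumed fusion rule: gate each modality with the other's weights, then add."""
    rgb_gated = gate(rgb, channel_weights(thermal), position_weights(thermal))
    thermal_gated = gate(thermal, channel_weights(rgb), position_weights(rgb))
    return [
        [
            [a + b for a, b in zip(row_r, row_t)]
            for row_r, row_t in zip(ch_r, ch_t)
        ]
        for ch_r, ch_t in zip(rgb_gated, thermal_gated)
    ]


if __name__ == "__main__":
    # A dark RGB frame (all zeros) paired with an informative thermal frame.
    rgb = [[[0.0, 0.0], [0.0, 0.0]]]
    thermal = [[[1.0, 2.0], [3.0, 4.0]]]
    print(cross_modal_fuse(rgb, thermal))
```

In this toy setting a dark RGB input contributes nothing itself, and the thermal features pass through scaled by the neutral gates derived from the RGB map. A trained network would replace the fixed pooling-plus-sigmoid gates with learned layers.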