Multimodal Recognition Methods for Maritime Vessel Identification in Complex Scenarios
DOI: https://doi.org/10.62517/jike.202504101
Author(s)
Qiuyu Tian1,*, Kun Wang2, Hongwei Tang1,3, Rui Zhu2
Affiliation(s)
1Institute of Information Superbahn, Nanjing, Jiangsu, China
2Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
3University of Chinese Academy of Sciences, Nanjing, Jiangsu, China
*Corresponding Author.
Abstract
Maritime vessel recognition is a crucial task in maritime monitoring and traffic management, supporting applications such as vessel tracking, operational safety, and anomaly detection. Traditional vessel detection and recognition rely heavily on manual inspection and are hampered by environmental noise, low visibility, and real-time performance requirements, making high accuracy and reliability difficult to achieve. This study develops a multimodal classification method that integrates image data with vessel identification numbers to improve the accuracy of automated vessel recognition. A complete recognition system combining vessel detection and classification was developed and deployed in a maritime monitoring center; with vessel identification numbers included as auxiliary data, the system achieved 89% accuracy in practical operation. The contribution of this work is the application of a multimodal model to the challenges of vessel recognition, significantly improving recognition accuracy and establishing an intelligent recognition system suitable for real-world maritime monitoring and analysis.
Keywords
Maritime Vessel Recognition; Multimodal Image Classification; Image Augmentation; Maritime Intelligence Analysis
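The abstract describes fusing vessel imagery with vessel identification numbers for classification. As a minimal sketch of one way such a fusion can be wired up (the paper's exact architecture is not specified here, so the late-fusion design, layer sizes, and all identifiers below are illustrative assumptions rather than the authors' model), a PyTorch classifier might embed the ID characters, extract image features with a small CNN, and concatenate the two modalities before a classification head:

# Illustrative sketch only: the abstract does not specify the fusion
# architecture, so this late-fusion design and all names/dimensions
# are assumptions, not the authors' actual model.
import torch
import torch.nn as nn

class MultimodalVesselClassifier(nn.Module):
    def __init__(self, num_classes: int, id_vocab_size: int = 40,
                 id_embed_dim: int = 32, img_feat_dim: int = 128):
        super().__init__()
        # Image branch: a small CNN standing in for any backbone.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, img_feat_dim), nn.ReLU(),
        )
        # ID branch: embed each character of the vessel identification
        # number, then average into a fixed-size vector.
        self.id_embedding = nn.Embedding(id_vocab_size, id_embed_dim,
                                         padding_idx=0)
        # Fusion head: concatenate both modalities, then classify.
        self.classifier = nn.Linear(img_feat_dim + id_embed_dim, num_classes)

    def forward(self, images: torch.Tensor,
                id_tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_branch(images)                # (B, img_feat_dim)
        id_feat = self.id_embedding(id_tokens).mean(dim=1)  # (B, id_embed_dim)
        fused = torch.cat([img_feat, id_feat], dim=1)
        return self.classifier(fused)

# Usage with dummy data: a batch of 4 RGB crops and 10-character IDs.
model = MultimodalVesselClassifier(num_classes=5)
logits = model(torch.randn(4, 3, 64, 64), torch.randint(1, 40, (4, 10)))
print(logits.shape)  # torch.Size([4, 5])

Concatenation-based late fusion as sketched above is the simplest option; attention-based fusion, or feeding the ID branch from OCR output on the detected hull region, are common alternatives the same skeleton could accommodate.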