STEMM Institute Press
Science, Technology, Engineering, Management and Medicine
Document Image Layout Analysis via MASK Constraint
DOI: https://doi.org/10.62517/jbdc.202401204
Author(s)
Jun He1, Hanjie Zheng2, Tianlong Ma1,*
Affiliation(s)
1 East China Normal University, Shanghai, China
2 Vernon Secondary School, Vernon, Canada
* Corresponding author
Abstract
Document layout analysis plays an essential role in computer vision. With the development of deep learning, a growing number of deep learning methods have been proposed to address the challenges of document layout analysis. The two mainstream approaches are based on semantic segmentation and on object detection. Compared with segmentation-based methods, detection-based methods are better at preserving the integrity of target objects, especially since the introduction of Mask R-CNN. However, document layout analysis differs from general object detection: it involves a particular semantic gap (i.e., the image to be analyzed may itself contain text), and Mask R-CNN does not handle this well. We therefore design a hierarchical information augmentation module, which fully utilizes low-dimensional detail information while maintaining high-dimensional semantic information. In addition, we propose a novel MASK-constrained module, which embeds MASK information in the input image so that the global semantic information of the input can be further mined. Furthermore, to address the overlapping bounding boxes produced by Mask R-CNN, we propose a Constrained Aggregation method. Finally, we validate our approach on benchmark datasets with complex layouts, such as DSSE-200 and FPD. The results show that the proposed method yields significant performance gains.
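As a rough illustration of the overlapping-box problem mentioned above, the following Python sketch fuses detections whose intersection over union (IoU) exceeds a threshold. It is a generic stand-in, not the Constrained Aggregation method proposed in this paper, whose details the abstract does not specify. The function names `iou` and `merge_overlapping`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_overlapping(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Fuse any pair of boxes with IoU above the threshold into their
    enclosing box, repeating until no such pair remains.

    Generic illustration only; the threshold value is an assumption.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if iou(merged[i], merged[j]) > threshold:
                    a, b = merged[i], merged[j]
                    # Replace the pair with the smallest box enclosing both.
                    merged[i] = (
                        min(a[0], b[0]),
                        min(a[1], b[1]),
                        max(a[2], b[2]),
                        max(a[3], b[3]),
                    )
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

For example, `merge_overlapping([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)])` fuses the first two boxes (IoU of about 0.68) into `(0, 0, 11, 11)` and leaves the third untouched.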
Keywords
Document Layout Analysis; Computer Vision; Semantic Segmentation; Object Detection