A Hybrid Extraction Framework for Job Postings Based on XPath and NER
DOI: https://doi.org/10.62517/jike.202604210
Author(s)
Yongyi Lin1,*, Youjiang Zhou1, Xiaoxu Wei2, Hao Sun1
Affiliation(s)
1School of Intelligent Media Engineering, Communication University of China Nanjing, Nanjing, China
2Department of Information and Intelligent Engineering, Shanghai Publishing and Printing College, Shanghai, China
*Corresponding Author
Abstract
With the rapid growth of online recruitment platforms, large numbers of unstructured job postings are scattered across websites. Diverse page layouts and free-form descriptions of job entities make automated information extraction difficult. Rule-based approaches run efficiently but generalize poorly, whereas named entity recognition (NER) models capture semantics well but incur substantial computational overhead. This paper proposes a two-stage hybrid extraction framework. In the first stage, XPath rules quickly locate the region of the page that contains the posting; in the second stage, a BERT-BiLSTM-CRF model performs fine-grained entity recognition covering company names, job titles, and salary ranges. Experimental results show that the proposed framework outperforms baseline methods in precision, recall, and F1-score, thereby improving the efficiency of recruitment information mining and intelligent talent-job matching.
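The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sample page, the XPath rule, and the function names `locate_region` and `recognize_entities` are all hypothetical, and the second stage uses a simple regular expression as a stand-in for the BERT-BiLSTM-CRF tagger, handling salary ranges only. The sketch also assumes a well-formed page so that the standard-library XML parser can be used; real recruitment pages would likely need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (assumption: real pages are messier).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="nav">Home | Jobs | Login</div>
    <div class="job-detail">
      <h1>Data Engineer</h1>
      <span class="company">Acme Analytics Ltd</span>
      <p class="salary">Salary: 15k-25k per month</p>
    </div>
    <div class="footer">Copyright notice</div>
  </body>
</html>
"""

# Stage 1 rule (assumed, site-specific): the block that holds the posting.
REGION_XPATH = ".//div[@class='job-detail']"


def locate_region(page: str) -> str:
    """Stage 1: use an XPath rule to keep only the text of the job region."""
    root = ET.fromstring(page.strip())
    region = root.find(REGION_XPATH)
    if region is None:
        return ""
    # Collect the region's text nodes and drop layout-only whitespace.
    parts = (text.strip() for text in region.itertext())
    return " | ".join(part for part in parts if part)


# Stand-in for stage 2: matches ranges written like "15k-25k".
SALARY_PATTERN = re.compile(r"(\d+)k\s*-\s*(\d+)k", re.IGNORECASE)


def recognize_entities(text: str) -> dict:
    """Stage 2 placeholder: a regex in place of the BERT-BiLSTM-CRF model.

    Only a SALARY entity is extracted; a trained tagger would also label
    company names and job titles.
    """
    entities = {}
    match = SALARY_PATTERN.search(text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        entities["SALARY"] = (low * 1000, high * 1000)
    return entities


if __name__ == "__main__":
    region_text = locate_region(SAMPLE_PAGE)
    print(region_text)
    print(recognize_entities(region_text))
```

The point of the split is that the comparatively expensive recognizer only sees the short text returned by the first stage, not the navigation bar or footer of the page.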
Keywords
XPath; Named Entity Recognition; Job Postings; Information Extraction; Deep Learning