Journal of Intelligence and Knowledge Engineering (ISSN: 2959-0620), Vol. 4 No. 2 (JIKE 2026)

A Hybrid Extraction Framework for Job Postings Based on XPath and NER

DOI: https://doi.org/10.62517/jike.202604210

Author(s)

Yongyi Lin1,*, Youjiang Zhou1, Xiaoxu Wei2, Hao Sun1

Affiliation(s)

1 School of Intelligent Media Engineering, Communication University of China Nanjing, Nanjing, China
2 Department of Information and Intelligent Engineering, Shanghai Publishing and Printing College, Shanghai, China
* Corresponding Author

Abstract

With the rapid growth of online recruitment platforms, large numbers of unstructured job postings are scattered across websites. Diverse page layouts and free-text descriptions of job entities make automated information extraction difficult. Rule-based approaches are fast but generalize poorly, whereas named entity recognition (NER) models capture semantics well but incur substantial computational overhead. This paper proposes a two-stage hybrid extraction framework that combines the two. In the first stage, XPath rules quickly locate the relevant region of the page; in the second stage, a BERT-BiLSTM-CRF model performs fine-grained entity recognition covering company names, job positions, and salary ranges. Experimental results show that the proposed framework outperforms baseline methods in precision, recall, and F1-score, improving the efficiency of recruitment information mining and intelligent talent-job matching.
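
The two-stage control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the page fragment, the `job-detail` class name, the XPath rule, and the function names are all hypothetical, and the second stage uses a toy regex-and-gazetteer tagger purely as a stand-in for the BERT-BiLSTM-CRF model. The first stage relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`, which assumes well-formed markup; real pages would typically need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (label, surface text)

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div class="nav">Home | Jobs | Login</div>
  <div class="job-detail">
    <h1>Senior Data Engineer</h1>
    <p class="company">Acme Analytics Ltd.</p>
    <p class="desc">Salary: 15k-25k per month. Build data pipelines.</p>
  </div>
</body></html>
"""

# Stage 1 rule: assumed location of the posting region on this site.
REGION_XPATH = ".//div[@class='job-detail']"


def locate_region(page: str) -> str:
    """Stage 1: use an XPath rule to isolate the posting region as plain text."""
    root = ET.fromstring(page.strip())
    node = root.find(REGION_XPATH)
    if node is None:
        return ""
    # Collapse runs of whitespace left over from the markup layout.
    return " ".join(" ".join(node.itertext()).split())


# Toy resources for the placeholder tagger (assumptions, not from the paper).
SALARY_RE = re.compile(r"\d+k\s*-\s*\d+k")
GAZETTEER = {
    "COMPANY": ["Acme Analytics Ltd."],
    "POSITION": ["Senior Data Engineer"],
}


def toy_tagger(text: str) -> List[Entity]:
    """Stage 2 placeholder: stands in for a trained BERT-BiLSTM-CRF model."""
    entities = [("SALARY", m.group(0)) for m in SALARY_RE.finditer(text)]
    for label, names in GAZETTEER.items():
        entities += [(label, name) for name in names if name in text]
    return entities


def extract(page: str, tagger: Callable[[str], List[Entity]] = toy_tagger) -> List[Entity]:
    """Run stage 1, then pass only the located region to the stage 2 tagger."""
    region = locate_region(page)
    return tagger(region) if region else []


if __name__ == "__main__":
    for label, text in extract(PAGE):
        print(label, "->", text)
```

Passing the tagger in as a parameter keeps the sketch honest about what is swapped out: a trained sequence-labeling model could replace `toy_tagger` without changing stage 1, and the expensive model only ever sees the text that the XPath rule has already narrowed down.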

Keywords

XPath; Named Entity Recognition; Job Postings; Information Extraction; Deep Learning
