Research on BERT-Based Chinese Offensive Language Detection
DOI: https://doi.org/10.62517/jbdc.202501417
Author(s)
Xubo Zhang, Rouyi Fan, Xiaofeng Li*
Affiliation(s)
School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou, Henan, China
*Corresponding Author
Abstract
Conventional dictionary- and rule-based approaches have become increasingly inadequate for detecting offensive Chinese language in real-world settings. To address this problem, this paper explores the application of the BERT model to Chinese offensive language identification. The BERT-Base-Chinese pre-trained model is fine-tuned on a consolidated Chinese offensive language dataset after analysis and preprocessing. A parallel multi-task learning architecture shares low-level feature representations and adds two independent classification heads, one for topic classification and one for offensive language identification. This design effectively improves the model's overall performance. The research provides strong support and a practical foundation for building safer and more reliable language generation systems.
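To make the described architecture concrete, the following is a minimal sketch of a shared-encoder, two-head multi-task setup, assuming PyTorch and the Hugging Face transformers library; the class name, head dimensions, and label counts are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class MultiTaskBert(nn.Module):
    """Shared BERT encoder with two parallel classification heads:
    one for topic classification, one for offensive language detection."""

    def __init__(self, num_topics: int = 4, num_offense_labels: int = 2):
        super().__init__()
        # Shared low-level representation: the pre-trained Chinese BERT encoder.
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.encoder.config.hidden_size
        # Two independent task-specific heads on top of the shared [CLS] vector.
        self.topic_head = nn.Linear(hidden, num_topics)
        self.offense_head = nn.Linear(hidden, num_offense_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.topic_head(cls), self.offense_head(cls)


tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = MultiTaskBert()

# Example Chinese input (an arbitrary, innocuous sentence).
batch = tokenizer(["这是一条示例文本"], return_tensors="pt",
                  padding=True, truncation=True)
topic_logits, offense_logits = model(batch["input_ids"], batch["attention_mask"])

# Joint training would typically sum per-task cross-entropy losses, e.g.:
# loss = ce(topic_logits, topic_labels) + ce(offense_logits, offense_labels)
```

In this kind of setup the encoder parameters are updated by gradients from both heads, which is one common way to realize the shared low-level feature representation described in the abstract.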
Keywords
Deep Learning; BERT Model; Data Mining; Offensive Language Detection