Journal of Big Data and Computing (ISSN: 2959-0590), Vol. 3 No. 4 (JBDC 2025)

PCA-Integrated LightGBM and XGBoost Model for Pattern Recognition and Interpretability Analysis of Telecom Fraud

DOI: https://doi.org/10.62517/jbdc.202501407

Author(s)

Donghao Li*, Xiaohan Wang, Xiaoyu Lu, Bing He

Affiliation(s)

Henan University of Technology, Zhengzhou, Henan, China
*Corresponding Author

Abstract

To address the increasing complexity of telecom network fraud and issues such as high-dimensional, imbalanced data, this paper proposes an integrated model based on LightGBM and XGBoost. The predictions of the two models are fused using Principal Component Analysis (PCA), and model interpretability is enhanced through SHAP values. First, raw transaction data are preprocessed and subjected to feature engineering. Model parameters are then optimized via cross-validation, yielding a fraud detection pathway of "identification–interpretation–integration". Experimental results show that the PCA-fused model outperforms the individual models in both detection performance and interpretability, providing an effective intelligent solution for accurate telecom fraud detection.
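
The abstract does not spell out how PCA fuses the two models' predictions. One plausible reading, sketched below under stated assumptions, is to stack the per-model fraud scores as columns and project them onto their first principal component. The function name `pca_fuse`, the synthetic scores standing in for LightGBM and XGBoost outputs, and the final rescaling to [0, 1] are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def pca_fuse(scores: np.ndarray) -> np.ndarray:
    """Fuse per-model scores of shape (n_samples, n_models) into one score
    by projecting onto the first principal component (illustrative sketch)."""
    centered = scores - scores.mean(axis=0)
    # The right singular vectors of the centered matrix are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    # The sign of a principal axis is arbitrary; orient it so that
    # higher fused values correspond to higher model scores.
    if pc1.sum() < 0:
        pc1 = -pc1
    projected = centered @ pc1
    # Assumption: rescale to [0, 1] so the fused score can be thresholded.
    span = projected.max() - projected.min()
    if span == 0:
        return np.full(len(projected), 0.5)
    return (projected - projected.min()) / span


# Hypothetical stand-ins for LightGBM and XGBoost predicted fraud probabilities.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
lgbm_scores = np.clip(labels * 0.6 + rng.normal(0.2, 0.15, 200), 0, 1)
xgb_scores = np.clip(labels * 0.5 + rng.normal(0.25, 0.15, 200), 0, 1)

fused = pca_fuse(np.column_stack([lgbm_scores, xgb_scores]))
```

In practice the two score columns would come from the trained models' predicted probabilities on a held-out set, and the principal axis would be estimated on validation data and then reused at prediction time.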

Keywords

Telecom Fraud; Ensemble Learning; PCA Fusion; SHAP Interpretation; LightGBM; XGBoost
