Applications and Challenges of Artificial Intelligence in Data Cleaning
DOI: https://doi.org/10.62517/jike.202604215
Author(s)
Pusen Gao
Affiliation(s)
The University of Melbourne, VIC 3000, Melbourne, Australia
Abstract
Data quality issues have long been a key factor affecting model performance and the reliability of decision-making. As a critical component of data preprocessing, data cleaning is gradually shifting from traditional rule-based and statistical methods toward AI-driven automated approaches. This paper provides a systematic review of the application of artificial intelligence in data cleaning. It classifies and analyzes relevant methods in three categories (machine learning, deep learning, and large language models) and introduces an evaluation method based on data quality dimensions. The paper compares the different technologies in terms of completeness, accuracy, and consistency, and summarizes their effectiveness in tasks such as anomaly detection, missing value imputation, and entity matching.
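To make the three data quality dimensions concrete, the following is a minimal sketch of how completeness, accuracy, and consistency might be scored on a small table. It is an illustration under stated assumptions, not the evaluation method of the paper: the function names, the sample records, the use of `None` for missing values, and the "Melbourne implies VIC" rule are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]  # None marks a missing cell (assumption)


def completeness(rows: List[Row], columns: List[str]) -> float:
    """Share of cells that are not missing."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for r in rows for c in columns if r.get(c) is not None)
    return filled / total


def accuracy(rows: List[Row], truth: List[Row], columns: List[str]) -> float:
    """Share of non-missing cells that equal a trusted reference value."""
    compared = 0
    matched = 0
    for r, t in zip(rows, truth):
        for c in columns:
            if r.get(c) is None:
                continue  # missing cells are counted under completeness
            compared += 1
            if r[c] == t.get(c):
                matched += 1
    return matched / compared if compared else 1.0


def consistency(rows: List[Row], rule: Callable[[Row], bool]) -> float:
    """Share of rows that satisfy an integrity rule."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if rule(r)) / len(rows)


# Hypothetical dirty table and its hand-checked reference version.
columns = ["city", "state", "age"]
dirty: List[Row] = [
    {"city": "Melbourne", "state": "VIC", "age": 34},
    {"city": "Melbourne", "state": "NSW", "age": None},
    {"city": "Sydney", "state": "NSW", "age": 290},
    {"city": None, "state": "VIC", "age": 41},
]
truth: List[Row] = [
    {"city": "Melbourne", "state": "VIC", "age": 34},
    {"city": "Melbourne", "state": "VIC", "age": 27},
    {"city": "Sydney", "state": "NSW", "age": 29},
    {"city": "Geelong", "state": "VIC", "age": 41},
]


def melbourne_in_vic(row: Row) -> bool:
    """Illustrative dependency rule: city 'Melbourne' implies state 'VIC'."""
    return row.get("city") != "Melbourne" or row.get("state") == "VIC"


scores = {
    "completeness": completeness(dirty, columns),
    "accuracy": accuracy(dirty, truth, columns),
    "consistency": consistency(dirty, melbourne_in_vic),
}
print(scores)
```

Scoring a table before and after a cleaning method is applied gives a simple way to compare methods on the same dimensions. Note that accuracy in this sketch requires a reference table, which is often unavailable in practice.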
Keywords
Data Cleaning; Artificial Intelligence; Machine Learning; Deep Learning; Large Language Models; Data Quality; Missing Value Imputation; Anomaly Detection
References
[1] Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD.
[2] Wang, Z., Akande, O., Poulos, J., & Li, F. (2021). Are deep learning models superior for missing data imputation in large surveys? arXiv.
[3] Camino, R. D., Hammerschmidt, C. A., & State, R. (2019). Improving missing data imputation with deep generative models. arXiv.
[4] Zhu, Y., et al. (2024). Large language models for data cleaning. arXiv.
[5] Lee, J., et al. (2025). Leveraging large language models for clinical data cleaning. arXiv.
[6] Little, R. J. A., & Rubin, D. B. (2019). Statistical analysis with missing data (3rd ed.). Wiley.
[7] Côté, P.-O., et al. (2024). Data cleaning and machine learning: A systematic literature review. arXiv.
[8] Rekatsinas, T., et al. (2017). HoloClean: Holistic data repairs with probabilistic inference. VLDB.
[9] Hildebrandt, T., et al. (2016). DataXFormer: Data cleaning and transformation for data integration. IEEE ICDE.