Science, Technologies, Innovations №3(35) 2025, pp. 61-69

http://doi.org/10.35668/2520-6524-2025-3-07

Melnyk A. O. — PhD Student, Taras Shevchenko National University of Kyiv, 4d Akademika Glushkova Avenue, Kyiv, Ukraine, 02000; anastasiia.melnyk@knu.ua; ORCID: 0000-0002-3167-4353

Rozora I. V. — D. Sc. in Physics and Mathematics, Taras Shevchenko National University of Kyiv, 4d Akademika Glushkova Avenue, Kyiv, Ukraine, 02000; +38 (044) 521-35-35; irozora@knu.ua; ORCID: 0000-0002-8733-7559

COMPARATIVE ANALYSIS OF IMPUTATION METHODS IN MACHINE LEARNING MODELS

Abstract. Missing data is a prevalent issue in machine learning and data analysis that affects the credibility and performance of predictive models. This article provides a comprehensive study of missing data, its types and consequences, and popular imputation methods. Using real datasets, we compare the performance of Mean/Median Imputation, K-Nearest Neighbors (KNN) Imputation, Multiple Imputation, Regression Imputation, and Hot Deck Imputation. We further study how these imputation techniques affect machine learning models such as Random Forest, Gradient Boosting Machines (GBM), and Support Vector Machines (SVM). Our study emphasizes the need for careful experimentation and model-specific investigation when handling missing data: the imputation technique should be selected based on the attributes of the dataset and on the machine learning model used. Our findings underscore the importance of tailored imputation strategies for improving model fit and obtaining stable analytical results.

Keywords: missing data, imputation methods, machine learning, evaluation metrics, predictive models.

