Informatics, Vol. 10, Issue 4 (2023)

A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities

Fan Zhang    
Melissa Petersen    
Leigh Johnson    
James Hall    
Raymond F. Palmer    
Sid E. O'Bryant, on behalf of the Health and Aging Brain Study (HABS-HD) Study Team

Abstract

The Health and Aging Brain Study–Health Disparities (HABS-HD) project seeks to understand the biological, social, and environmental factors that impact brain aging among diverse communities. A common issue for HABS-HD is missing data. It is impossible to achieve accurate machine learning (ML) if data contain missing values, so developing a new imputation methodology has become an urgent task for HABS-HD. The three missing data mechanisms, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), each call for a distinct imputation approach. Several popular methods for handling missing data, including listwise deletion, minimum and mean substitution, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may result in biased outcomes and reduced statistical power in downstream analyses such as testing hypotheses about clinical variables or using machine learning to predict Alzheimer's disease (AD) or mild cognitive impairment (MCI). Moreover, these commonly used techniques can produce unreliable estimates of missing values if they do not account for the missingness mechanism or if the imputation method is inconsistent with the missing data mechanism in HABS-HD. Therefore, we proposed a three-step workflow to handle missing data in HABS-HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS-HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted them to multiple imputation using simple averaging. Lastly, we evaluated MLMI and compared it with other common methods.
Our results showed that the three-step workflow worked well for handling missing values in HABS-HD and that the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and of change in distribution and correlation. The choice of method for handling missing data has a significant impact on the accompanying statistical analyses of HABS-HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer's disease models. They can also be applied to other disease data analyses.
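The three-step workflow described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' MLMI implementation: a bootstrap least-squares regression stands in for the SVM, RF, XGB, and GLMNET models, missingness is simulated as MCAR in a single column, and every function and variable name (`missing_rate`, `impute_column_averaged`, `rmse`, and so on) is hypothetical. Only the overall shape is taken from the abstract: evaluate missingness, impute by averaging several model-based imputations, then compare against a simple baseline.

```python
import numpy as np


def missing_rate(X):
    """Step 1 (missing data evaluation): fraction of NaN entries per column."""
    return np.isnan(X).mean(axis=0)


def impute_column_averaged(X, target, n_imputations=5, seed=0):
    """Step 2 (imputation): fill one column with the simple average of
    predictions from several regressions, each fit on a bootstrap resample.

    Assumption: all predictor columns are fully observed. Linear least
    squares is a stand-in for the ML models named in the abstract.
    """
    rng = np.random.default_rng(seed)
    predictors = [j for j in range(X.shape[1]) if j != target]
    miss = np.isnan(X[:, target])
    obs_idx = np.flatnonzero(~miss)
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(X.shape[0]), X[:, predictors]])
    preds = []
    for _ in range(n_imputations):
        boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        coef, *_ = np.linalg.lstsq(A[boot], X[boot, target], rcond=None)
        preds.append(A[miss] @ coef)
    X_imp = X.copy()
    # Simple averaging across the individual imputations.
    X_imp[miss, target] = np.mean(preds, axis=0)
    return X_imp


def rmse(a, b):
    """Step 3 (imputation evaluation): root mean squared error."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic complete data: column 2 depends linearly on columns 0 and 1.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.3, size=n)
X_true = np.column_stack([x1, x2, y])

# Simulate MCAR: each value in column 2 is dropped with probability 0.2,
# independently of any observed or unobserved value.
mask = rng.random(n) < 0.2
X_miss = X_true.copy()
X_miss[mask, 2] = np.nan

rates = missing_rate(X_miss)

# Averaged model-based imputation versus a mean-substitution baseline.
X_mlmi = impute_column_averaged(X_miss, target=2)
X_mean = X_miss.copy()
X_mean[mask, 2] = np.nanmean(X_miss[:, 2])

rmse_model = rmse(X_mlmi[mask, 2], X_true[mask, 2])
rmse_mean = rmse(X_mean[mask, 2], X_true[mask, 2])
print("missing rate per column:", rates)
print("RMSE, averaged model imputation:", rmse_model)
print("RMSE, mean substitution:", rmse_mean)
```

Because the true values are known in this simulation, the imputations can be scored directly. On real data such as HABS-HD, a comparable check would require masking values that were actually observed, and the abstract's other criteria (change in distribution and correlation) would need their own measures.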
