Colón Vargas, Mónica

Publication Search Results

  • Publication
    Applied predictive modeling on personal key indicators of heart disease for residents of the United States
    (2023-05-11) Colón Vargas, Mónica; Lorenzo González, Edgardo; College of Arts and Sciences - Sciences; Santana Morant, Dámaris; Colón Ramírez, Silvestre; Department of Mathematics; Andrade Rengifo, Fabio
    Heart disease (HD) is one of the leading causes of death in the United States. According to the CDC, approximately 697,000 people in the US died from HD in 2020. Many studies have identified risk factors for HD, including high cholesterol, smoking, high blood pressure, obesity, diabetes, and kidney disease. Additionally, demographic factors such as race and sex have been found to be associated with the risk of HD. In this study, the goal is to apply predictive modeling tools to build a model that predicts HD from a set of personal key indicators. Because an individual can self-measure the predictors, the resulting model allows HD to be predicted without a medical exam. The HD data set was obtained from the Centers for Disease Control and Prevention (CDC) and consists of n = 319,795 observations and p = 17 predictors. The data set is highly imbalanced, which makes it harder to obtain a good model. To address the imbalance, re-sampling techniques (upsampling, SMOTE, and downsampling) were applied to obtain a balanced data set. After the re-sampling techniques were evaluated, upsampling was chosen. Simple models were first fit to the data to predict HD, and then different techniques were tried to improve them: bagging, boosting, and regularization, which introduce bias into the model and decrease its variance. The methods used to model HD consisted of four linear methods (logistic regression, Ridge, LASSO, and Elastic Net logistic regression) and four tree-based methods (Decision Tree, Random Forests, AdaBoost, and XGBoost). Among the methods considered, XGBoost was the most effective at predicting HD in terms of AUC. Therefore, a weighted XGBoost model was fitted to the data without the upsampling technique, in order to study how weights affect the imbalance in the data set.
It was concluded that a weighted XGBoost model is more effective at predicting HD with the variables studied.
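
The three ideas the abstract relies on (random upsampling of the minority class, a class weight used in place of re-sampling, and AUC as the comparison metric) can be sketched with the Python standard library alone. This is an illustrative sketch, not the study's code: the function names `upsample_minority`, `positive_class_weight`, and `auc` are hypothetical, the toy data is made up for the example, and the negative-to-positive ratio is only one common heuristic for a positive-class weight (the abstract does not say how the weights were chosen).

```python
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Resample minority classes with replacement until every class
    has as many rows as the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_n = counts.most_common(1)[0]
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        if label == majority_label:
            continue
        pool = [r for r, y in zip(rows, labels) if y == label]
        extra = [rng.choice(pool) for _ in range(majority_n - n)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


def positive_class_weight(labels, positive=1):
    """Assumed heuristic: weight positives by (#negatives / #positives)."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def auc(scores, labels, positive=1):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data: 8 negatives, 2 positives (made up for illustration).
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.75, 0.9]

balanced_rows, balanced_labels = upsample_minority(toy_rows, toy_labels)
print(Counter(balanced_labels))           # class counts after upsampling
print(positive_class_weight(toy_labels))  # weight computed on the original data
print(auc(toy_scores, toy_labels))        # ranking quality of the toy scores
```

Upsampling changes the training data itself, while a class weight leaves the data untouched and instead makes errors on the rare class cost more during fitting. In XGBoost, a ratio like the one above is typically passed as the `scale_pos_weight` parameter, though whether the study used that parameter is an assumption here.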