In the current study, machine learning algorithms were applied to predict hospital mortality from the demographics, clinical predictors, comorbidities, and biochemical markers of patients with COVID-19. The two-component SIMPLS-based prediction model had moderate predictive power (Q2 = 0.24) for hospital mortality and showed high accuracy (AUC score of 0.91–0.95) in the training and validation sets of the patient cohort. The model was built on 18 clinical and comorbidity variables and 3 paraclinical biochemical markers, and it uncovered highly differentiating predictors, some of which had not been recognized by conventional statistical methods. CAD showed the highest predictive importance for in-hospital death, followed by diabetes, age > 65, altered mental status (AMS), dementia, and O2 saturation < 88%. In addition, LCA clustering successfully identified high- and low-risk clusters in COVID-19 survivors. The clusters were discriminated by a model with high predictive power (Q2 = 0.69). Age < 65, absence of hypertension, and absence of diabetes were highly correlated with a lower rate of mortality among survivors, whereas nursing home residence, age > 65, AMS, stroke, atrial fibrillation, CAD, and dementia were risk factors for in-hospital mortality in COVID-19 survivors. Multivariate analysis identified several highly differentiating predictors that were not selected by the univariate method (Table 1), such as yno2, dyspnea, alcohol, O2 saturation, and stroke. Moreover, the multivariate analysis weighted the clinical predictors by their importance in the prediction model (VIP), which is an advantage of multivariate over univariate analysis. 
On the other hand, acute MI, CHF, O2 flow rate (lpm), FiO2, and blood pressure differed significantly between the two groups but were not selected as highly differentiating predictors by SIMPLS. Combining paraclinical data with patient demographics and comorbidities significantly improved the prediction of hospital mortality; on their own, patient demographics and comorbidities or paraclinical data were poor predictors of hospital mortality. Lactate, CRP, and prothrombin were the most heavily weighted biochemical variables contributing to the prediction of hospital mortality.
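To make the two model statistics discussed above concrete, the following minimal Python sketch computes Q2 as 1 - PRESS/TSS from cross-validated predictions, and VIP scores from the weight vectors of a fitted PLS-type model. It assumes the standard textbook definitions of both quantities. The function names, argument layout, and toy numbers are hypothetical and are not taken from the software or data used in this study.

```python
import math


def q2(observed, cv_predicted):
    """Cross-validated Q2 = 1 - PRESS / TSS (textbook definition, assumed here)."""
    mean_obs = sum(observed) / len(observed)
    # PRESS: squared error of predictions made for held-out samples
    press = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    # TSS: total variation of the observed response around its mean
    tss = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / tss


def vip(weights, explained_ss):
    """Variable importance in projection.

    weights[a][j]   : weight of predictor j on component a
    explained_ss[a] : response sum of squares explained by component a
    """
    n_predictors = len(weights[0])
    total_ss = sum(explained_ss)
    scores = []
    for j in range(n_predictors):
        acc = 0.0
        for w_a, ss_a in zip(weights, explained_ss):
            norm_sq = sum(w * w for w in w_a)  # normalise each weight vector
            acc += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(n_predictors * acc / total_ss))
    return scores


# Toy illustration only (hypothetical values, not study data).
toy_q2 = q2([0, 1, 0, 1], [0.2, 0.7, 0.1, 0.6])
toy_vip = vip([[3.0, 4.0, 0.0], [1.0, 0.0, 0.0]], [2.0, 1.0])
print(toy_q2, toy_vip)
```

By construction, the squared VIP scores sum to the number of predictors, so a predictor with a VIP above 1 contributes more than the average predictor to the model.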
Several other studies have been published on the development of COVID-19 mortality prediction models. In a large cohort, Yadaw et al. developed a highly accurate (AUC = 0.91) ML-based mortality prediction model using the patient’s age, O2 saturation throughout the medical encounter, and type of patient encounter (inpatient versus outpatient and telehealth visits) [14]. Age and minimum O2 saturation during the encounter were the most predictive factors, which is in line with our results. Individuals aged 60 years and older represent nearly 85% of all deaths in COVID-19 hot spots across the USA [15]. Not surprisingly, the severity of hypoxia at presentation has been extensively reported as a significant indicator of the severity of illness, specifically in acute respiratory distress syndrome, and there is strong justification for treating it as an important predictive factor in the clinical course of COVID-19 [16, 17]. Although the development and validation datasets were larger in that study, the data were limited to those routinely collected during hospital encounters and did not include a comprehensive list of demographics, comorbidities, biochemical tests, imaging, and omics data. Additionally, despite the large datasets, the number of deaths was small. Knight et al. conducted a large prospective cohort study evaluating an 8-item scoring system (score range 0–21 points) for in-hospital mortality due to COVID-19 [18]. The variables included age, gender, number of comorbidities, respiratory rate, O2 saturation, level of consciousness, urea level, and CRP. This scoring system showed high discrimination for mortality (derivation cohort: AUC 0.79; validation cohort: 0.77); however, some potentially relevant comorbidities such as hypertension, previous myocardial infarction, and stroke were not included in data collection. 
Moreover, given the 32.2% mortality rate and the elderly patient population (median age of 73 years), this model could perform differently in younger patients and/or populations at lower risk of death.
LASSO and multivariate data analysis-based prediction models showed that higher age, coronary heart disease (CHD), percentage of lymphocytes (LYM%), procalcitonin (PCT), urea, CRP, and D-dimer (DD) are potential risk factors for COVID-19 mortality. These variables could classify COVID-19 patients into low- and high-risk groups using a good prediction model (AUC = 0.91) [19].
Considerable heterogeneity exists among COVID-19 mortality prediction models. Unlike our results, which showed that paraclinical and biochemical data have limited predictive value, in the model developed by Zhao et al. (AUC 0.83), lactate dehydrogenase and procalcitonin were among the top mortality prediction factors [20], and the COVID-AID study showed that renal failure at presentation (defined by creatinine > 2 mg/dL), regardless of chronicity, has a high impact on in-hospital mortality in hospitalized COVID-19 patients [21]. Recent studies have reported that prothrombin and CRP are associated with COVID-19 severity and mortality [22, 23]. In this study, we showed that decreased O2 correlated with increased lactate, which may indicate a higher level of anaerobic metabolism [24] in patients with COVID-19 and is associated with mortality.
In late April 2020, a systematic review and meta-analysis showed significantly higher rates of hypertension, diabetes, cardiovascular disease, and respiratory disease in critically ill COVID-19 patients than in non-critical patients [25]. Subsequently, another systematic review and meta-analysis on risk factors for predicting mortality in COVID-19 patients demonstrated that dyspnea, chest tightness, hemoptysis, expectoration, and fatigue were the clinical variables most significantly associated with an increased risk of COVID-19 mortality. That study also showed a significantly increased leukocyte count and decreased lymphocyte count in non-survivors [26]. ML was successfully applied to determine COVID-19 severity by predicting the need for ICU admission (AUC = 0.80) and the need for mechanical ventilation (AUC = 0.82) [27]. Random forest analysis showed that PCT, DD, CRP, respiratory rate, SpO2, albumin, AST/SGOT, calcium, influenza-like symptoms, and ALT/SGPT were the most important variables for predicting the need for ICU admission, whereas CRP, DD, PCT, SpO2, respiratory rate, creatinine, total protein, albumin, calcium, and age were the most important variables for predicting the need for mechanical ventilation [27]. In a similar study, SpO2/FiO2, CRP, estimated glomerular filtration rate (eGFR), age, Charlson score, lymphocyte count, and PCT were the most important variables for predicting COVID-19 severity [28]. A LASSO-based prediction model showed that lymphocyte percentage, lactic dehydrogenase (LDH), neutrophil count, and DD, in combination with four quantitative CT findings (pneumonia percentage in the lateral basal segment of the left lower lung, the volume of the whole lung with a density of -300 to -200 HU, pneumonia volume in both lungs, and pneumonia volume in the right lung), may be the most important variables for prognosticating critical illness risk in hospitalized patients with COVID-19 pneumonia [29]. 
In a cohort from New York, age, PCT, CRP, LDH, DD, and lymphocytes were the top mortality predictors, and PCT, LDH, CRP, O2 saturation, temperature, and ferritin were important predictors of the need for ICU admission, with AUCs of 0.89 and 0.79, respectively [30].
Leon et al. applied an ML approach to cluster patients with COVID-19 into 3 groups with high, moderate, and low rates of mortality. This study showed that higher and lower levels of AST, ALT, LDH, CRP, and neutrophil count were associated with higher and lower rates of mortality, respectively [31]. The percentages of monocytes and lymphocytes were negatively correlated with mortality [31]. Unlike our results, the study by Leon et al. showed that age, sex, and comorbidities did not contribute to the above clustering model [31].
The strengths of our study include the assessment of a comprehensive list of demographic, clinical, and paraclinical variables at all stages of hospitalization (admission, hospital stay, and discharge); the development of an internally validated, accurately discriminating in-hospital mortality prediction model; the identification of high-risk and low-risk clusters of COVID-19 patients whose healthcare needs differ; and the enrollment of PCR-proven cases of SARS-CoV-2 infection rather than possible COVID-19 patients. SIMPLS is considered a suitable multivariate method for investigating large and complex datasets that have a relatively small sample size and many variables [32]. External validation in an independent cohort may make the results more generalizable and applicable to other cohorts. The findings of this study may improve the prognostication of COVID-19 mortality, the classification of low- and high-risk patients, and the identification of potential risk factors.
Our study has a few limitations. First, this is a single-center retrospective study, which might affect data quality and generalizability. Second, although we had an acceptable sample size, the number of deaths was small (n = 31). This is a concern mainly because the number of predictor parameters considered by ML approaches usually exceeds that for regression, even when the same set of predictors is applied, especially since multiple interaction terms are routinely examined and continuous predictors are often categorized. Therefore, ML methodologies require “big data” to minimize overfitting in the resulting models and to realize their potential advantages (i.e., handling highly nonlinear relations and complex interactions).
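The scale of the second limitation can be illustrated with a short calculation of events per predictor. The sketch below assumes that the 21 variables in the final model (18 clinical and comorbidity variables plus 3 biochemical markers) stand in for the predictor count; the number of candidate predictors screened was likely larger, so the true ratio may be lower still. The function name is hypothetical.

```python
def events_per_predictor(n_events: int, n_predictors: int) -> float:
    """Ratio of outcome events (here, deaths) to predictors in the model."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


n_deaths = 31                 # deaths reported in this cohort
n_model_predictors = 18 + 3   # assumption: predictors in the final model only

epp = events_per_predictor(n_deaths, n_model_predictors)
print(f"{epp:.2f} events per predictor")  # prints "1.48 events per predictor"
```

A ratio of about 1.5 is far below the frequently quoted heuristic of roughly 10 events per predictor for regression models (a general rule of thumb, not a result of this study), which is consistent with the overfitting concern raised above.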