Predicting neurological outcome after out-of-hospital cardiac arrest with cumulative information; development and internal validation of an artificial neural network algorithm

Background Prognostication of neurological outcome in patients who remain comatose after cardiac arrest resuscitation is complex. Clinical variables, as well as biomarkers of brain injury, cardiac injury, and systemic inflammation, all yield some prognostic value. We hypothesised that cumulative information obtained during the first three days of intensive care could produce a reliable model for predicting neurological outcome following out-of-hospital cardiac arrest (OHCA) using artificial neural networks (ANNs) with and without biomarkers. Methods We performed a post hoc analysis of 932 patients from the Target Temperature Management trial. We focused on comatose patients at 24, 48, and 72 h post-cardiac arrest and excluded patients who were awake or deceased at these time points. 80% of the patients were allocated for model development (training set) and 20% for internal validation (test set). To investigate the prognostic potential of different levels of biomarkers (clinically available and research-grade), patients' background information, and intensive care observations and treatment, we created three models for each time point: (1) clinical variables, (2) adding clinically accessible biomarkers, e.g., neuron-specific enolase (NSE), and (3) adding research-grade biomarkers, e.g., neurofilament light (NFL). Patient outcome was the dichotomised Cerebral Performance Category (CPC) at six months; a good outcome was defined as CPC 1–2 whilst a poor outcome was defined as CPC 3–5. The area under the receiver operating characteristic curve (AUROC) was calculated for all test sets. Results The AUROC remained below 90% when using only clinical variables throughout the first three days in the ICU. When clinically accessible biomarkers such as NSE were added, the AUROC increased from 82 to 94% (p < 0.01). The prognostic accuracy remained excellent from day 1 to day 3, with an AUROC of approximately 95%, when research-grade biomarkers were added.
The models which included NSE after 72 h and NFL on any of the three days had a low risk of false-positive predictions while retaining a low number of false-negative predictions. Conclusions In this exploratory study, ANNs provided good to excellent prognostic accuracy in predicting neurological outcome in comatose patients post OHCA. The models which included NSE after 72 h and NFL on all days showed promising prognostic performance. Supplementary Information The online version contains supplementary material available at 10.1186/s13054-021-03505-9.


Introduction
Estimating the prognosis for long-term neurological recovery in patients who remain comatose during the first few days after resuscitation from a cardiac arrest is a common and important part of intensive care. Patients' background, cardiac arrest characteristics, vital signs on hospital admission, and findings from diagnostic investigations all contribute to making prognostication complex [1]. There is a need for robust and reliable methods to analyse data and assist in prognostication when both a full recovery and severe neurological deficits are possible long-term clinical outcomes.
The prognostication process should, according to the latest guidelines, be deferred for at least 72 h after the return of spontaneous circulation (ROSC) and should be multimodal [2,3]. The clinical neurological examination is the foundation of this process and is supported by electroencephalography (EEG), somatosensory-evoked potentials (SSEP), neuroradiological imaging, and one biomarker. Although demographic and clinical variables carry important prognostic information, none are included in current algorithms [4].
In recent years, a number of biomarkers have emerged which could potentially improve current algorithms for the prediction of neurological outcome. Today, only neuron-specific enolase (NSE) is included in the guidelines [4]. Amongst many novel biomarkers that have been studied for brain injury after cardiac arrest, the most promising so far is neurofilament light (NFL), with an area under the receiver operating characteristic curve (AUROC) of 94–98% for discrimination of long-term neurological outcome as early as 24 h after ROSC [5,6]. Other biomarkers of brain injury, including S100 calcium-binding protein B (S100B), tau protein, glial fibrillary acidic protein (GFAP), and ubiquitin C-terminal hydrolase-L1 (UCHL1), have also shown potential in cardiac arrest prognostication [7–11]. Biomarkers of cardiac injury such as troponin T (TnT), N-terminal pro-B-type natriuretic peptide (BNP) and copeptin, along with biomarkers of systemic inflammation such as procalcitonin (PCT) and interleukin-6 (IL-6), are also associated with neurological outcome [12–16]. Some of these biomarkers are routinely measured in many laboratories, while others are not. Despite substantial efforts to determine promising biomarkers, the prognostic value of combining them and adding them to cumulative clinical data remains unclear [12].
Improvements in machine learning algorithms and increased computational power have led to an enhanced diagnostic and prognostic capability in a variety of medical fields, ranging from radiology to intensive care medicine [17,18]. Machine learning has also shown promising results in short-term and long-term prognostication in survivors of out-of-hospital cardiac arrest (OHCA) [19,20]. In a recently published study, we showed how a supervised machine learning algorithm called artificial neural networks (ANN) was superior to logistic regression when predicting long-term neurological outcome including survival, based on information available on hospital admission [20].
In this extension of our previous study, we hypothesised that cumulative information obtained during the first three days of intensive care could produce a reliable model for predicting neurological outcome post OHCA using ANN with and without biomarkers.

Study population and variables
We included all 939 patients from the primary analysis of the Target Temperature Management (TTM) trial [21], which randomised unconscious OHCA survivors to compare two target temperatures of 33 °C and 36 °C upon ICU admission. Patients from 36 ICUs across Europe and Australia were enrolled between 2010 and 2013. The inclusion criteria were comatose (Glasgow Coma Scale (GCS) ≤ 8) adults (≥ 18 years of age) with a sustained ROSC after resuscitation from OHCA of presumed cardiac cause. The primary outcome was mortality until the end of the trial, which did not differ significantly between the temperature groups [21]. The trial protocol was approved by the ethical committees in each participating country, and informed consent was either waived or obtained from all participants or relatives according to the national legislation, in line with the Helsinki declaration [22,23].
Patients without follow-up at six months or an extensively high number of missing values (> 40 missing values on hospital admission) were excluded (n = 7) from the final data analysis. We studied three time points after cardiac arrest (24, 48, and 72 h). To limit prognostication to comatose patients, we excluded patients who were awake or deceased at these time points (Fig. 1). Awake was defined as a GCS motor response score of 6 (measured daily), where the patient obeys commands for movement, or a Cerebral Performance Category (CPC) of 1–3 at ICU discharge.
All variables from the TTM-trial up to day 3 were included: background information, prehospital and hospital admission records, along with data obtained at 24 h (day 1), 48 h (day 2), and 72 h (day 3). All variables are displayed in Additional file 1: Table S1A; Additional file 2: Table S1B; Additional file 3: Table S1C; Additional file 4: Table S1D in the electronic supplement. Computed tomography (CT), magnetic resonance imaging (MRI), EEG, and SSEP were not included, as these modalities were used in a minority of patients.
The TTM-trial biobank collected blood samples from 29 of the 36 trial sites on days 1, 2, and 3 and comprised approximately 70% of the total TTM-trial patient population. Biomarkers analysed in the biobank were grouped by whether they were considered clinically accessible or research-grade. Three models (A, B, and C) were developed for each of the three days studied, giving a total of nine datasets. To ensure that the prognostic value of the biomarkers would not be weakened by the imputation technique, we excluded patients with missing NSE (level B) or NFL (level C) values on the corresponding day, which resulted in levels B and C having approximately 30% fewer patients in each dataset.

Outcome
The outcome was the dichotomised CPC scale, graded by a blinded assessor in a face-to-face or telephone interview at the six-month follow-up [24]. A good outcome was defined as CPC 1–2 and a poor outcome as CPC 3–5. A good outcome means independence in activities of daily living but may include minor disability. A poor outcome means severe brain injury with dependence on others, coma, or death [25].
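As a minimal illustration of this dichotomisation, the mapping can be written as follows (a sketch only; the helper name is ours, not from the study):

```python
# Hypothetical helper (not from the study) for the outcome definition above:
# CPC 1-2 -> good outcome (0), CPC 3-5 -> poor outcome (1).
def dichotomise_cpc(cpc: int) -> int:
    """Return 1 for poor outcome (CPC 3-5) and 0 for good outcome (CPC 1-2)."""
    if not 1 <= cpc <= 5:
        raise ValueError("CPC must be an integer between 1 and 5")
    return int(cpc >= 3)

# CPC 1..5 maps to [0, 0, 1, 1, 1]
assert [dichotomise_cpc(c) for c in range(1, 6)] == [0, 0, 1, 1, 1]
```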

Model development
To ensure an unbiased model development and independent internal validation, 80% of the patients were randomly allocated to development (training set), and 20% were allocated to validation (test set). The training/test split remained constant throughout the model development. We aimed to create prediction models for long-term neurological outcome based on background, prehospital, and hospital admission data along with available ICU information obtained on day 1, 2 and 3 after admission based on the following levels: A) without biomarkers, B) adding clinically accessible biomarkers and C) adding research-grade biomarkers.
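A fixed 80/20 split of this kind can be sketched as follows (an illustration, not the authors' code; the seed value and variable names are assumptions):

```python
# Sketch of a fixed 80/20 training/test split that stays constant across
# all model levels, as described above. The seed value is an assumption.
import numpy as np

n_patients = 932
rng = np.random.default_rng(seed=42)        # fixed seed -> reproducible split
shuffled = rng.permutation(n_patients)      # random order of patient indices

n_test = round(0.2 * n_patients)            # 20% held out for internal validation
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

assert set(test_idx).isdisjoint(train_idx)  # no patient leaks between sets
```

Holding the split fixed across all nine datasets ensures that no model is evaluated on patients it was trained on.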
For each model, we followed the same plan for variable selection and missing-value imputation. We used an ANN to predict the neurological outcome at six months. An ANN consists of an input layer, a number of hidden layers, and an output layer. These layers consist of nodes which aggregate information from previous layers, transform it, and send it forward to the next layer. The aim is to mimic the complex network of connected neurons in the human nervous system and thereby detect patterns and dependencies between variables and outcome to improve prognostic performance. We used fivefold cross-validation and a Bayesian optimisation algorithm for hyperparameter tuning during development to find the best possible model within the following constraints: 1–3 hidden layers; 5–250 hidden nodes in each layer; a batch size between 4 and 128; a drop-out rate between 0 and 0.5 for the input layer and 0–0.5 for the hidden layers; a fixed learning rate of 10⁻³; and either the rectified linear unit (ReLU) or the hyperbolic tangent as the activation function for the hidden layers. We used either L1, L2, or max-norm regularisation. All networks were trained using early stopping with a patience of 30 epochs and a maximum of 1000 epochs. We used binary cross-entropy as our loss function and the sigmoid activation function for the output layer. We chose the model with the highest mean AUROC across the cross-validation folds for further analysis.
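The network family described above can be illustrated with a minimal forward pass (a conceptual sketch in plain NumPy; the study itself used TensorFlow 2.0, and all sizes, weights, and names here are placeholders, not the trained models):

```python
# Conceptual sketch (not the study's TensorFlow code) of the network family
# described above: hidden layers with tanh/ReLU activations and a sigmoid
# output trained with binary cross-entropy. Weights here are random; in the
# study they were learned with early stopping and cross-validation.
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """One forward pass through the hidden layers to a sigmoid output node."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)            # hidden layer
    logit = h @ weights[-1] + biases[-1]     # output layer (single node)
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid -> P(poor outcome)

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss used during training; eps avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

rng = np.random.default_rng(0)
n_features, n_hidden = 20, 16                # sizes are illustrative only
weights = [rng.normal(size=(n_features, n_hidden)) * 0.1,
           rng.normal(size=(n_hidden, 1)) * 0.1]
biases = [np.zeros(n_hidden), np.zeros(1)]

x = rng.normal(size=(5, n_features))         # five mock patients
p = forward(x, weights, biases)              # predicted risks, shape (5, 1)
loss = binary_cross_entropy(rng.integers(0, 2, size=(5, 1)), p)
```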
After the model development, we applied the Shapley additive explanations (SHAP) algorithm to all models to visualise which variables explained an individual prediction and to understand the relative contribution of each variable. The SHAP algorithm is based on Shapley values, which originate from cooperative game theory, and explains how much a single variable contributes to the difference between the actual prediction and the mean of all predictions. The SHAP algorithm can help explain how a prediction model works and mitigate some of the concerns about "black box" modelling. We created one patient example to illustrate the explanation of a patient-specific prediction on days 1–3 (level C). For all nine models, we calculated the mean of the absolute SHAP values for each variable and displayed it as a bar plot to rank the variables for each model.
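The Shapley idea underlying SHAP can be shown with a tiny self-contained example (this is not the `shap` package; the three-feature risk model and all values are hypothetical): a variable's attribution is its average marginal contribution over all orderings, and the attributions sum to the prediction minus the baseline prediction.

```python
# Minimal illustration (not the 'shap' package) of the Shapley values the
# text describes: average a feature's marginal contribution over all
# orderings in which features are "revealed".
from itertools import permutations
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating feature orderings."""
    n = len(x)
    phi = np.zeros(n)
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = baseline.copy()
        prev = predict(z)
        for j in order:
            z[j] = x[j]                      # reveal feature j
            cur = predict(z)
            phi[j] += cur - prev             # marginal contribution of j
            prev = cur
    return phi / len(orderings)

# Hypothetical 3-feature risk model for demonstration only.
predict = lambda v: 0.2 * v[0] + 0.5 * v[1] - 0.1 * v[2]
x = np.array([2.0, 1.0, 3.0])                # the patient's feature values
baseline = np.array([0.0, 0.0, 0.0])         # reference ("mean") patient
phi = shapley_values(predict, x, baseline)

# Efficiency property: attributions sum to f(x) - f(baseline)
assert abs(phi.sum() - (predict(x) - predict(baseline))) < 1e-9
```

For a linear model like this one, each attribution reduces to the weight times the feature's deviation from the baseline; the real SHAP package approximates these values efficiently for neural networks.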

Statistical analysis methods
All continuous variables are presented as medians with interquartile ranges (IQR). Categorical variables are presented as numbers and percentages. Missing data are presented as percentages. The Mann-Whitney U test was used for comparison between groups of continuous data, and Fisher's exact test was used for categorical data. We evaluated the prediction models using the receiver operating characteristic (ROC) curve and calculated the AUROC for all models based on the test sets. To evaluate the prognostic capability of our models, we calculated a confusion matrix for all test sets, based on the threshold for 100% specificity in the corresponding training set, to display true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) predictions. All p-values were two-tailed, and p < 0.05 was considered significant.
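The thresholding rule can be sketched as follows (our reading of the procedure, not the authors' code; the function names and mock risk values are assumptions): pick the cut-off just above the highest predicted risk among good-outcome patients in the training set, then apply it unchanged to the test set.

```python
# Sketch (not the authors' code) of deriving a 100%-specificity threshold
# on the training set and applying it to a test set.
import numpy as np

def specificity_100_threshold(p_train, y_train):
    """Smallest cut-off classifying no good-outcome training patient as poor."""
    return np.nextafter(np.max(p_train[y_train == 0]), np.inf)

def confusion(p, y, thr):
    """TP/TN/FP/FN counts, where 'positive' means predicted poor outcome."""
    pred = p >= thr
    tp = int(np.sum(pred & (y == 1)))
    tn = int(np.sum(~pred & (y == 0)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    return tp, tn, fp, fn

# Mock predicted risks (illustrative numbers only; y: 1 = poor outcome)
p_train = np.array([0.1, 0.3, 0.8, 0.9]); y_train = np.array([0, 0, 1, 1])
thr = specificity_100_threshold(p_train, y_train)
tp, tn, fp, fn = confusion(np.array([0.2, 0.95]), np.array([0, 1]), thr)
```

Note that zero false positives on the training set does not guarantee zero false positives on the test set, a point the Discussion returns to.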
Statistical analyses were done in R (The R Foundation for Statistical Computing) and Python [27,28]. All ANN models were created using TensorFlow 2.0, an open-source framework developed by Google [29]. The 'BorutaShap' Python package was used for variable selection [26]. The post hoc explanation of the ANN models was based on the 'shap' package in Python [30]. We used the 'pROC' and 'OptimalCutpoints' packages in R when producing the ROC curves and calculating the threshold for the confusion matrix [31,32]. The TRIPOD statement was followed when writing this manuscript [33].

Results
We included 932 patients from the TTM-trial after excluding six patients due to missing outcomes and one patient due to missing values. Overall, poor outcome (CPC 3–5) was found in 492 (53%) patients, while good outcome (CPC 1–2) was found in 440 (47%) patients [20]. The population characteristics are shown in Additional file 1: Table S1A; Additional file 2: Table S1B; Additional file 3: Table S1C; Additional file 4: Table S1D (Supplement), which include patients' background, prehospital and admission characteristics, and standard ICU observations and treatment, along with biomarkers obtained on days 1, 2, and 3. As shown in Fig. 1, we excluded patients who were deceased or had woken up by 24 h, between 24 and 48 h, and between 48 and 72 h in our analyses of days 1, 2, and 3, respectively. As described in the methods section, three datasets were then created for each day based on the level of additional biomarkers. The number of patients in each dataset, along with the number of variables, is shown in Table 1.
All models from day 1 to day 3 showed good to excellent prognostic performance in predicting neurological outcome at six months (see Fig. 2). Using clinical variables only (level A), the AUROC remained under 90% throughout the first three days of intensive care. Upon adding the clinically accessible biomarkers (level B), the AUROC increased from 82 to 94% (p < 0.01). For the model with research-grade biomarkers (level C), the prognostic performance was excellent from day 1 to day 3, with an AUROC of approximately 95% (see Table 1 and Fig. 2). In summary, adding clinically accessible biomarkers to the clinical variables in level B successively improved the prognostication, whereas levels A and C both showed similar results on all days. Furthermore, as seen in Fig. 2, the sensitivity was above 60% while retaining a specificity of 100% for level B at 72 h and for level C at all time points.
Based on a threshold with 100% specificity (no false positives) for patients in the training set, false-positive predictions (predicted poor outcome, reported good outcome) occurred in two models in the test set: 0.7% (day 1, level C) and 1.2% (day 2, level A). The rate of false-negative predictions (predicted good outcome, reported poor outcome) was high for the majority of the models but remained below 25% throughout days 1–3 when the research-grade biomarkers were included (level C) (see Table 1).
By applying the SHAP algorithm to our models, the predictions could be explained. This is illustrated in Fig. 3, where the ANN model for level C is used to predict the probability of a poor outcome in a patient and the SHAP algorithm is used to explain the prediction. In this example, the patient's age of 77 years increases the risk of a poor outcome, whereas the low levels of biomarkers (i.e. NFL and NSE) decrease the risk of a poor outcome. Furthermore, we used the SHAP algorithm to rank the importance of the variables for every model. In Fig. 4, the ten most important variables for levels A, B, and C on day 2 are ranked by variable importance (see the Supplement for the top ten rankings of variables for days 1 and 3). For level A, age and the dose of adrenaline during resuscitation were the variables carrying the most value. In levels B and C, age was ranked as the third most important variable when more information was gained from biomarkers. The dose of adrenaline was reduced to the sixth most important variable and was not included in the top ten when research-grade biomarkers were added.

Table 1 Overview and prognostic performance
Overview and prognostic performance of the ANN models during the first three days after ICU admission. In level A, we used all available data from the TTM-trial; in level B, we added clinically accessible biomarkers; and in level C, we added research-grade biomarkers as well. The prognostic performance is displayed as the area under the receiver operating characteristic curve (AUROC) and by a confusion matrix. Note that the threshold for the confusion matrix was based on the threshold for 100% specificity in the training set. Table 1 is based on 'Patients included for further data analysis (n = 932)'. TN, true negative; TP, true positive; FN, false negative; FP, false positive. TTM, target temperature management; ICU, intensive care unit


Discussion
In this exploratory retrospective study, we found that models using clinical variables paired with biomarkers and machine learning appear promising in predicting long-term neurological outcome in comatose patients post cardiac arrest. Using only clinical variables resulted in moderate predictive ability. However, when clinically accessible biomarkers such as NSE were added, the predictive capability improved over time, reaching an AUROC of 94%, which supports the role of NSE in current guidelines. As seen in Fig. 4, age, BNP, and platelets also contributed to the prediction on day 3 for level B. With research-grade biomarkers added, the prognostic capability was excellent, with an AUROC of approximately 95% already evident at 24 h post cardiac arrest. Research-grade biomarkers, such as NFL and GFAP, carried the most robust predictive information, and over time they rendered clinical variables largely redundant with little additive value, as supported by our previous paper showing a high AUC value (94%) for NFL alone [5]. Additional analysis of NFL alone showed excellent prognostic value with an AUROC of 92.3%, 92.3%, and 92.5% on days 1, 2, and 3, respectively (based on the test set) when using this study design.

Fig. 2 Receiver operating characteristic (ROC) curves discriminating patients with poor outcome (CPC 3–5) from those with good outcome (CPC 1–2) at six months on an independent test set. The rows represent the different levels of biomarkers added to the available standard intensive care unit (ICU) observations from the TTM-trial: none (level A), clinically accessible biomarkers (level B), and research-grade biomarkers (level C). The columns represent the timeline after admission to the ICU. For each ROC curve, the 95% CI was calculated for specificity at different levels of sensitivity and displayed as a blue 95% CI band. TTM, target temperature management. CI, confidence interval

Fig. 3 Illustration of the impact of features on a patient-specific prediction, using the Shapley additive explanations (SHAP) algorithm. This patient is predicted to have a 23% risk of a poor outcome on day 1, and an 18% and 13% risk of a poor outcome on days 2 and 3, respectively (using the level C model). The patient's age is the most important risk-increasing variable, while the modest levels of biomarkers such as NFL and NSE decrease the risk of a poor outcome. TnT, troponin T (ng/L). NFL, neurofilament light (ng/L). UCHL1, ubiquitin carboxy-terminal hydrolase L1 (ng/L). NSE, neuron-specific enolase (ng/mL). GFAP, glial fibrillary acidic protein (ng/L)

Fig. 4 The ten most important variables on day 2, ranked by variable importance; the right panel shows level C (adding research-grade biomarkers). The ranking of the variables depends on which biomarkers were included in the model. For level A (no biomarkers), age and the dose of adrenaline during resuscitation were the variables carrying the most information. In levels B and C, age was ranked as the third most important variable when more information was provided by biomarkers. The dose of adrenaline was reduced to the sixth most important variable when adding clinically accessible biomarkers and was not included in the top ten variables when including research-grade biomarkers as well. SHAP, Shapley additive explanations. CA, cardiac arrest. ROSC, return of spontaneous circulation. NSE, neuron-specific enolase. BNP, brain natriuretic protein. S100B, S100 calcium-binding protein B. NFL, neurofilament light. GFAP, glial fibrillary acidic protein. Tau, tau protein. UCHL1, ubiquitin carboxy-terminal hydrolase L1
The overall objective in cardiac arrest prognostication is to aim for zero false-positive predictions (predicted poor outcome, reported good outcome), which would result in the withdrawal of life-sustaining therapies in a patient who otherwise would have survived with a good outcome. The specificity is, therefore, arguably more important than sensitivity.
The latest treatment guidelines for comatose cardiac arrest survivors recommend that multimodal prognostication be performed at least 72 h after ROSC. Moseby-Knappe et al. found that the current four-step algorithm for neurological prognostication after cardiac arrest recommended by the European Resuscitation Council (ERC) and the European Society of Intensive Care Medicine (ESICM) identified patients with poor outcome with 39% sensitivity and 100% specificity [34]. These guidelines are based on the consensus opinion of leading experts in the field. In contrast to that study, the current findings suggest that even at 24 h following cardiac arrest the prognostic performance is excellent, with an AUROC of 95% (95% CI 92–98%) and a sensitivity above 60% while retaining 100% specificity, upon adding research-grade biomarkers. These results are mainly driven by the biomarker NFL and are similar to two previous studies, one of which is based on the same data (TTM-trial) [5,6]. In fact, level B on day 3 (72 h), as well as level C at all time points, had a sensitivity above 60% while retaining 100% specificity, which is comparable with the performance reported using the ERC/ESICM guidelines, which include important prognostic modalities such as SSEP, EEG, and neuroradiological imaging. This is noteworthy, especially for level B on day 3, which is mainly driven by NSE values. Both NSE after 72 h and NFL at all time points have previously shown good to excellent performance individually, with sensitivities of 52% and 53–65%, respectively, while retaining 100% specificity [5,35]. Our study differs from those studies by only including comatose patients.
To further investigate this, we based the prediction of poor outcome in the test set on the threshold of 100% specificity in the training set. This does not guarantee a zero false-positive rate in the test set, as it does not account for the lower bound in the confidence interval or outliers. Nevertheless, the false-positive predictions for the model, including research-grade biomarkers, were less than 1% on day 1, and 0% on days 2 and 3 with a reasonably low false-negative rate of 15-25%. Consequently, the specificity for level C on day 1 was not 100% based on the threshold in the training set. For the model including clinically accessible biomarkers on day 3, the false-positive predictions were 0% and the false-negative rate was also reasonably low (25%).
Deep learning algorithms are new for OHCA prognostication. In a study from 2019, Kwon et al. showed that a deep learning algorithm outperformed other types of supervised machine learning based on a validation set of 8,145 patients when predicting a poor outcome (CPC 3–5) at hospital discharge [19]. They reported an AUROC of 95% with narrow confidence intervals based on information available at the time of ROSC only. Their study differs from ours, as we looked at the six-month neurological outcome and focused on comatose patients admitted to the ICU, where the balance between good and poor outcome is around 50%.
To our knowledge, this is the first prediction study using cumulative data in the first three days of ICU admission and the first to combine the predictive capability of different groups of biomarkers by adding them to clinical variables. The cumulative approach is a natural step after testing biomarkers individually to understand how they are ranked and interact over time. The SHAP algorithm allows for each prediction to be explained, which can help both researchers and physicians in treating OHCA patients and understanding the dynamic between the variables better, both for the individual patient and as a group.
The ranking of the variables depends on which biomarkers were included in the models. As shown in Fig. 4, age and the dose of adrenaline during resuscitation were the variables carrying the most information on day 2 (level A). The reason why the dose of adrenaline plays such an important role could be related to its correlation to other predictors such as time to ROSC and initial cardiac rhythm. The biomarkers dominated the top ten ranked variables upon being added to the models. Age remained an important variable in all models because age itself is a risk factor and possibly due to its correlation with comorbidities.
There are several limitations to this investigation. In our previous study, which was based on admission data from the TTM-trial, ANN was superior to logistic regression in predicting long-term neurological outcome [20]. We chose the same approach in this study, well aware that the decreasing number of patients in the days following ICU admission may be too low for ANN models to perform well (see Fig. 1). With very strong biomarkers like NFL and a limited number of patients for model development, ANN might not be superior to logistic regression. Furthermore, the size of the dataset led to moderately wide confidence intervals in the test set, and even though the training/test split was random, the size of the datasets made the test sets vulnerable to patient outliers, thereby affecting the model performance.
Moreover, differences in patient care during the last decade, from prehospital response to post-arrest care, could decrease the generalisability of the models, as they are based on patient data from seven to ten years ago. However, no new evidence-based therapies have been introduced into clinical practice during this period. Additionally, these models are based on a randomised trial which only included patients with OHCA of a presumed cardiac cause, which could affect the generalisability.
Another limitation of this study is that none of the important prognostic examinations, such as EEG, SSEP, or neuroradiological imaging after admission, were included. When only incorporating clinical variables (neither clinically accessible nor research-grade biomarkers), the prognostic performance did not improve during the first three days after ICU admission (see Fig. 2). When comparing the prognostic performance to prediction models based merely on prehospital and admission data, it seems that little, if any, prognostic value is added after hospital admission using only clinical variables (approximately 20 variables in each model, see Table 1) [20,36,37]. The TTM risk score and the ANN model in our previous study were developed on the same population as this study and showed good prognostic performance, with AUROCs of 84.2% and 89.1%, respectively [20,36]. It is a noteworthy finding that prognostic improvement is absent during the first three days after ICU admission. This underlines the need to use other prognostic tools, such as SSEP and EEG, when performing cardiac arrest prognostication in an ICU setting. Furthermore, it also pinpoints the uncertainty of using datasets of this size. When evaluating model performance on an independent test set, the results can be affected by the training/test split and be vulnerable to outliers. For example, the model performance was presumably better at the time of hospital admission than after 24 h without biomarkers [20]. This difference is important and must be kept in mind when discussing this approach to cardiac arrest prognostication.
From an ICU perspective, one of the strengths of this study was that we modified the study population and excluded patients who were either deceased or had woken up during the first 72 h after ICU admission. This strategy distilled the dataset to those patients who were still at risk of a poor outcome at each time point. Without doing so, the performance of the prediction models would be falsely enhanced, as we would be predicting outcome for patients who had already woken up.
To make these models ready for clinical implementation, external validation and model development in a larger population are needed to minimise some of the limitations of this study.

Conclusion
In this exploratory study, ANNs provided good to excellent prognostic accuracy in predicting neurological outcome in comatose patients post OHCA using clinical variables and biomarkers from the first three days of intensive care. The models which included NSE after 72 h and NFL on all days showed promising prognostic performance.