Combining multiple ECG features does not improve prediction of defibrillation outcome compared to single features in a large population of out-of-hospital cardiac arrests

Introduction: Quantitative electrocardiographic (ECG) waveform analysis provides a noninvasive reflection of the metabolic milieu of the myocardium during resuscitation and is a potentially useful tool to optimize the defibrillation strategy. However, whether combining multiple ECG features can improve the capability of defibrillation outcome prediction in comparison to single feature analysis is still uncertain.
Methods: A total of 3828 defibrillations from 1617 patients who experienced out-of-hospital cardiac arrest were analyzed. A 2.048-s ECG trace prior to each defibrillation without chest compressions was used for the analysis. Sixteen predictive features were optimized through the training dataset that included 2447 shocks from 1050 patients. Logistic regression, neural network and support vector machine were used to combine multiple features for the prediction of defibrillation outcome. Performance of single and combined predictive features was compared by area under receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and prediction accuracy (PA) on a validation dataset that consisted of 1381 shocks from 567 patients.
Results: Among the single features, mean slope (MS) outperformed other methods with an AUC of 0.876. Combination of complementary features using neural network resulted in the highest AUC of 0.874 among the multifeature-based methods. Compared to MS, no statistical difference was observed in AUC, sensitivity, specificity, PPV, NPV and PA when multiple features were considered.
Conclusions: In this large dataset, the amplitude-related features achieved better defibrillation outcome prediction capability than other features. Combinations of multiple electrical features did not further improve prediction performance.


Introduction
Early cardiopulmonary resuscitation (CPR) and early defibrillation are the key links in the chain of survival for cardiac arrest patients with shockable rhythms [1,2]. However, the priority of intervention (CPR or immediate defibrillation) and the duration of CPR intervals prior to defibrillation are still debated, particularly in out-of-hospital cardiac arrests (OHCA) with long response times [3][4][5]. Animal studies demonstrated that a high rate of restoration of spontaneous circulation (ROSC) is achieved when the heart is recently perfused, while prolonged untreated ventricular fibrillation (VF) with depleted energy phosphates leads to poor outcome [6]. Clinical studies also indicated that not all VF patients benefit from being treated in the same manner with time-based CPR/defibrillation protocols [2,7]. Optimizing the timing of defibrillation might decrease the severity of postresuscitation myocardial dysfunction by reducing the number of failed shocks and the consequent unnecessary interruptions in chest compression, and therefore has the potential to improve the final outcome of cardiac arrest [8].
Quantitative electrocardiogram (ECG) waveform analysis provides a noninvasive reflection of the metabolic status of the myocardium during resuscitation and is a potential tool to guide and optimize CPR interventions, i.e., chest compression or defibrillation [9]. During the last two decades, numerous features have been developed and used to predict the outcome of defibrillation, including time domain [10][11][12][13][14][15], frequency domain [15][16][17][18][19] and nonlinear measures [20,22]. As combining multiple predictive features may offer complementary information to improve predictive accuracy [16], several studies have attempted to combine different VF features to enhance predictive performance using machine learning, albeit in relatively small populations [25,25]. Whether the combination of multiple predictive features can improve prediction capability for defibrillation outcome compared to single features is still uncertain.
The purpose of the present study was to investigate whether combination of multiple VF features by different machine learning strategies, including logistic regression (LR), artificial neural network and support vector machine (SVM), could improve the prediction capacity of defibrillation outcome using a large multicenter database of OHCA patients.

Data sources
This study was approved by the ethics committee of the coordinating center, San Gerardo University Hospital, Monza, Italy. The institutional review board waived the requirement of informed consent since the data were already collected for administrative and statistical reasons by the National Health System.
A total of 3828 defibrillation shocks from 1617 patients who experienced OHCA were analyzed. The detailed descriptions of the multicenter database and population characteristics have been previously reported [18]. Data included a training set of 2447 defibrillations from 1050 patients and a validation set of 1381 defibrillations from 567 patients. All ECG data were digitally resampled at 250 Hz for compatibility with other studies. A 2.048-s episode (512 samples) free from chest compression was selected immediately prior to each defibrillation. ECG data were preprocessed with bandpass filters of different frequency ranges to remove baseline drift and attenuate artifacts.
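As a rough illustration of the segment selection described above, the following Python sketch takes the last 2.048 s (512 samples at 250 Hz) before a shock. The function name `pre_shock_segment`, the list-based signal and the `shock_index` argument are assumptions made for illustration and do not come from the study; filtering and the check for chest compression artifacts are omitted.

```python
FS_HZ = 250        # sampling rate stated in the text
SEGMENT_S = 2.048  # analysis window length stated in the text


def pre_shock_segment(ecg, shock_index, fs=FS_HZ, duration_s=SEGMENT_S):
    """Return the samples immediately preceding shock_index (hypothetical helper)."""
    n = round(fs * duration_s)  # 250 Hz * 2.048 s = 512 samples
    start = shock_index - n
    if start < 0:
        raise ValueError("not enough signal before the shock")
    return ecg[start:shock_index]
```

The window length is a power of two (512), which is convenient for the Fourier-based features, although the text does not state that this was the reason for the choice.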
Successful defibrillation was defined as the achievement of an organized rhythm with heart rate ≥ 40 beats/min within 60 s postdefibrillation, while shocks resulting in VF, ventricular tachycardia (VT), asystole or pulseless electrical activity with pauses > 3 s were regarded as unsuccessful defibrillations [18]. In the training set, 641 defibrillations (26.2 %) were successful, while considering only the 1050 first defibrillation attempts, 278 (26.5 %) were successful. In the validation set, 445 defibrillations (32.1 %) were successful and 175 (31.0 %) were successful when only the first defibrillations were considered.

Predictive feature selection and optimization
Sixteen predictive features of ECG waveforms with good prediction power in previous clinical studies [12,26] were selected and calculated in this study. Table 1 presents their definitions and equations.
Optimum frequency range of bandpass filters for each feature was obtained with a criterion of maximum area under the receiver operating characteristic (ROC) curve (AUC) using the training data. The boundaries of the lower and upper frequencies for calculating the optimum frequency range were 2-5 Hz and 20-48 Hz, respectively.
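The band selection described above can be sketched as a grid search that keeps the band with the highest training AUC. In this sketch the AUC is computed from pairwise comparisons (the Mann-Whitney formulation), the 1 Hz step is an assumption because the text gives only the boundaries, and `auc_for_band` is a hypothetical callback standing in for "filter the training segments, compute the feature, score it".

```python
def auc(scores, labels):
    """AUC as the probability that a successful shock scores above a failed one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def best_band(auc_for_band, lows=range(2, 6), highs=range(20, 49)):
    """Return (auc, low_hz, high_hz) for the band with the highest AUC."""
    best = None
    for low in lows:        # lower cutoff searched over 2-5 Hz
        for high in highs:  # upper cutoff searched over 20-48 Hz
            value = auc_for_band(low, high)
            if best is None or value > best[0]:
                best = (value, low, high)
    return best
```

The pairwise AUC is quadratic in the number of shocks, which is acceptable for a sketch; a rank-based implementation would be preferable for thousands of shocks.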

Combination methods
Three machine learning techniques (LR, neural network and SVM) were used to combine different VF features for the prediction of defibrillation outcome.

Logistic regression
In the LR model, optimal features (with p values all less than 0.0001) were automatically selected from the 16 features on the training data by forward stepwise selection using the likelihood ratio test. The LR equation for prediction was $\frac{1}{1+\exp\left(-\beta_0-\sum_{n}\beta_n y_n\right)}$, where $\beta_0$ is the regression constant and $\beta_n$ is the regression coefficient of the nth selected feature $y_n$. The predictive values of the validation data were obtained from the corresponding LR equations, and a threshold (for successful or unsuccessful decisions) was chosen at which sensitivity equaled specificity on the training data.
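A minimal sketch of the two steps in this paragraph follows: evaluating the LR equation for one shock, and picking the training threshold at which sensitivity and specificity are closest to equal. The function names and the convention that a score at or above the threshold predicts success are assumptions; the fitted coefficients themselves are not given in the text and are therefore passed in as arguments.

```python
import math


def lr_probability(beta0, betas, features):
    """Evaluate 1 / (1 + exp(-beta0 - sum(beta_n * y_n)))."""
    z = beta0 + sum(b * y for b, y in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold predicts success."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(scores, labels):
    """Training threshold minimizing |sensitivity - specificity|."""
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

With a finite sample, exact equality of sensitivity and specificity may not be attainable, so the sketch returns the closest candidate; how the study resolved such cases is not stated.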

Neural network
The back propagation (BP) neural network with a feed forward structure was used in the training set to achieve an optimal outcome. The training process adopted the Bayesian regularization training function, two hidden layers, and sigmoid and linear transfer functions. All features in the training and validation sets were normalized by subtracting the mean and dividing by the standard deviation. The AUC was calculated from the direct outputs of the BP neural network, as these outputs were not binary decisions (0 or 1). For compatibility, a threshold with equal sensitivity and specificity for the training data was used to produce a binary decision. Combination of all features (BP-C1), combination of features with a high predictive power (with AUC > 0.8) (BP-C2) and combination of complementary features (correlation coefficient r < 0.3) (BP-C3) were tested by the BP neural network, respectively.
Table 1 notes: x(t) has sampling rate f_s and mean value x̄. A_j indicated the amplitude of the Fourier transform of x(t) at frequency f_j (j = 1, 2, …, M). P_x(f_j) specified samples of the power spectral density of x(t) at frequency f_j. W_x(c_j) represented samples of high-band coefficients of the wavelet transform of x(t). L in PPA indicated L subintervals; L in SPE indicated L frequency bands. Function R(·) was taken as the difference between the maximum and minimum deviation from time period "i". Function S(·) calculated the standard deviation for time period "i".
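The normalization step in the neural network paragraph can be illustrated with a short z-score sketch. The text does not say whether the mean and standard deviation were taken from the training set alone, nor whether the sample or population standard deviation was used; the sketch assumes training-set statistics and the sample standard deviation, and both function names are hypothetical.

```python
from statistics import mean, stdev


def fit_zscore(train_column):
    """Mean and sample standard deviation of one feature (assumed: training set only)."""
    return mean(train_column), stdev(train_column)


def apply_zscore(column, mu, sd):
    """Subtract the mean and divide by the standard deviation."""
    return [(value - mu) / sd for value in column]
```

For example, `mu, sd = fit_zscore(train_ms)` followed by `apply_zscore(valid_ms, mu, sd)` would scale a validation feature with the training statistics, where `train_ms` and `valid_ms` are placeholder lists of one feature's values.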

Support vector machine
In the SVM model, a Gaussian radial basis function was selected as the kernel function with an error penalty factor (C = 1) and a scaling factor (σ = 0.01).
Choosing small values for the error penalty factor and the scaling factor was intended to make the risk function of SVM have solutions for large training data. Combinations of all features (SVM-C1), high predictive power features (SVM-C2) and complementary features (SVM-C3) were also adopted in the training and validation processes of SVM.
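The kernel named above can be written out as follows. The text does not say how the scaling factor σ enters the kernel, so the common form K(x, y) = exp(-||x - y||² / (2σ²)) is assumed here; some libraries parameterize the same kernel with γ = 1/(2σ²) instead.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian radial basis function kernel (assumed parameterization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Under this assumed parameterization, σ = 0.01 makes the kernel value between two feature vectors one unit apart numerically zero, so each training point influences only its immediate neighborhood.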

Statistical analysis
The prediction power was assessed by ROC curves, AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and prediction accuracy (PA) [18,26]. For compatibility with the machine learning techniques, sensitivity, specificity, PPV, NPV and PA of single features for the validation data were calculated with the threshold at which sensitivity equaled specificity for the training data. Pearson's correlation coefficients were calculated among single features for correlation analysis. AUCs were compared using a Z-test. The chi-squared test was used to test for differences in sensitivity, specificity, PPV, NPV and PA among the different predictive features. A two-tailed p value < 0.05 was considered statistically significant.
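The threshold-dependent measures listed above follow directly from the confusion matrix. The sketch below uses the standard definitions, with a successful defibrillation as the positive class; the function name and the dictionary keys are illustrative choices rather than anything specified in the text.

```python
def prediction_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures; successful shock is the positive class."""
    return {
        "sensitivity": tp / (tp + fn),         # successful shocks correctly predicted
        "specificity": tn / (tn + fp),         # failed shocks correctly predicted
        "ppv": tp / (tp + fp),                 # predicted successes that succeeded
        "npv": tn / (tn + fn),                 # predicted failures that failed
        "pa": (tp + tn) / (tp + fp + tn + fn)  # overall prediction accuracy
    }
```

Because PPV and NPV depend on the proportion of successful shocks, they are not directly comparable between datasets with different success rates, such as the training and validation sets here.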

Performance of single features
ROC curves and AUCs of the candidate features for all and the first defibrillations in training and validation datasets are reported in Fig. 1. All the 16 candidate VF features, except for peak frequency (PF), centroid frequency (CF), spectral flatness measure (SFM), and Hurst index (Hu), showed a high AUC, i.e., > 0.8. More specifically, mean slope (MS) and amplitude spectral area (AMSA) had the highest AUC values (0.876) for all defibrillations, while MS had the highest AUC value (0.873) for the first defibrillations in the validation set. Median slope (MdS), power spectrum analysis (PSA), average peak-to-peak amplitude (PPA), signal integral (SignInt), root mean square (RMS), amplitude range (AR), wavelet energy (WE) and energy (EG) also had an AUC value greater than 0.845 (p was not significant vs. MS for all and/or for the first defibrillations). Considering all the defibrillation attempts, AUCs for spectrum entropy (0.848, p = 0.024 vs. MS) and max power (0.847, p = 0.020 vs. MS) were relatively lower when compared with MS, but no significant differences were observed when the first defibrillations were considered. Additionally, AUCs for PF (0.619/0.607), CF (0.565/0.547), SFM (0.489/0.401) and Hu (0.478/0.445) were significantly lower compared with MS (p < 0.001), both for all and first defibrillations.
Correlation analysis demonstrated that most of the features were significantly correlated with each other (Table 2). Amplitude-related features, such as MS, AMSA, MdS, SignInt, PSA, PPA, WE, AR and RMS were strongly correlated with each other (r > 0.807, p < 0.001). For frequency-related methods, CF was highly correlated with PF (r = 0.770, p < 0.001) and SFM (r = 0.829, p < 0.001). Poor correlations were observed among the other measures.
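The correlation analysis, and the r < 0.3 criterion used to call features complementary, can be sketched as below. Whether the criterion was applied to r or to its absolute value is not stated, so the sketch assumes the absolute value; `complementary_pairs` and the dictionary layout (feature name mapped to a list of values, one per shock) are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def complementary_pairs(features, r_max=0.3):
    """Feature pairs with |r| below r_max (assumed reading of the criterion)."""
    names = sorted(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson_r(features[first], features[second])) < r_max:
                pairs.append((first, second))
    return pairs
```

A pairwise filter like this only identifies weakly correlated pairs; how the study assembled the final complementary set from such pairs is not described in this section.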

Performance of combined features
The performance of combined features in the validation set for all and first defibrillations is listed in Tables 3 and 4.

Comparison between single and combined features with optimal performance
Since BP-C3 outperformed the other combination strategies and MS had the best performance among the single-feature methods, the prediction capacity of MS and BP-C3 was compared for all (Table 3) and first (Table 4) defibrillations.

Discussion
In the present study, we investigated whether combination of multiple VF features could improve the capability of defibrillation outcome prediction using a large multicenter database from OHCA patients by machine learning strategies. The results indicated that the amplitude-related features outperformed other single waveform measures, while combining multiple VF features did not further improve the capability of defibrillation prediction.
Accuracy in predicting defibrillation outcome during resuscitation of VF cardiac arrest patients provides the potential to significantly enhance resuscitative strategies and improve patient outcomes. A considerable number of defibrillation predictors have been proposed and shown to be promising in estimating VF duration, predicting defibrillation outcome and return to organized rhythm, and prognosticating long-term survival [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Current best predictors achieve an AUC in predicting defibrillation outcome of 0.87, with a balanced sensitivity and specificity of approximately 80 %. The above approaches already have a high predictive power; nevertheless, research identifying approaches that might further improve the accuracy of defibrillation outcome prediction for OHCA is still ongoing. A possible solution is to use patient-specific information in the ECG-based prediction model. In an earlier study, Monsieurs et al. showed that adding age to the prediction formula increased the correct classification of survivors and nonsurvivors in 100 OHCA victims [27]. However, no significant improvement was obtained by including age, sex, presenting rhythm, presence of bystander CPR and ambulance response time when six different single prediction features were investigated in 530 shocks from 86 patients [28].
Fig. 1 Receiver operating characteristic (ROC) curves and area under ROC curves (AUC) of the 16 predictive features for training and validation datasets. 1st first defibrillations, All all defibrillations, AMSA amplitude spectral area, P-P amplitude average peak-peak amplitude, RMS root mean square, SFM spectral flatness measure, T training set, V validation set
Another practical approach to improve the predictive performance of current ECG analysis is to combine multiple VF features using machine learning strategies. In a dataset of 883 defibrillations from 156 OHCA patients, Eftestøl et al. demonstrated that the combination of two decorrelated spectral features based on the principal … [30]. In another clinical study, Neurauter et al. compared the performance of ten single predictive features and their combinations in 770 countershock attempts from 197 patients, and verified that combination of these predictive features using neural networks could not improve outcome prediction [25]. Recently, Shandilya et al. predicted defibrillation success using a parametrically optimized SVM model from a database of 90 precountershock ECG signals. The PA (82.2 % vs. 64.6 %) and AUC (0.850 vs. 0.609) were considerably improved by combining six to ten features compared with single feature-based AMSA [22]. Howe et al. investigated an alternative SVM-optimized classification approach, which combined multiple metrics with acceptable predictive attributes in a total of 115 defibrillations from 41 patients [24]. In contrast to the 86 % sensitivity and 60 % specificity for single feature AMSA, performance of the combined features was improved to a sensitivity of 87.6 % and a specificity of 71.6 % for the prediction of return of organized rhythm.
Table notes: MS mean slope, AMSA amplitude spectral area, SignInt signal integral, PSA power spectrum analysis, PPA average peak-to-peak amplitude, MdS median slope, MP max power, PF peak frequency, CF centroid frequency, EG energy, SFM spectral flatness measure, WE wavelet energy, AR amplitude range, SPE spectrum entropy, RMS root mean square, Hu Hurst index. *p < 0.05, **p < 0.01
Besides the differences in machine learning methods and feature selection [22][23][24][25], the relatively small sample sizes and the lack of multicenter data might be responsible for the conflicting conclusions when multiple features were applied to predict defibrillation outcome. In previous clinical studies, data were usually split into training and validation sets to test the performance of predictors or designed parameters [17,18][22][23][24][25]. Switching the roles of the two sets by crossvalidation was frequently adopted to increase the expected reliability in studies with relatively small sample sizes [22][23][24][25]. Nevertheless, the test performances were then considered in the design of the classifiers to optimize and generalize parameters [17]. The crossvalidation strategy would thereby influence the design process and bias the validation results.
Our results, obtained from the largest database of ECG traces on OHCA patients to date, showed that amplitude-related measures, such as MS, AMSA, MdS, PSA and PPA, outperformed frequency and nonlinear-based methods when ranked by AUC and exhibited similar shock success prediction performance, consistent with the studies of Wu and Firoozabadi et al. [14,26]. However, combining multiple VF features did not further improve the capability of defibrillation prediction in comparison to single features. This result was consistent with Neurauter et al. [25] when a neural network was used but conflicted with the study of Howe et al. [24] when SVM was applied to combine multiple features. Notably, limited clinical data (115 defibrillations from 41 patients) were used in a crossvalidation SVM approach by Howe et al. [24], which might have caused biased validation results. Moreover, SVM usually keeps a desirable predictive performance for a small number of samples, but a large number of noisy samples may cause overfitting and overspecialization during the training process of SVM and create a negative bias in accuracy when the validation data are passed through the model [31]. Although overfitting also occurred with the neural network with multiple hidden layers, the neural network seemed more robust than SVM for a large number of training samples, which was caused by the different optimization functions and output variable forms employed in these two machine learning methods [31]. The unimproved prediction power of multiple VF features may be due to the limited information obtained from ECG signals and indicates that various single VF features, such as MS and AMSA, already reached the maximum prediction power extractable from VF ECGs. Besides ECG waveform characteristics, outcome of defibrillation is related to other patient factors, such as drug treatments, comorbidities, and Emergency Medical Systems (EMS) arrival time.
Additional clinically relevant attributes, independent from ECG waveform metrics, such as end-tidal carbon dioxide, blood pressure, blood oxygen saturation and compression depth, might be considered to further improve prediction power [22]. From another point of view, longitudinal ECG data often include repeated defibrillations for each patient. The treatment effects and relative changes of a given predictive feature may enhance prediction performance to some degree.
We recognize that several limitations need to be considered in this study. First, this was a retrospective study on prospectively collected data. Sixteen predictive features were calculated only during the predefibrillation hands-off time and not in real time during chest compression. Second, successful defibrillation was defined as sustained ROSC, but long-term survival was not considered. Peri-arrest factors such as age, sex, presenting rhythm, EMS arrival time, drug treatments and comorbidities were not analyzed in this study. Third, further studies including independent ECG waveform metrics should be tested in future prospective evaluations.

Conclusions
In this large population of OHCA patients, amplitude-related features such as MS, AMSA and MdS achieved better prediction power of defibrillation outcome than other features. Combining multiple electrical features did not further improve prediction performance in comparison to the single features.

Key messages
The electrical features obtained from the ECG waveform are promising in the prediction of defibrillation outcome. However, most of these features are highly correlated with each other.
The amplitude-related features achieve better defibrillation outcome prediction capability than other features. Combinations of multiple electrical features using machine learning strategies do not further improve prediction performance.