Assessment of performance of four mortality prediction systems in a Saudi Arabian intensive care unit.

Introduction: The purpose of this study was to assess the performance of the Acute Physiology and Chronic Health Evaluation (APACHE) II, Simplified Acute Physiology Score (SAPS) II, and Mortality Probability Model MPM II0 and MPM II24 systems in a major tertiary care hospital in Riyadh, Saudi Arabia.

Methods: The following data were collected prospectively on all consecutive patients admitted to the intensive care unit (ICU) between 1 March 1999 and 31 December 2000: demographics, APACHE II and SAPS II scores, MPM variables, and ICU and hospital outcome. Predicted mortality was calculated using the original regression formulas. The standardized mortality ratio (SMR) was computed with 95% confidence intervals (CIs). Calibration was assessed by calculating the Lemeshow–Hosmer goodness-of-fit C-statistic. Discrimination was evaluated by calculating the area under the receiver operating characteristic curve (ROC AUC).

Results: Mortality predicted by all systems was not significantly different from actual mortality [SMR for MPM II0: 1.00 (0.91–1.10); APACHE II: 1.00 (0.8–1.11); SAPS II: 1.09 (0.97–1.21); MPM II24: 0.92 (0.82–1.03)]. Calibration was best for MPM II24 (C-statistic 14.71, P = 0.06). Discrimination was best for MPM II0 (ROC AUC 0.85), followed by MPM II24 (0.84), APACHE II (0.83) and SAPS II (0.79).

Conclusions: In our ICU population: (1) overall mortality prediction, estimated by the standardized mortality ratio, was accurate, especially for MPM II0 and APACHE II; (2) MPM II24 had the best calibration; (3) SAPS II had the lowest calibration and discrimination. The local performance of MPM II24, in addition to its ease of use, makes it an attractive model for mortality prediction in Saudi Arabia.


Introduction
Mortality prediction systems have been advocated as means of evaluating the performance of intensive care units (ICUs) [1]. These systems allow adjustment for the severity of illness of the patient population. Acute Physiology and Chronic Health Evaluation (APACHE) II and Simplified Acute Physiology Score (SAPS) II measure severity of illness by a numeric score [2,3] based on physiologic variables selected because of their impact on mortality: the sicker the patient, the more deranged the values and the higher the score. The numeric scores are then converted into predicted mortality by using a logistic regression formula developed and validated on populations of ICU patients. The Mortality Probability Models (MPM) II differ slightly in that they use categorical variables (with the exception of age) for mortality prediction [4].
Before the clinical application of any of these systems, they must be validated on the population under evaluation [5,6]. These systems have been assessed for validity in several countries [7-9]. We report here the results of our validation study of the four systems in an ICU population in a tertiary care center in Saudi Arabia.

Methods
King Fahad National Guard Hospital is a 550-bed tertiary care center in Riyadh, Saudi Arabia. The 12-bed medical-surgical ICU has 600 admissions a year. The hospital also has a coronary care unit and a cardiac surgical intensive care unit; patients admitted to these units were not included in the study. The unit is run by full-time intensivists and has 24-hour immediate access to other medical and surgical specialties. Our nurse-to-patient ratio is approximately 1:1.2. This high ratio has been maintained because of the high acuity of care. Our ICU database was established in March 1999 to record ICU admissions. The present study presents information on all consecutive admissions between 1 March 1999 and 31 December 2000. Data were collected by one of the intensivists (Y.A., S.H. or R.G.). To minimize variability in data collection, one physician (Y.A.) coordinated the overall process. In addition, a written reference containing the definitions used in the original articles was prepared. Patients aged 16 years or more were eligible for the study, with the exception of burn patients and brain-dead patients. For patients admitted to the ICU more than once in the same hospitalization, data from the first admission were used. Approval from the hospital Ethics Committee was not required because the information had already been collected for clinical reasons.
The following data were collected: demographics, APACHE II and SAPS II scores, and MPM variables. MPM II0 data were obtained on all admissions, whereas MPM II24, APACHE II and SAPS II data were obtained on patients who stayed for 24 hours or more in the ICU. APACHE II and SAPS II scores were calculated in accordance with the original methodology, using the worst physiologic values in the first ICU day [2,3]. The only exception was the Glasgow Coma Score (GCS). Many of these patients were under the influence of sedation, and the worst GCS would reflect the effect of sedation more than the true underlying mental status. We therefore used the worst GCS value for non-sedated patients and the pre-sedation score for patients under sedation, as described previously [4,10-12]. The main reason for ICU admission, whether the admission was after emergency surgery, and the presence of severe chronic illness were documented in accordance with the original definitions [2]. Postoperative patients with sepsis or cardiac arrest were included with non-operative patients with these conditions [2]. ICU and hospital length of stay (LOS) and lead time (the interval from hospital admission to ICU admission) were calculated. Vital status at discharge from the ICU and from the hospital was recorded.
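As an illustration only (the study used manual data collection, not software), the GCS selection rule described above can be sketched as follows; the function name and inputs are hypothetical:

```python
def gcs_for_scoring(first_day_gcs_values, sedated, pre_sedation_gcs=None):
    """GCS value used for scoring, per the rule in the text (sketch):
    the worst (lowest) value in the first ICU day for non-sedated
    patients, the pre-sedation score for patients under sedation."""
    if sedated:
        return pre_sedation_gcs
    return min(first_day_gcs_values)
```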
Predicted hospital mortality was calculated with the logistic regression formulas described originally [2-4]. The standardized mortality ratio (SMR) was calculated by dividing observed hospital mortality by predicted hospital mortality. The 95% confidence intervals (CIs) for SMRs were calculated by regarding the observed mortality as a Poisson variable and then dividing its 95% CI by the predicted mortality [7].
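The SMR calculation described above can be sketched as follows. This is an illustrative implementation, not the authors' code: it uses a normal approximation to the Poisson 95% CI of the observed death count rather than an exact interval, and the function name and data are hypothetical.

```python
import math

def smr_with_ci(observed_deaths, predicted_probs, z=1.96):
    """Standardized mortality ratio with an approximate 95% CI (sketch).

    Expected deaths = sum of the individual predicted death
    probabilities. The observed count is treated as a Poisson variable;
    a normal approximation (mean d, variance d) gives its 95% CI, which
    is then divided by the expected deaths, as described in the text."""
    expected = sum(predicted_probs)
    d = observed_deaths
    half_width = z * math.sqrt(d)
    return d / expected, (d - half_width) / expected, (d + half_width) / expected
```

For example, 100 observed deaths against 100 expected deaths give an SMR of 1.00 with an approximate 95% CI of 0.80-1.20.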
Validation of the systems was tested by assessing calibration and discrimination. Calibration (the ability to provide risk estimates corresponding to the observed mortality) was assessed by calibration curves and the Lemeshow-Hosmer goodness-of-fit C-statistic [11]. Calibration curves were drawn by plotting predicted against actual mortality for groups of the patient population stratified by 10% increments of predicted mortality. To calculate the C-statistic, the study population was stratified into ten deciles with approximately equal numbers of patients. The predicted and actual numbers of survivors and non-survivors were compared with formal goodness-of-fit testing to determine whether the discrepancy was statistically significant; good calibration corresponds to a non-significant discrepancy (P > 0.05).
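The decile-based goodness-of-fit computation described above can be sketched as follows, under the usual Hosmer-Lemeshow formulation; this is an illustrative implementation, not the authors' code, and the names are hypothetical:

```python
def hosmer_lemeshow_c(pred_probs, outcomes, groups=10):
    """Hosmer-Lemeshow C-statistic over risk deciles (sketch).

    pred_probs: predicted death probabilities; outcomes: 1 = died,
    0 = survived. Patients are sorted by predicted risk and split into
    `groups` bins of roughly equal size; the statistic sums
    (observed - expected)^2 scaled by the binomial variance per bin,
    and is typically referred to a chi-square distribution."""
    pairs = sorted(zip(pred_probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        e_g = sum(p for p, _ in chunk)   # expected deaths in bin
        o_g = sum(y for _, y in chunk)   # observed deaths in bin
        p_bar = e_g / n_g
        stat += (o_g - e_g) ** 2 / (n_g * p_bar * (1 - p_bar))
    return stat
```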
Discrimination was tested by receiver operating characteristic (ROC) curves and 2 × 2 classification matrices. ROC curves were constructed with 10% stepwise increments in predicted mortality [14,15]. The four curves were compared by computing the areas under them. Classification matrices were constructed at decision criteria of 10%, 30% and 50%. Sensitivity, specificity, positive and negative predictive values, and the overall correct classification rate were calculated.
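The matrix metrics listed above can be sketched as follows (an illustrative implementation with hypothetical names; a threshold of 0.5 corresponds to the 50% decision criterion):

```python
def classification_metrics(pred_probs, outcomes, threshold=0.5):
    """Metrics from a 2 x 2 classification matrix at a decision threshold.

    A patient is classified as a predicted non-survivor when the
    predicted mortality is at or above the threshold; outcomes use
    1 = died, 0 = survived. (An empty matrix cell would divide by
    zero; that case is not handled in this sketch.)"""
    tp = fp = tn = fn = 0
    for p, y in zip(pred_probs, outcomes):
        if p >= threshold:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            fn, tn = fn + (y == 1), tn + (y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "correct_rate": (tp + tn) / len(outcomes),
    }
```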
Minitab for Windows (Release 12.1, Minitab Inc.) was used for statistical analysis. Continuous variables were expressed as means ± SD and were compared by the standard t-test. Categorical variables were expressed as absolute and relative frequencies and were analyzed by the χ2 test. Linear regression and logistic regression analyses were used where appropriate. P ≤ 0.05 was considered significant.

Results
During the study period there were 1084 admissions to the ICU. Of these, 94 re-admissions, six brain-dead patients and 15 patients with incomplete data were excluded.

Patient population
The demographics of the 969 eligible patients are shown in Table 1. It is noteworthy that 32% of all patients had one or more severe chronic illnesses. Severe hepatic disease was the leading chronic illness, followed by immunosuppression, severe respiratory illness, renal illness and cardiovascular illness. Some patients had more than one severe chronic illness. In comparison with survivors, non-survivors were older and had a longer lead time and ICU LOS but a shorter hospital LOS. They had higher APACHE II and SAPS II scores. The type of admission was more likely to be medical or emergency surgical in non-survivors, and elective surgical in survivors. Severe chronic illness was more common in non-survivors, especially liver disease.

Table 2 shows the actual mortality and the mortality predicted by all four systems, together with the SMRs and their 95% CIs. The SMRs for all systems were not significantly different from 1, indicating accurate overall mortality prediction.

Predicted mortality by subgroups
When subgrouped by categories of main reason for ICU admission (Table 3), SMRs for the four systems were not significantly different from 1, with few exceptions. MPM II0 and MPM II24 overestimated mortality for the category 'non-operative trauma', and SAPS II underestimated mortality for the category 'other medical'. When subgrouped by source of admission, all SMRs were not significantly different from 1, with the exception of the underestimation of mortality by SAPS II for patients admitted from the floor.

Calibration
Calibration curves are shown in Figure 1. Of the four systems, SAPS II stands out as deviating from the identity line in most of the strata, including those with large numbers of patients. The SAPS II curve shows an underestimation of mortality in low-risk patients and an overestimation of mortality in high-risk patients, leading to the skewed appearance of its calibration curve. The curves of the other three systems fell on the identity line in the strata with large numbers of patients and deviated in some other strata.
The results of the Lemeshow-Hosmer goodness-of-fit tests are shown in Table 4. The C-statistic was best for MPM II24 (14.71), with P = 0.06. For the other three systems, calibration tested by the C-statistic was poor. These results indicate that, of the four systems, MPM II24 had the least statistically significant discrepancy between predicted and observed mortality across the strata of increasing predicted mortality.

Discrimination
Figure 2 shows the receiver operating characteristic (ROC) curves for the four systems. The corresponding areas under the curves were as follows: MPM II0, 0.85; MPM II24, 0.84; APACHE II, 0.83; SAPS II, 0.79. These values reflect the better discriminative power of the first three systems compared with SAPS II.
The results of the 2 × 2 classification matrix are shown in
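The area under an ROC curve reported above can equivalently be computed as the probability that a randomly chosen non-survivor received a higher predicted mortality than a randomly chosen survivor (the Mann-Whitney interpretation). A minimal sketch, not the authors' code:

```python
def roc_auc(pred_probs, outcomes):
    """ROC AUC via the Mann-Whitney formulation: the fraction of
    (non-survivor, survivor) pairs in which the non-survivor received
    the higher predicted mortality, counting ties as one half."""
    deaths = [p for p, y in zip(pred_probs, outcomes) if y == 1]
    survivors = [p for p, y in zip(pred_probs, outcomes) if y == 0]
    wins = sum((d > s) + 0.5 * (d == s) for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))
```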

Correlation of predicted mortalities by the four systems
On the basis of linear regression analysis, the mortalities predicted by all four systems correlated with each other (P < 0.001 for all combinations). The closest correlation was between MPM II0 and MPM II24 (r2 = 0.67), followed by APACHE II and SAPS II (r2 = 0.66), MPM II24 and SAPS II (r2 = 0.62), MPM II24 and APACHE II (r2 = 0.56), MPM II0 and SAPS II (r2 = 0.52) and MPM II0 and APACHE II (r2 = 0.48). Figure 3 shows plots for the highest and lowest correlations.
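The squared correlation coefficient used above can be sketched as follows (an illustrative implementation; the function name is hypothetical):

```python
def r_squared(x, y):
    """Squared Pearson correlation between two vectors of predicted
    mortalities: (covariance)^2 / (variance_x * variance_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)
```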

The effect of lead time and ICU LOS on hospital mortality
By univariate analysis, lead time was a significant predictor of hospital outcome (odds ratio 1.02, 95% CI 1.00-1.03 per day, P = 0.04). However, when adjusted for the severity of illness estimated from the mortality predicted by any of the four systems, lead time was not an independent predictor of hospital outcome. Similarly, ICU LOS was a significant predictor of hospital outcome by univariate analysis (odds ratio 1.02, 95% CI 1.00-1.03 per day, P = 0.01) but not when adjusted for severity of illness by multivariate analysis.
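For interpretation, an odds ratio of 1.02 per day means that the odds of hospital death multiply by 1.02 for each additional day of lead time (or ICU stay). A sketch of converting a logistic regression coefficient to an odds ratio; the function is hypothetical:

```python
import math

def odds_ratio(beta, units=1):
    """Odds ratio for a `units` increase in a covariate whose
    logistic regression coefficient is `beta`: exp(beta * units)."""
    return math.exp(beta * units)

# A coefficient of ln(1.02) per day corresponds to an odds ratio of
# 1.02; over 10 days the odds of death multiply by about 1.22.
```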

Discussion
The main findings of this validation study on a Saudi Arabian ICU population can be summarized as follows: (1) overall mortality prediction, estimated by SMR, was reasonably accurate, especially for MPM II0 and APACHE II; (2) MPM II24 had the best calibration by C-statistic; (3) SAPS II had the lowest calibration and discrimination.
Available online http://ccforum.com/content/6/2/166

There is great international variability in patient mix and severity of illness [7-9,16-18]. Some of the differences are inherent in the patient population. For example, patients with cirrhosis have a high severity of illness and a poor prognosis when admitted to the ICU [19-21]. Case mix is also affected by the type of hospital, for example whether it is a primary or tertiary care center, or a transplant or trauma center. Patients referred from other hospitals have a higher severity of illness and mortality than direct admissions [22]. Other factors are practice-related. An important example is 'do-not-resuscitate' (DNR) practice [23,24]. Early (pre-ICU) determination of DNR status reduces the number of futile admissions to the ICU, leading to a probable reduction in overall severity of illness. Another example relates to ICU bed availability: it has been documented that physicians tend to be more selective in their ICU admissions at times of bed shortage, with patients with higher severity of illness being admitted [25]. In our study, the relatively high severity of illness was probably related to a combination of all these factors. Being a tertiary care center, our hospital receives referrals from other hospitals, some of them directly to ICU (

Figure 1
Calibration curves for the four mortality prediction systems.

admission to the ICU. Consequently, many desperately ill patients are admitted to the ICU with very high severity of illness and no meaningful chance of survival. The nursing shortage in our ICU led to selecting the sicker admissions and decreasing the number of elective admissions.
In this study we showed that overall mortality prediction was accurate but calibration was inadequate. Potential reasons for insufficient calibration include the following: (1) factors related to the calibration methodology itself; (2) factors related to data collection, namely intra-observer and inter-observer variability; (3) variability in GCS determination; (4) differences in case mix; (5) differences in DNR policies; and (6) differences in medical care.
The results of Lemeshow-Hosmer statistics are dependent not only on the calibration of the model but also on patient numbers and the distribution of the estimates [26]. The results of the calibration tests in our study might seem inconsistent with the overall estimate of mortality of the whole population and with the mortality estimates of the subgroups (Table 2). This apparent inconsistency is probably related to the distribution of estimates.
The overestimation of mortality in certain strata of severity of illness is 'counterbalanced' by underestimation in other strata (Table 4), leading to 'perfect' estimation when the whole population or a specific subgroup is considered.
Interobserver variability in data collection has been documented in several studies [27-29]. This is potentially relevant to our study because data collection was performed by several physicians and over a relatively long period (22 months). We tried to minimize variability by having a written reference of definitions based on the original articles of the various scoring systems and by having one person coordinate the process of data collection. Similar approaches have been shown to minimize variability [30,31]. MPM II systems have been documented to have high reproducibility, which might explain the better calibration of MPM II24 in our study [32].
The variability of GCS determination in sedated patients accounts for much of the variability in APACHE II scoring. Several approaches to determining the GCS have been used previously. For non-sedated patients we used the worst value in the first 24 hours; for sedated patients we used the pre-sedation GCS. This approach has been shown to be associated with better performance of APACHE II than the approach that assumes a normal GCS for sedated patients [10]. It also follows the definitions of the original MPM II article [4] and is consistent with the approach described by Knaus and others [11,12].
Another potential reason for the inadequate calibration is the difference in case mix between our database and the development databases of the mortality prediction systems. Medical patients constitute a larger proportion of our database (68%) than of the development databases (MPM II0, 45%; MPM II24, 48%; SAPS II, 49%; APACHE II, 58%) [2-4]. When the main diagnostic categories in our database are compared with those in the development database of APACHE II, some interesting differences appear. The post-cardiac-arrest category, which is associated with a high mortality risk (APACHE II diagnostic category weight 0.393), accounts for 7% of our admissions, compared with 3% in the APACHE II development database. The postoperative category 'peripheral vascular surgery', which is associated with a low mortality risk (APACHE II diagnostic category weight -1.315), accounts for 1.5% of our admissions, compared with 9.82% of the development database. Our database also has more severe chronic illness (32%), compared with 5-29% in the different ICUs participating in the APACHE II development database. This is partly related to the high percentage of patients with end-stage liver disease admitted to our ICU (12.5% of all patients).

Critical Care April 2001 Vol 6 No 2 Arabi et al.

Figure 2
Receiver operating characteristic (ROC) curves for the four systems.

Figure 3
Plots of predicted mortality for the systems with the highest intersystem correlation (MPM II0 versus MPM II24, left) and the lowest intersystem correlation (MPM II0 versus APACHE II, right).
Differences in admission practices to ICUs might also have an impact on ICU outcome. The delay in DNR orders was mentioned earlier. This factor probably contributes to our high severity of illness and could have affected system calibration.
Finally, one must examine whether the inadequate calibration is related to differences in medical care. However, in our study, overall actual mortality was not different from predicted mortality, as is evident in the SMRs (Table 2). Furthermore, when the calibration curves are examined there is no consistent pattern of overestimation or underestimation of mortality among the four systems in the different strata of severity of illness. This suggests that the inadequate calibration is inherent in these systems when applied to this population and less likely to be related to gross variation in medical care.
Our study has some limitations. First, it is a single-center study, which biases it towards a particular case mix. Second, as discussed above, collecting the data over a relatively long period and by different physicians has implications for the consistency of data collection. The use of a written reference of definitions and the assignment of a coordinator to oversee the whole process probably decreased variability, as has been shown previously [28,29]. In addition, such variability has been found to be random and to have little impact on overall estimates [29]. A multicenter study would have addressed some of these concerns.
In conclusion, the four mortality prediction systems gave accurate overall estimates of mortality, especially MPM II0 and APACHE II. Calibration was modest for MPM II24 and inadequate for the others. SAPS II had the lowest calibration and discrimination. The local performance of the MPM II systems (particularly MPM II24), in addition to their ease of use, makes them attractive models for use in Saudi Arabia. However, a multicenter national study is needed to confirm these findings.