Skip to main content


Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Assessment of performance of four mortality prediction systems in a Saudi Arabian intensive care unit

  • 9119 Accesses

  • 28 Citations



The purpose of this study is to assess the performance of Acute Physiology and Chronic Health Evaluation (APACHE) II, Simplified Acute Physiology Score (SAPS) II, Mortality Probability Model MPM II0 and MPM II24 systems in a major tertiary care hospital in Riyadh, Saudi Arabia.


The following data were collected prospectively on all consecutive patients admitted to the Intensive Care Unit between 1 March 1999 and 31 December 2000: demographics, APACHE II and SAPS II scores, MPM variables, ICU and hospital outcome. Predicted mortality was calculated using original regression formulas. Standardized mortality ratio (SMR) was computed with 95% confidence intervals (CI). Calibration was assessed by calculating Lemeshow–Hosmer goodness-of-fit C statistics. Discrimination was evaluated by calculating the Area Under the Receiver Operating Characteristic Curves (ROC AUC).


Predicted mortality by all systems was not significantly different from actual mortality [SMR for MPM II0: 1.00 (0.91–1.10), APACHE II: 1.00 (0.8–1.11), SAPS II: 1.09 (0.97–1.21), MPM II24 0.92 (0.82–1.03)]. Calibration was best for MPM II24 (C-statistic: 14.71, P = 0.06). Discrimination was best for MPM II0 (ROC AUC:0.85) followed by MPM II24 (0.84), APACHE II (0.83) then SAPS II (0.79).


In our ICU population: 1) Overall mortality prediction, estimated by standardized mortality ratio, was accurate, especially for MPM II0 and APACHE II. 2) MPM II24 has the best calibration. 3) SAPS II has the lowest calibration and discrimination. The local performance of MPM II24 in addition to its ease-to-use makes it an attractive model for mortality prediction in Saudi Arabia.


Mortality prediction systems have been advocated as means of evaluating the performance of intensive care units (ICUs) [1]. These systems allow adjustment to the severity of illness of the patient population. Acute Physiology and Chronic Health Evaluation (APACHE) II and Simplified Acute Physiology Score (SAPS) II measure severity of illness by a numeric score [2,3] based on physiologic variables selected because of their impact on mortality: the sicker the patient, the more deranged the values and the higher the score. The numeric scores are then converted into predicted mortality by using a logistic regression formula developed and validated on populations of ICU patients. Mortality Probability Models (MPM) II differ slightly in that they use categorical variables (with the exception of age) for mortality prediction [4].

Before the clinical application of any of these systems, they must be validated on the population under evaluation [5,6]. These systems have been assessed for validity in several countries [7,8,9]. We report here the result of our validation study of the four systems in ICU population in a tertiary care center in Saudi Arabia.


King Fahad National Guard Hospital is a 550-bed tertiary care center in Riyadh, Saudi Arabia. The 12-bed medical–surgical ICU has 600 admissions a year. The hospital also has a coronary care unit and a cardiac surgical intensive care unit. Patients admitted to these units were not included in the study. The unit is run by full-time intensivists and has 24-hour immediate access to other medical and surgical specialties. Our nurse-to-patient ratio is approximately 1:1.2. This high ratio has been maintained because of the high acuity of care. Our ICU database was established in March 1999 to record ICU admissions. The present study presents information on all consecutive admissions between 1 March 1999 and 31 December 2000. Data were collected by one of the inten-sivists (Y.A., S.H. or R.G.). To minimize variability in data collection, one physician (Y.A.) coordinated the overall process. In addition, a written reference with the definitions used in original articles was made. Patients aged 16 years or more were eligible for the study with the exception of burn and brain-dead patients. For patients admitted to the ICU more than once in the same hospitalization, data from the first admission were used. Approval from the hospital Ethics Committee was not required because the information had already been collected for clinical reasons.

The following data were collected: demographics, APACHE II and SAPS II scores, and MPM variables. MPM II0 data were obtained on all admissions, whereas MPM II24, APACHE II and SAPS II data were obtained on patients who stayed for 24 hours or more in ICU. APACHE II and SAPS II scores were calculated in accordance with the original methodology, using the worst physiologic values in the first ICU day [2,3]. The only exception was Glasgow Coma Score (GCS). Many of these patients were under the influence of sedation and the worst GCS would reflect the effect of sedation more than the true underlying mental status. We therefore used the worst GCS value for non-sedated patients and the pre-sedation score for patients under sedation, as described previously [4,10,11,12]. The main reason for ICU admission, whether the admission was after emergency surgery, and the presence of severe chronic illness were documented in accordance with the original definitions [2]. Postoperative patients with sepsis or cardiac arrest were included with non-operative patients with these conditions [2]. ICU and hospital length of stay (LOS) and lead time (the interval from hospital admission to ICU admission) were calculated. Vital status at discharge from the ICU and from the hospital was registered.

Predicted hospital mortality was calculated with the logistic regression formulas described originally [2,3,4]. Standardized mortality ratio (SMR) was calculated by dividing observed hospital mortality by the predicted hospital mortality. The 95% confidence intervals (CIs) for SMRs were calculated by regarding the observed mortality as a Poisson variable, then dividing its 95% CI by the predicted mortality [7].

Validation of the systems was tested by assessing calibration and discrimination. Calibration (the ability to provide risk estimate corresponding to the observed mortality) was assessed by calibration curves and the Lemeshow–Hosmer goodness-of-fit C-statistic [11]. Calibration curves were drawn by plotting predicted against actual mortality for groups of the patient population stratified by 10% increments of predicted mortality. To calculate the C-statistic, the study population was stratified into ten deciles with approximately equal numbers of patients. The predicted and actual number of survivors and non-survivors were compared statistically with the use of formal goodness-of-fit testing to determine whether or not the discrepancy was statistically insignificant (P > 0.05).

Discrimination was tested by receiver operating characteristic (ROC) curves and 2 × 2 classification matrices. ROC curves were constructed as a measure of assessing discrimination with 10% stepwise increments in predicted mortality [14,15]. The four curves were compared by computing the areas under the curves. Classification matrices were performed at decision criteria of 10%, 30% and 50%. Sensitivity, specificity, positive and negative predictive values and overall correct classification rate were calculated.

Minitab for Windows (Release 12.1, Minitab Inc.) was used to perform statistics. Continuous variables were expressed as means ± SD and were compared by standard t-test. Categorical values were expressed in absolute and relative frequencies and were analyzed by χ2 test. Linear regression and logistic regression analysis were used when appropriate. P ≤ 0.05 was considered significant.


During the study period there were 1084 admissions to the ICU. Excluded patients were 94 re-admissions, 6 brain-dead patients and 15 with incomplete data.

Patient population

The demographics of the 969 eligible patients are shown in Table 1. It is noteworthy that 32% of all patients had one or more severe chronic illnesses. Severe hepatic disease was the leading chronic illness, followed by immunosuppression, severe respiratory illness, renal illness and cardiovascular illness. Some patients had more than one severe chronic illness. In comparison with survivors, non-survivors were older, had a longer lead time and ICU LOS but shorter hospital LOS. They had higher APACHE II and SAPS II scores. The type of admission was more likely to be medical or emergency surgical in non-survivors, and to be elective surgical in survivors. Severe chronic illness was more common in non-survivors, especially liver disease.

Table 1 Patients' demographics

Mortality predicted by the four systems

Table 2 shows the actual mortality and predicted mortality by all four systems, and also SMRs and their 95% CIs. SMRs for all system were not significantly different from 1, indicating accurate overall mortality prediction.

Table 2 Mortalities predicted by the four systems

Predicted mortality by subgroups

When subgrouped by categories of main reasons for ICU admission (Table 3), SMRs by the four systems were not significantly different from 1, with few exceptions. MPM II0 and MPM II24 overestimated mortality for the category 'non-operative trauma' admissions, and SAPS II underestimated mortality for the category 'other medical'. When subgrouped by the source of admission, all SMRs were not significantly different from 1 with the exception of the underestimation of mortality of SAPS II for patients admitted from the floor.

Table 3 Standardized mortality ratios for the four systems by reason of admission and source of admission


Calibration curves are shown in Figure 1. Of the four systems, SAPS II stands out as deviating from the identity line in most of the strata, including those with large numbers of patients. The SAPS II curve shows an underestimation of mortality in low-risk patients and an overestimation of mortality in high-risk patients, leading to the skewed appearance of its calibration curve. The curves of the other three systems fell on the identity line in the strata with large number of patients and deviated in some other strata.

Figure 1

Calibration curves for the four mortality prediction systems.

The results of Lemeshow–Hosmer goodness-of-fit tests are shown in Table 4. The C-statistic was best for MPM II24 (14.71), with P = 0.06. For the other three systems, calibration tested by the C-statistic was poor. These results indicate that, of the four systems, MPM II24 had the least statistically significant discrepancy between predicted and observed mortality across the strata of increasing predicted mortality.

Table 4 Hosmer-Lemeshow goodness-of-fit tests


Figure 2 shows the receiver operating characteristic (ROC) curves for the four systems. The corresponding areas under the curves were as follows: MPM II0, 0.85; MPM II24, 0.84; APACHE II, 0.83; SAPS II, 0.79. These reflect the better discriminative power of the first three systems than that of SAPS II.

Figure 2

Receiver operating characteristic (ROC) curves for the four systems.

The results of the 2×2 classification matrix are shown in Table 5. The overall correct classification rate was highest for MPM II0 and lowest for SAPS II at the different cutoff points; MPM II24 and APACHE II had intermediate rates. This was consistent with the results of ROC curve testing.

Table 5 Classification matrix for the four mortality prediction systems

Correlation of predicted mortalities by the four systems

On the basis of linear regression analysis, mortalities predicted by all four systems correlated with each other (P < 0.001 for all combinations). The closest correlation was between MPM II0 and MPM II24 (r2 = 0.67) followed by APACHE II and SAPS II24 (r2 = 0.66), MPM II24 and SAPS II (r2 = 0.62), MPM II24 and APACHE II (r2 = 0.56), MPM II0 and SAPS II (r2 = 0.52) and MPM II0 and APACHE II (r2 = 0.48). Figure 3 shows plots for the highest and lowest correlations.

Figure 3

Plots of predicted mortality of the systems with the highest intersystem correlation (MPM II0 versus MPM II24, left) and the lowest intersystem correlation (MPM II0 versus APACHE II, right).

The effect of lead time and ICU LOS on hospital mortality

By univariate analysis, lead time was a significant predictor of hospital outcome (odds ratio 1.02, 95% CI 1.00–1.03 per day, P = 0.04). However, when adjusted to the severity of illness estimated from the mortality predicted by any of the four systems, lead time was not an independent predictor of hospital outcome. Similarly, ICU LOS was a significant predictor of hospital outcome by univariate analysis (odds ratio 1.02, 95% CI 1.00–1.03 per day, P = 0.01) but not when adjusted to severity of illness by multivariate analysis.


The main findings of this validation study on a Saudi Arabian ICU population can be summarized as follows: (1) overall mortality prediction, estimated by SMR, was reasonably accurate, especially for MPM II0 and APACHE II; (2) MPM II24 had the best calibration by C-statistic; (3) SAPS II had the lowest calibration and discrimination.

There is great international variability in patient mix and severity of illness [7,8,9,16,17,18]. Some of the differences are inherent in the patient population. For example, patients with cirrhosis have a high severity of illness and poor prognosis when admitted to the ICU [19,20,21]. Case mix is also affected by the type of hospital, for example whether it is primary or tertiary, or a transplant or trauma center. Patients referred from other hospitals have a higher severity of illness and mortality compared with direct admissions [22]. Other factors are practice-related. An important example is the 'do not-resuscitate' (DNR) practice [23,24]. Early (pre-ICU) determination of DNR status reduces the number of futile admissions to the ICU leading to a probable reduction in overall severity of illness. Another example is related to ICU bed availability. It has been documented that physicians tend to be more selective in their ICU admissions at times of bed shortages, with patients with higher severity of illness being admitted [25]. In our study, the relatively high level of severity of illness was probably related to a combination of all these factors. Being a tertiary care center, our hospital receives referrals from other hospitals, some of them directly to ICU (Table 2). Being a transplant center, we have a large population of end-stage liver disease. The concept of DNR is evolving in Saudi Arabia. Our hospital has been leading in the country in this field by establishing a written policy for DNR orders and raising the awareness among physicians about this issue. Nevertheless, there is room for improvement. Frequently the DNR status is discussed at a very late stage or is not discussed at all before admission to the ICU. Consequently, many desperately ill patients are admitted to the ICU with very high severity of illness and no meaningful chance of survival. The nursing shortage in our ICU led to selecting the sicker admissions and decreasing the number of elective admissions.

In this study we showed that the overall mortality prediction was accurate but calibration was inadequate. Potential reasons for insufficient calibration might include the following: (1) factors related to the calibration methodology itself; (2) reasons related to data collection, namely intra-observer and interobserver variability; (3) variability in GCS; (4) differences in case mix; (5) differences in DNR policies; and (6) differences in medical care.

The results of Lemeshow–Hosmer statistics are dependent not only on the calibration of the model but also on patient numbers and the distribution of the estimates [26]. The results of the calibration tests in our study might seem inconsistent with the overall estimate of mortality of the whole population and with the mortality estimates of the subgroups (Table 2). This apparent inconsistency is probably related to the distribution of estimates. The overestimation of mortality in certain strata of severity of illness is 'counterbalanced' by underestimation in other strata (Table 4), leading to 'perfect' estimation when the whole population or a specific subgroup is considered.

Interobserver variability in data collection has been documented in several studies [27,28,29]. This is potentially relevant to our study because data collection was performed by several physicians and over a relatively long period (22 months). We tried to minimize variability by having a written reference of definitions based on the original articles of the various scoring systems and having one person coordinate the process of data collection. Similar approaches have been shown to minimize variability [30,31]. MPM II systems have been documented to have high reproducibility, which might explain the better calibration of MPM II24 in our study [32].

The variability of GCS determination in sedated patients accounts for much of the variability in scoring APACHE II. Several approaches in determining GCS have been used previously. For non-sedated patients we used the worst value in the first 24 hours; for sedated patients we used the pre-sedation GCS. This approach has been shown to be associated with better performance of APACHE II than the approach that assumes normal GCS for sedated patients [10]. This approach also follows the original MPM II article definitions [4] and is consistent with the approach described by Knaus and others [11,12].

Another potential reason for the inadequate calibration is the differences in case mix between our database and the development databases of the mortality prediction systems. Medical patients constitute a larger proportion in our database (68%) than in the development databases (MPM II0, 45%; MPM II24, 48%; SAPS II, 49%; APACHE II, 58%) [2,3,4]. When the main diagnostic categories in our database are compared with those in the development database of APACHE II, some interesting differences appear. The post cardiac arrest category, which is associated with a high mortality risk (APACHE II diagnostic category weight is 0.393) accounts for 7% of our admissions, compared with 3% in the developmental database of APACHE II. The postoperative category 'peripheral vascular surgery', which is associated with a low mortality risk (APACHE II diagnostic weight is -1.315) accounts for 1.5% of our admissions, compared with 9.82% of the development database. Our database also has more severe chronic illnesses (32%) compared with 5–29% in different participating ICUs in the APACHE II development database. This is partly related to the high percentage of patients with end-stage liver disease admitted to our ICU (12.5% of all patients).

Differences in admission practises to the ICUs might also have an impact on ICU outcome. The delay in DNR orders is mentioned earlier. This factor probably contributes to our high severity of illness and could have affected system calibration.

Finally, one must examine whether the inadequate calibration is related to differences in medical care. However, in our study, overall actual mortality was not different from predicted mortality, as is evident in the SMRs (Table 2). Furthermore, when the calibration curves are examined there is no consistent pattern of overestimation or underestimation of mortality among the four systems in the different strata of severity of illness. This suggests that the inadequate calibration is inherent in these systems when applied to this population and less likely to be related to gross variation in medical care.

Our study has some limitations. First, it is a single-center study, which makes it biased towards a certain case mix. Second, as discussed above, collecting the data over a relatively long period by different physicians has implications for consistency of data collection. The use of a written reference of definitions and assigning a coordinator to oversee the whole process probably decreased variability, as has been shown previously [28,29]. In addition, variability has been found to be random and of little impact on the overall estimates [29]. A multicenter study would have addressed some of these concerns.

In conclusion, the four mortality prediction systems gave accurate overall estimates of mortality, especially MPM II0 and APACHE II. Calibration was modest for MPM II24 and inadequate for the others. SAPS II had the lowest calibration and discrimination. The local performance of MPM II systems (particularly MPM II24), in addition to their ease of use, makes them attractive models for use in Saudi Arabia. However, a multi-center national study is needed to confirm these findings.



APACHE = Acute Physiology and Chronic Health Evaluation


CI = confidence interval


DNR = 'do not resuscitate'


GCS = Glasgow Coma Score


ICU = intensive care unit


LOS = length of stay


MPM = Mortality Probability Model


ROC = receiver operating characteristic


SAPS = Simplified Acute Physiology Score


SMR = standardized mortality ratio.


  1. 1.

    Knaus WA, Wagner DP, Zimmerman JE, Draper EA: Variations of mortality and length of stay in intensive care units. Ann Intern Med 1993, 118: 753-761.

  2. 2.

    Knaus WA, Draper EA, Wagner DP, Zimmerman JE: APACHE II: a severity of disease classification system. Crit Care Med 1985, 13: 818-829.

  3. 3.

    Le Gall J-R, Lemeshow S, Saulnier F: A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multi center study. JAMA 1993, 270: 2957-2962. 10.1001/jama.270.24.2957

  4. 4.

    Lemeshow S, Teres D, Klar J, Avrunin JS, Gehlbach SH, Rapoport J: Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993, 270: 2478-2486. 10.1001/jama.270.20.2478

  5. 5.

    Lemeshow S, Le Gall J-R: Modeling the severity of illness of ICUpatients. JAMA 1994, 272: 1049-1055. 10.1001/jama.272.13.1049

  6. 6.

    Teres D, Lemeshow S: Why severity models should be used with caution. Crit Care Clin 1994, 10: 93-110.

  7. 7.

    Goldhill DR, Sumner A: Outcome of intensive care patients in a group of British intensive care units. Crit Care Med 1998, 26: 1337-1345. 10.1097/00003246-199808000-00017

  8. 8.

    Markgraf R, Deutschinoff G, Pientka L, Scholten T: Comparison of Acute Physiology and Chronic Health Evaluations II and III and Simplified Acute Physiology Score II: a prospective cohort study evaluating these methods to predict outcome in a German interdisciplinary intensive care unit. Crit Care Med 2000, 28: 28-33.

  9. 9.

    Parikh CR, Kanad DR: Quality, cost and outcome of intensive care in a public hospital in Bombay, India. Crit Care Med 1999, 27: 1754-1759. 10.1097/00003246-199909000-00009

  10. 10.

    Livingston BM, Mackenzie SJ, MacKirdy FN, Howie JC: Should the pre-sedation Glasgow Coma Scale value be used when calculating Acute Physiology and Chronic Health Evaluation scores for sedated patients? Crit Care Med 2000, 28: 389-394. 10.1097/00003246-200002000-00017

  11. 11.

    Knaus WA: Measuring the Glasgow Coma Scale in the intensive care unit: potentials and pitfalls. Intens Care World 1994, 11: 102-103.

  12. 12.

    Teoh LS, Gowardman JR, Larsen PD, Green R, Galletly DC: Glasgow Coma Scale: variation in mortality among permuta-tions of specific total scores. Intensive Care Med 2000, 26: 157-161. 10.1007/s001340050039

  13. 13.

    Lemeshow S, Hosmer DW: A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 1982, 115: 92-106.

  14. 14.

    Metz CE: Basic principles of ROC analysis. Semin Nucl Med 1978, 8: 283-298.

  15. 15.

    Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143: 29-36.

  16. 16.

    Rowan KM, Kerr JH, Major E, McPherson K, Short A, Vessey MP: Intensive Care Society's APACHE II study in Britain and Ireland. I. Variations in case mix of adult admissions to general intensive care units and impact on outcome. Br Med J 1993, 307: 972-977.

  17. 17.

    Vincent JL, Thijs L, Cerny V: Critical care in Europe. Crit Care Clin 1997, 13: 245-254.

  18. 18.

    Zimmerman JE, Wagner DP: Prognostic systems in intensive care: how do you interpret an observed mortality that is higher than expected? Crit Care Med 2000, 28: 258-260. 10.1097/00003246-200001000-00048

  19. 19.

    Singh N, Gayowski T, Wagener MM, Marino IR: Outcome of patients with cirrhosis requiring intensive care unit support: prospective assessment of predictors of mortality. J Gastroenterol 1998, 33: 73-79. 10.1007/s005350050047

  20. 20.

    Shellman RG, Fulkerson WJ, DeLong E, Piantadosi CA: Prognosis of patients with cirrhosis and chronic liver disease admitted to the medical intensive care unit. Crit Care Med 1988, 16: 671-678.

  21. 21.

    Lee H, Hawker FH, Selby W, McWilliam DB, Herkes RG: Intensive care treatment of patients with bleeding esophageal varices: results, predictors of mortality, and predictors of the adult respiratory distress syndrome. Crit Care Med 1992, 20: 1555-1563.

  22. 22.

    Flabouris A: Patient referral and transportation to a regional tertiary ICU: patient demographics, severity of illness and outcome comparison with non-transported patients. Anaesth Intens Care 1999, 27: 385-390.

  23. 23.

    Alemayehu E, Molloy DW, Guyatt GH, Singer J, Penington G, Basile J, Eisemann M, Finucane P, McMurdo ME, Powell C, et al.: Variability in physicians' decisions on caring for chronically ill elderly patients: an international study. Can Med Assoc J 1991, 144: 1133-1138.

  24. 24.

    Rapoport J, Teres D, Lemeshow S: Resource use implications of Do Not Resuscitate orders for intensive care unit patients. Am J Respir Crit Care Med 1996, 153: 185-190.

  25. 25.

    Strauss MJ, Lo Gerfo JP, Yeltatzie JA, Temkin N, Hudson LD: Rationing of intensive care unit services. An everyday occurrence. JAMA 1986, 255: 1143-1146. 10.1001/jama.255.9.1143

  26. 26.

    Cook DA: Performance of APACHE III models in an Australian ICU. Chest 2000, 118: 1732-1738. 10.1378/chest.118.6.1732

  27. 27.

    Polderman KH, Thijs LG, Girbes AR: Interobserver variability in the use of APACHEII scores. Lancet 1999, 353: 380.

  28. 28.

    Goldhill DR, Summer A: APACHE II, data accuracy and outcome prediction. Anaesthesia 1998, 10: 937-943. 10.1046/j.1365-2044.1998.00534.x

  29. 29.

    Chen LM, Martin CM, Morrison TL, Sibbald WJ: Interobserver variability in data collection of the APACHE II score in teaching and community hospitals. Crit Care Med 1999, 27: 1999-2004. 10.1097/00003246-199909000-00046

  30. 30.

    Polderman KH, Jorna EM, Girbes AR: Inter-observer variability in APACHE II scoring: effect of strict guidelines and training. Intensive Care Med 2001, 27: 1365-1369. 10.1007/s001340101012

  31. 31.

    Polderman KH, Girbes AR, Thijs LG, Strack van Schijndel RJ: Accuracy and reliability of APACHE II scoring in two intensive care units: problems and pitfalls in the use of APACHE II and suggestions for improvement. Anaesthesia 2001, 56: 47-50. 10.1046/j.1365-2044.2001.01763.x

  32. 32.

    Rue M, Valero C, Quintana S, Artigas A, Alvarez M: Interobserver variability of the measurement of the mortality probability models (MPM II) in the assessment of severity of illness. Intensive Care Med 2000, 26: 286-291. 10.1007/s001340051151

Download references

Author information

Correspondence to Yaseen Arabi.

Additional information

Competing interests

None declared.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Arabi, Y., Haddad, S., Goraj, R. et al. Assessment of performance of four mortality prediction systems in a Saudi Arabian intensive care unit. Crit Care 6, 166 (2002).

Download citation


  • intensive care
  • mortality
  • prediction
  • severity of illness