Can generic paediatric mortality scores calculated 4 hours after admission be used as inclusion criteria for clinical trials?

Introduction: Two generic paediatric mortality scoring systems have been validated in the paediatric intensive care unit (PICU). Paediatric RISk of Mortality (PRISM) requires an observation period of 24 hours, and PRISM III measures severity at two time points (at 12 hours and 24 hours) after admission, which represents a limitation for clinical trials that require earlier inclusion. The Paediatric Index of Mortality (PIM) is calculated 1 hour after admission but does not take into account the stabilization period following admission. To avoid these limitations, we chose to conduct assessments 4 hours after PICU admission. The aim of the present study was to validate PRISM, PRISM III and PIM at the time points for which they were developed, and to compare their accuracy in predicting mortality at those times with their accuracy at 4 hours.
Methods: All children admitted from June 1998 to May 2000 in one tertiary PICU were prospectively included. Data were collected to generate scores and predictions using PRISM, PRISM III and PIM.
Results: There were 802 consecutive admissions with 80 deaths. For the time points for which the scores were developed, observed and predicted mortality rates were significantly different for the three scores (P < 0.01) whereas all exhibited good discrimination (area under the receiver operating characteristic curve ≥0.83). At 4 hours after admission only the PIM had good calibration (P = 0.44), but all three scores exhibited good discrimination (area under the receiver operating characteristic curve ≥0.82).
Conclusions: Among the three scores calculated at 4 hours after admission, all had good discriminatory capacity but only the PIM score was well calibrated. Further studies are required before the PIM score at 4 hours can be used as an inclusion criterion in clinical trials.


Introduction
Adjustment to severity is considered important in clinical trials for ensuring comparability between groups. Generic mortality scoring systems for children admitted to intensive care units (ICUs) have been developed for use at specific time points in the ICU stay. Two systems have been validated in paediatric ICUs (PICUs): the Paediatric RISk of Mortality (PRISM) and the Paediatric Index of Mortality (PIM). The PRISM, which is used in PICUs worldwide, requires an observation period of 24 hours [1], and the updated PRISM III score [2] measures severity at two time points (12 and 24 hours) during the PICU stay. The PIM and the recently updated PIM2 scores are calculated 1 hour after admission [3,4]. The 12-24 hour period of observation has been a criticism levelled at the PRISM scoring system, and it has been speculated that it may diagnose rather than predict death [4,5]. With the PIM and PIM2 scores, the single measurement of values shortly after admission is susceptible to random variation [6] or may reflect a transient state resulting from interventions during transport [7].
Severity models have been used for time periods different from those for which the scores were developed [8]. In children with meningococcal septic shock, Castellanos-Ortega and coworkers [9] recorded the worst values for each variable included in the Glasgow Meningococcal Septicaemia Prognostic Score, the Malley score, and the PIM score over the first 2 hours in the PICU. Indeed, early identification of patients who could benefit from therapeutic interventions may be useful [9].
We hypothesized that an intermediate observation period (we arbitrarily chose a time point of 4 hours after PICU admission) would be a good compromise between two objectives: to take into account a short period of stabilization after a patient's admission to the PICU, and to obtain an accurate measure of illness severity in the PICU. To our knowledge, no study has evaluated the accuracy of generic paediatric scoring systems in predicting death for the whole PICU population at time periods different from those for which the scores were developed.
The aim of the present study was to externally validate the PRISM, PRISM III and PIM scores at their intended time points, and to compare their accuracy in predicting mortality at those times with their accuracy at a different time period, namely 4 hours after admission.

Methods
All consecutive patients admitted to our university hospital PICU from June 1998 through to May 2000 were included unless they met the following exclusion criteria: admission in a state requiring cardiopulmonary resuscitation without achieving stable vital signs for at least 2 hours; admission for scheduled procedures normally done in other hospital wards; prematurity; and age more than 18 years.
Standard documentation and training were provided to all PICU medical staff. Data were prospectively collected to generate scores and predictions for the time periods for which the scores were developed (i.e. PIM at 1 hour, PRISM at 24 hours, PRISM III at 12 hours, and PRISM III at 24 hours) and to generate scores and predictions for a different time point (i.e. 4 hours after admission) [1,2,4]. The PIM2 score was not evaluated because it had not yet been reported when we began the study. The outcome measure was death or survival at discharge from the PICU. The probabilities of death were calculated at different time points (Table 1). To generate a prediction for the PRISM III 4-hour score, we used the PRISM III 12-hour equation (1996 version). In order to compare observed with expected mortality and to estimate the calibration of the scores, a Hosmer-Lemeshow goodness-of-fit test with five degrees of freedom (df; we considered five classes of mortality probability: 0% to <1%, 1% to <5%, 5% to <15%, 15% to <30%, and ≥30%) was performed [1]. With this test, a P value greater than 0.05 indicates that the model is well calibrated; the greater the P value, the better the fit [10].
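The calibration test described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis actually run in SPSS: the function names (`chi2_sf`, `hosmer_lemeshow`) are hypothetical, the class boundaries are the five stated in the text, and the degrees of freedom are set to the number of classes as in the text. The chi-square tail probability is computed with a standard series expansion so that only the standard library is needed.

```python
import math

# Risk-class boundaries from the text: 0 to <1%, 1 to <5%, 5 to <15%,
# 15 to <30%, and >=30% predicted probability of death.
CLASS_EDGES = [0.0, 0.01, 0.05, 0.15, 0.30, math.inf]


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of
    freedom, via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0
    for n in range(1, 2000):
        term *= half_x / (a + n)
        total += term
        if term < 1e-15 * total:
            break
    lower = math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def hosmer_lemeshow(probs, died, edges=CLASS_EDGES):
    """Return (chi2, p_value, per_class_chi2) for fixed risk classes.

    probs: predicted probabilities of death (0..1)
    died:  1 for death, 0 for survival, in the same order
    """
    n_classes = len(edges) - 1
    counts = [0] * n_classes
    observed = [0] * n_classes
    expected = [0.0] * n_classes
    for p, d in zip(probs, died):
        k = next(i for i in range(n_classes) if edges[i] <= p < edges[i + 1])
        counts[k] += 1
        observed[k] += d
        expected[k] += p
    per_class = []
    for n, o, e in zip(counts, observed, expected):
        # Empty or degenerate classes contribute nothing in this sketch.
        if n == 0 or not 0.0 < e < n:
            per_class.append(0.0)
            continue
        # Deaths term plus survivors term: (O-E)^2/E + (O-E)^2/(n-E).
        per_class.append((o - e) ** 2 * (1.0 / e + 1.0 / (n - e)))
    chi2 = sum(per_class)
    return chi2, chi2_sf(chi2, df=n_classes), per_class
```

Under these assumptions, a returned P value above 0.05 would be read as adequate calibration, and the per-class list shows which risk classes contribute most to the total χ² value.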
The areas under the receiver operating characteristic curve (AUCs) and their standard errors were calculated to estimate the discrimination of the scores. An AUC ≥0.7 is generally considered acceptable, ≥0.8 as good, and ≥0.9 as excellent [11,12]. Standardized mortality ratios (SMRs) and their comparison to 1 were calculated [13]. To study the effect of length of stay on calibration and discrimination of the three scores, calibration was calculated each day and discrimination at days 5, 10 and 20 after admission. For a patient who had died on day x, the PICU outcome-dependent variable was considered as survival at day x-i (i = 1, 2, 3...).
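The discrimination and SMR measures, and the day-by-day recoding of outcome, can be illustrated with a short Python sketch. The function names are hypothetical, standard errors and the test of the SMR against 1 are omitted, and `outcome_at_day` reflects one reading of the rule above (a child who dies on day x is counted as a survivor at every earlier day).

```python
def auc(probs, died):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen non-survivor has a higher predicted risk than a randomly chosen
    survivor (ties count one half)."""
    deaths = [p for p, d in zip(probs, died) if d]
    survivors = [p for p, d in zip(probs, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = sum((pd > ps) + 0.5 * (pd == ps) for pd in deaths for ps in survivors)
    return wins / (len(deaths) * len(survivors))


def smr(probs, died):
    """Standardized mortality ratio: observed deaths / expected deaths,
    where expected deaths is the sum of predicted probabilities."""
    return sum(died) / sum(probs)


def outcome_at_day(died_in_picu, day_of_outcome, day):
    """Outcome as seen at `day` after admission: 1 (death) only if the child
    died in the PICU on or before that day, otherwise 0 (survival)."""
    return 1 if died_in_picu and day_of_outcome <= day else 0


# Illustrative values only, not study data.
example_probs = [0.10, 0.40, 0.35, 0.80]
example_died = [0, 0, 1, 1]
example_auc = auc(example_probs, example_died)
example_smr = smr(example_probs, example_died)
```

An SMR above 1 means more deaths were observed than the score predicted. The pairwise AUC is quadratic in the number of patients, which is acceptable for a cohort of a few hundred admissions.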
Statistical analyses were performed using the Statistical Program for Social Science (SPSS Inc., Chicago, IL, USA).

Results
For the time periods for which the scores were developed, the three scores had poor calibration (P < 0.01 for each), with large differences between the χ² goodness-of-fit test values (Table 2). We observed underestimations of mortality in the low mortality risk groups (risk 1% to <5% and risk 5% to <15%), and an overestimation in the group with very high risk for mortality (risk ≥30%). SMRs varied from 1.03 to 1.39, but only the SMR for the PIM 1-hour assessment was significantly greater than 1 (Table 3). All scores exhibited good discrimination (Table 2).
At 4 hours PIM had good calibration (P = 0.44). Conversely, both PRISM and PRISM III had poor calibration at 4 hours (P < 0.01), with significant differences between observed and predicted mortality (Tables 4 and 5). Expected mortality with PRISM and PRISM III underestimated the observed mortality in the groups at low risk for mortality. SMRs varied from 1.17 to 1.57, and were significantly greater than 1 except for the PIM 4-hour assessment (Table 5). All scores exhibited good discrimination (Table 4).
For the time points for which the scores were developed, study of the length of stay showed good calibration for the PIM 1-hour assessment between days 3 and 28, for the PRISM 24-hour assessment between days 11 and 22, for the PRISM III 12-hour assessment between days 51 and 58, and for the PRISM III 24-hour assessment between days 10 and 11 (Fig. 1a). For the different time point examined (i.e. 4 hours), study of the length of stay showed good calibration for PIM from day 4 until discharge, for PRISM between days 2 and 15, and for PRISM III between days 3 and 10 (Fig. 1b). Study of the length of stay also showed that the AUC for all scores, both for the time points for which the scores were developed (Fig. 2a) and at 4 hours (Fig. 2b), exceeded 0.80.
With regard to the poor calibration identified in some of the assessments, retrospective analysis of patients who died was performed for the classes of mortality probability for which the χ² value exceeded 2.5. A total χ² value below 11 was needed for a score to be considered statistically calibrated with the five classes of mortality probability (df = 5). For these deceased patients we analyzed length of stay and comorbidities (cancer, prematurity, and chronic cardiac, respiratory, neurological and digestive diseases). Chronic organ disease was defined as disease with or without organ failure, requiring multiple admissions (to paediatric department or day care center) and requiring supervision by a subspecialist in paediatrics. A χ² value over 2.5, which indicates a significant difference between observed and predicted probability of death in a mortality class, was observed for 55 deceased patients. In this subpopulation, the median length of stay was significantly different from that in the other 25 deceased patients (7 days versus 1 day, respectively; P < 0.001), and only seven (13%) had a pre-ICU cardiac massage versus 18 (72%) in the other deceased patients (P < 0.000001). In these 55 patients, only 6-11% of the above mentioned comorbidities were taken into account in the probability of death calculated with the different scores.

Discussion
In this single unit study, discrimination of the PIM, PRISM and PRISM III scores was good whereas calibration was poor for the time points for which the scores were developed. At 4 hours, only the PIM score had good discrimination and calibration.
Both discrimination and calibration must be considered when evaluating the performance of scoring systems [14].
Discrimination measures the predictive performance of scoring systems, and when the outcome is dichotomous it is usually described by a receiver operating characteristic curve. In the studies that compared the original PIM, PRISM and PRISM III scores, the AUCs were as follows: ≥0.7 for the PIM and PRISM III scores [15]; ≥0.8 for the PIM score, and ≥0.9 for the PRISM and PRISM III scores [16]; and between 0.83 and 0.87 for the pre-ICU PRISM, PIM and PRISM scores [5]. Those findings are similar to ours. However, for Zhu and coworkers [17] AUC was not as sensitive to differences in ICU care as the Hosmer-Lemeshow goodness-of-fit test.
Gemke and van Vught [15] [16], calibration of the PRISM score could be expected to be poor.
The previously reported miscalibration of the PRISM score [22,23] led Tilford and coworkers [24] to use a different set of coefficient estimates. When interpreting the calibration of the PRISM III score, the version selected must be considered. In the present study the PRISM III score was calculated using the 1996 version and not the 1999 one, which includes other variables that are not described in the first PRISM III report and, to our knowledge, have not been reported elsewhere [2].
In our study, as in that by Gemke and van Vught [15], the expected mortality underestimated the observed mortality in the group at low risk for mortality and overestimated it in the group at very high risk for mortality (>30%). Such discrepancies have been reported with both paediatric [23] and adult [25] generic scoring systems.
The length of stay was studied by Bertolini and coworkers [23] because the PRISM score could not correctly predict outcome. Those authors found a good calibration for patients with a length of stay of 4 days or less and a poor calibration in those patients who stayed for longer than 4 days. The present study showed that, for the time periods for which the scores were developed, the PIM score provided the earliest (from day 3) and longest (to day 28) calibration. For a different time point (i.e. 4 hours), the three scores were calibrated after a few days: day 2 for the PRISM 4-hour assessment, day 3 for the PRISM III 4-hour assessment, and day 4 for PIM 4-hour assessment; only the PIM 4-hour assessment was calibrated until discharge.

*Significantly greater than 1 (P < 0.0001 for PRISM at 4 hours and P = 0.0025 for PRISM III at 4 hours) [13]. CI, confidence interval; PIM, Paediatric Index of Mortality [4]; PRISM, Paediatric RISk of Mortality [1,2]; SMR, standardized mortality ratio.

Figure 1
Effect of the length of stay on calibration of the Paediatric Index of Mortality (PIM) [4], Paediatric RISk of Mortality (PRISM) and PRISM III scores [1,2].

Moreover, patient mortality is affected by demographic, physiological and diagnostic data, but it also depends on many other factors such as comorbidities, which did not appear to be accounted for sufficiently in our population. In the recently reported PIM2 [3], the numbers of diagnostic criteria (high risk and low risk diagnosis) and comorbidities have been increased. Discrepancies between discrimination and calibration have previously been discussed. In fact, PRISM score, Acute Physiology and Chronic Health Evaluation (APACHE) score, Mortality Probability Model (MPM) score and Simplified Acute Physiology Score (SAPS) were reported in several studies to exhibit good discrimination but poor calibration [23,25-29]. Unsatisfactory calibration of scores can be attributed to various factors, including poor performance of the medical system (if observed mortality is greater than predicted mortality) [23,25], differences in case mix [27] and mortality rate [30], as well as failure of the score equation to model the actual situation accurately [25].
The above mentioned paediatric studies did not give any information on the children's characteristics (case mix), which potentially could explain discrepancies between discrimination and calibration [2,15,23]. Indeed, the two studies using the additional variables of the PRISM III score [2,15] did not provide a clear description of their population. Important differences in case mix are reflected in mortality rates, which differed between PICUs (e.g. 4.8% for Pollack and coworkers [2], 6.6% for Gemke and van Vught [15] and 10.0% in the present study). The further the hospital mortality rate diverged from the original rate, the worse the performance of the model [17]. Goodness-of-fit tests are more sensitive than AUCs [17], and it has been suggested that, in the presence of good discrimination, bad calibration due to the source is correctable by using customization [31,32]. However, Diamond [33] demonstrated that perfect calibration and perfect discrimination cannot coexist; a perfectly calibrated model is not perfectly discriminatory because it has an AUC of only 0.83 rather than 1. Customization of a score is justified when the database on which it was developed is old and when the score is used in a specific population [24]. However, customization by a unit could lead to inability to evaluate (or compare) performance between units.
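One way to see how a value near 0.83 can arise is the following sketch. It assumes, purely for illustration, that the predicted probabilities of a perfectly calibrated model are spread uniformly between 0 and 1; whether this matches the derivation in the cited work is an assumption of this example, and the function name is hypothetical. Under that assumption the AUC tends to 5/6, or about 0.83.

```python
def auc_perfectly_calibrated_uniform(n_bins=1000):
    """AUC of a perfectly calibrated model whose predictions are uniformly
    spread over (0, 1) (an assumption of this sketch).

    At predicted risk p, a fraction p of patients die and 1 - p survive,
    so deaths are weighted by p and survivors by 1 - p.
    """
    grid = [(k + 0.5) / n_bins for k in range(n_bins)]
    total_deaths = sum(grid)
    total_survivors = sum(1.0 - p for p in grid)
    concordant = 0.0
    survivors_below = 0.0  # survivor weight at strictly lower predicted risk
    for p in grid:
        # Deaths at risk p outrank all survivors at lower risk and tie
        # (half credit) with survivors at the same risk.
        concordant += p * (survivors_below + 0.5 * (1.0 - p))
        survivors_below += 1.0 - p
    return concordant / (total_deaths * total_survivors)


limit_auc = auc_perfectly_calibrated_uniform()
```

With a fine grid the result approaches 5/6. A different distribution of predicted risks would give a different ceiling, so the figure should not be read as a universal bound.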
Is a score with poor calibration useful? If scores are used to assess quality of care, as estimated by SMR, then calibration, rather than discrimination, is the best measure of performance. It is also recognized that there are no formal means of directly comparing the χ² values derived from the goodness-of-fit test [30]. Our data and those reported by Livingston and coworkers [30] showed large differences in χ² goodness-of-fit test values between several scores. Thus, one can consider that a way to describe calibration of a score is to detail the χ² goodness-of-fit test values for different classes of mortality probability, which reflects exact prediction across the full range of severity (Tables 3 and 5) [18,20].
Stratification for inclusion of children in clinical trials remains an important problem in PICUs [6]. Scoring systems are used to compare or control for severity of illness in clinical trials and have been integrated into guidelines [6]. The question is, what kind of scoring system do we need if we are to include children in clinical trials? We probably need a score that represents well the patient's condition early after admission to the PICU. With this aim in mind, the PIM score appears superior to the PRISM and PRISM III scores. The PIM score takes into account the condition of the patient directly on arrival in the PICU (i.e. when the patient's condition is least affected by therapeutic intervention). The PRISM score requires an observation period of 24 hours, which represents a limitation of its use as an inclusion criterion in clinical trials. To date, no consensus has been reached as to which approach represents the 'gold standard' [7]. In order to minimize inclusion delay, Pollack and coworkers [2] proposed estimation of the probability of death using the PRISM III calculated 12 hours after admission. However, this delay is too long for serious diseases (e.g. meningococcal septic shock). In the present study the performance of PIM at 4 hours was better than at 1 hour. Thus, a 4-hour observation period seems to be a good compromise, allowing evaluation of the patient's clinical condition and permitting stabilization, without delaying inclusion in a therapeutic trial. We arbitrarily chose a period of 4 hours after PICU admission. Calculation of the scores at 3 or 5 hours would probably have yielded similar results.

Figure 2
Effect of the length of stay on discrimination of the Paediatric Index of Mortality (PIM) [...]

To our knowledge, no study has compared the performance of generic paediatric mortality scores calculated within a few hours of admission to the PICU. Castellanos-Ortega and coworkers [9] used a similar approach in a specific population of children with meningococcal septic shock by calculating one generic (PIM) and two specific scores 2 hours after PICU admission; the PIM 2-hour score was as discriminant (AUC 0.82) as their new score (AUC 0.92; P = 0.10) but exhibited poor calibration.

Conclusion
The present study indicates that, among generic scores calculated at 4 hours after admission and with good discriminatory capacity (i.e. AUC > 0.80), only the PIM 4-hour score was well calibrated. The updated PIM2, which takes into account new primary reasons for ICU admission and comorbidities, must be validated for the time point for which it was developed and at a different time point. Further studies are required before the PIM (or PIM2) 4-hour score can be used as an inclusion criterion for clinical trials.