Performance of the score systems Acute Physiology and Chronic Health Evaluation II and III at an interdisciplinary intensive care unit, after customization.

BACKGROUND
Mortality predictions calculated using scoring scales are often not accurate in populations other than those in which the scales were developed because of differences in case-mix. The present study investigates the effect of first-level customization, using a logistic regression technique, on discrimination and calibration of the Acute Physiology and Chronic Health Evaluation (APACHE) II and III scales.


METHOD
Probabilities of hospital death for patients were estimated by applying APACHE II and III and comparing these with observed outcomes. Using the split sample technique, a customized model to predict outcome was developed by logistic regression. The overall goodness-of-fit of the original and the customized models was assessed.


RESULTS
Of 3382 consecutive intensive care unit (ICU) admissions over 3 years, 2795 patients could be analyzed, and were split randomly into development and validation samples. The discriminative powers of APACHE II and III were unchanged by customization (areas under the receiver operating characteristic [ROC] curve 0.82 and 0.85, respectively). Hosmer-Lemeshow goodness-of-fit tests showed good calibration for APACHE II, but insufficient calibration for APACHE III. Customization improved calibration for both models, with a good fit for APACHE III as well. However, fit was different for various subgroups.


CONCLUSIONS
The overall goodness-of-fit of APACHE III mortality prediction was improved significantly by customization, but uniformity of fit in different subgroups was not achieved. Therefore, application of the customized model provides no advantage, because differences in case-mix still limit comparisons of quality of care.


Introduction
Scoring systems are used in intensive care to control for various case-mix factors in order to compare patient populations. Score-based predictions of mortality in ICU patients may be used for quality assurance and comparison of quality of care [1][2][3]. If a scoring system is intended to be used in a patient population that is different from the original population used in the development of the system (development sample), then it should be validated in this new population [4][5][6][7].
Calibration measures how closely mortality prognosis fits the observed mortality. Poor calibration in a patient sample does not necessarily mean that the quality of care in that particular ICU is better or worse than in the development sample.
Several clinical case-mix as well as nonclinical factors are not accounted for by such scoring systems [8]. The overall fit of a score in a particular patient sample can be improved by customization using logistic regression. This is possible for the whole population in question [9], but can also be done independently for specific subgroups [10,11]. In the latter case, the customized score can only be used in this subset of patients. If a customized model derived from the whole population is used, then uniformity of fit for the relevant subgroups should still be tested. Knowledge of the influence of subgroups is important, because future changes in case-mix may compromise the improvement achieved by customization.
The aim of the present study was to test the performance of APACHE II and III, after customization of these scales for use in future assessment of quality of care in our unit.

Patients
Over a 3-year period (October 1991-October 1994), 3382 patients were consecutively admitted to the 12-bed interdisciplinary ICU of a 571-bed, university-affiliated community hospital. For the APACHE II analysis, 274 patients who were readmitted to the ICU, 208 patients who were in the ICU for less than 4 h, 16 patients who were admitted for dialysis only, two patients who were younger than 16 years and 87 patients with missing data were excluded. Thus, 2795 patients were included in the analysis. For APACHE III, 79 patients who were admitted to rule out myocardial infarction and 55 cardiosurgical patients were additionally excluded, leaving 2661 for analysis.
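The exclusion counts above can be checked with a few lines of arithmetic. The sketch below only restates the numbers reported in this section; the variable names are illustrative and not part of the study.

```python
# Patient counts as reported in the Patients section.
admitted = 3382

# Exclusions applied for the APACHE II analysis.
apache2_exclusions = {
    "readmitted to the ICU": 274,
    "ICU stay shorter than 4 h": 208,
    "admitted for dialysis only": 16,
    "younger than 16 years": 2,
    "missing data": 87,
}

# Further exclusions applied for the APACHE III analysis.
apache3_extra_exclusions = {
    "admitted to rule out myocardial infarction": 79,
    "cardiosurgical patients": 55,
}

apache2_n = admitted - sum(apache2_exclusions.values())
apache3_n = apache2_n - sum(apache3_extra_exclusions.values())
print(apache2_n, apache3_n)  # prints: 2795 2661
```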

Data collection
Data collection was done according to the criteria and definitions described by the developers of APACHE II and III [12,13]. The data were collected by ward doctors after 4 weeks of training in how to use the APACHE system. They had access to a detailed manual, including definitions and procedures. Constant supervision by a documentation assistant included regular comparison of the original with the collected data, and review of completeness. In order to assess reliability of data collection, data from a random sample of 50 patients were recorded by two data collectors independently. Interobserver reliability was analyzed by Kendall's coefficient of concordance and κ statistics. In addition, data collection software, which was provided by APACHE Medical Systems Inc (Washington, DC), automatically checked that the data were plausible. The whole data set was tested using a box-plot technique in order to analyze extreme values separately. Vital status at hospital discharge was recorded.

Statistical analysis
The sample was split randomly into a development sample (n = 1863 for APACHE II and n = 1772 for APACHE III) and a validation sample (n = 932 for APACHE II and n = 889 for APACHE III). Development of the original model by logistic regression [14] led to the following equation:

$$\mathrm{logit(or)} = \beta_0 + \sum_i \beta_i x_i \qquad (1)$$

where $\beta_0$ is a constant, $\beta_i$ are coefficients, and $x_i$ encompasses the various patient factors that are included in the model. The probability of hospital death is calculated as follows:

$$P(\mathrm{or}) = \frac{e^{\mathrm{logit(or)}}}{1 + e^{\mathrm{logit(or)}}} \qquad (2)$$

In the present study the APACHE II equation was used as indicated by the developers [12]. The APACHE III equation was provided by APACHE Medical Systems Inc, and it has not been published for commercial reasons. In customizing the scales, the original logit was used as the independent variable and hospital death was used as the dependent variable. The new probability of hospital death was calculated as follows:

$$P(\mathrm{cust}) = \frac{e^{\mathrm{logit(cust)}}}{1 + e^{\mathrm{logit(cust)}}} \qquad (3)$$

and logit(cust) is calculated as follows:

$$\mathrm{logit(cust)} = \beta_{c0} + \beta_{c1} \cdot \mathrm{logit(or)} \qquad (4)$$

where $\beta_{c0}$ is the constant and $\beta_{c1}$ the coefficient derived by logistic regression. The customized coefficients were calculated to be those shown in Table 1.
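First-level customization as described above amounts to a logistic regression with a single predictor, the original logit. The following Python sketch illustrates the idea under stated assumptions. The function names are hypothetical, the Newton-Raphson fit is only one possible way to estimate the two coefficients (the study itself used SPSS), and no coefficients from Table 1 are reproduced here.

```python
import math


def probability_from_logit(logit):
    """Convert a logit to a probability: e^logit / (1 + e^logit)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))  # stable form for large logits
    z = math.exp(logit)
    return z / (1.0 + z)


def customized_probability(logit_or, beta_c0, beta_c1):
    """Apply a first-level customization to an original logit."""
    logit_cust = beta_c0 + beta_c1 * logit_or
    return probability_from_logit(logit_cust)


def fit_first_level(logits_or, died, iterations=25):
    """Estimate (beta_c0, beta_c1) by Newton-Raphson (illustrative only).

    logits_or: original logits in the development sample
    died:      1 for hospital death, 0 for survival
    """
    b0, b1 = 0.0, 1.0  # start from the uncustomized model
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(logits_or, died):
            p = probability_from_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the constant
            g1 += (y - p) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

With beta_c0 = 0 and beta_c1 = 1 the customized probability equals the original one, which is why these values make a natural starting point for the fit.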
Discrimination and calibration were analyzed for the original and customized models. Discriminative power was tested by calculating the areas under the ROC curves [15], and calibration was assessed using the standardized mortality ratio (SMR; observed deaths/expected deaths), with 95% confidence intervals [16], and using the Hosmer-Lemeshow goodness-of-fit H and C tests [17]. The development and validation samples were also compared. Statistical analysis was performed using the SPSS 6.1 software package (SPSS, Chicago, IL, USA). P < 0.05 was considered statistically significant.
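The measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure used in the study. The function names are hypothetical, the SMR is shown without its confidence interval, and only the H variant of the Hosmer-Lemeshow statistic (fixed risk bands of equal width) is shown; the C variant instead groups patients into equally sized groups by ranked risk.

```python
def roc_auc(probs, died):
    """Area under the ROC curve as the share of (death, survivor) pairs
    in which the death received the higher predicted risk (ties count 0.5)."""
    deaths = [p for p, y in zip(probs, died) if y == 1]
    survivors = [p for p, y in zip(probs, died) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p_death in deaths:
        for p_surv in survivors:
            if p_death > p_surv:
                wins += 1.0
            elif p_death == p_surv:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def standardized_mortality_ratio(probs, died):
    """Observed deaths divided by expected deaths (sum of predicted risks)."""
    return sum(died) / sum(probs)


def hosmer_lemeshow_h(probs, died, groups=10):
    """Hosmer-Lemeshow H statistic over fixed, equally wide risk bands."""
    n = [0] * groups
    observed = [0] * groups
    expected = [0.0] * groups
    for p, y in zip(probs, died):
        g = min(int(p * groups), groups - 1)  # risk of 1.0 goes in the top band
        n[g] += 1
        observed[g] += y
        expected[g] += p
    statistic = 0.0
    for g in range(groups):
        if n[g] == 0:
            continue  # empty bands contribute nothing
        variance = expected[g] * (1.0 - expected[g] / n[g])
        if variance > 0.0:
            statistic += (observed[g] - expected[g]) ** 2 / variance
    return statistic
```

The statistic is then compared with a chi-square distribution; a large value indicates that observed and expected deaths differ by more than chance across the risk bands.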

Results
Completeness of data was good; excluding just one variable (24-h urine), 94.6% of all necessary data were collected on average for each patient; 24-h urine was available in only 78.1% of patients. Reliability analysis revealed Kendall's coefficients for clinical and laboratory data above 0.9 except for blood gas values (0.878) and 24-h urine (0.870). κ values were low only for diagnosis of renal failure (0.49) and Glasgow Coma Scale score (0.54). Despite that, differences in calculated scores were very low, with Kendall's coefficients above 0.92. Thus, overall reliability of data collection was good.
Demographic and clinical characteristics were very similar for the development and validation samples (Table 2), and no significant differences were detected. Both models showed good discrimination, which was unchanged by customization.
The original APACHE II prediction calibrated adequately in the patients studied, with minor improvements after customization. APACHE III originally showed inadequate calibration, which was considerably improved by customization, and was adequate afterwards (Table 3). The calibration curves (Fig. 1) reveal that calibration after customization was good for APACHE II up to the 70-80% mortality risk decile, but was still far from ideal for APACHE III. When interpreting the greater deviations from the ideal line in the 80-90% and 90-100% deciles, the small numbers of cases in these groups have to be borne in mind.
Results for the subgroups are shown in Table 4 for APACHE II and in Table 5 for APACHE III. Fit was not uniform for APACHE II, with varying SMRs. Goodness-of-fit was insufficient for patients younger than 65 years and for those directly admitted. Although goodness-of-fit improved for most subgroups after customization, it was still not uniform. These findings were similar for APACHE III. Goodness-of-fit was insufficient for medical, younger, directly admitted and cardiovascular patients. Fit was improved for all but younger and transferred patients. However, it was still not uniform after customization.

Discussion
Customization of APACHE II and III in a large patient population from a single unit led to an improvement in the overall goodness-of-fit of APACHE III, which showed poor calibration in its original version. Despite a similar improvement of fit in several subgroups that were large enough to be tested, good uniformity of fit was not achieved.
These results are comparable with those of a large multicenter study [9] that analyzed customization of the Mortality Prediction Model. In that study, a second-level customization, in which new coefficients were developed for all single patient factors included in the original model, improved calibration even further. Second-level customization was not attempted in the present patient sample because there were not enough patients for that purpose. Time to collect data in a sufficiently large patient sample in a single unit would probably be so great that real changes in case-mix or ICU treatment might occur during the study, which would confound the results. First-level customization will probably be a more practical method for single units to improve the overall fit of score systems that are to be used for quality assessment.
Available online http://ccforum.com/content/5/1/031
At present, however, we would not recommend routine customization. This is because a major problem is still unresolved: although good calibration can be achieved for the whole patient sample, uniformity of fit remains unsatisfactory. This is the case even for APACHE III, which accounts for more case-mix factors, such as diagnostic categories and lead time, than do the other models. Nevertheless, achievement of uniformity is important, because change in case-mix over time will otherwise lead to a loss of accuracy of a customized model. It would be difficult to interpret whether a change in the mortality ratio over time would be due to a change in quality of care or in case-mix.
If a customized model still has a poor fit for a certain subgroup at a specific unit, then customization for this sample can be attempted separately [10,11]. This could be attempted in medical and cardiovascular patients at our unit for APACHE III, because these groups are sufficiently large and because general customization did not lead to a good fit. However, the practicality of such an approach is questionable.
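The case-mix concern raised in this Discussion can be illustrated with a small numerical sketch. All numbers below are hypothetical and chosen only for illustration; they are not taken from Tables 4 or 5. The sketch assumes two subgroups whose SMRs differ but stay constant over time, and shows that the overall SMR shifts when only the share of expected deaths in each subgroup changes.

```python
def overall_smr(subgroups):
    """Overall SMR from a list of (expected_deaths, subgroup_smr) pairs.

    Observed deaths in each subgroup are expected_deaths * subgroup_smr,
    so the overall SMR is a weighted average of the subgroup SMRs.
    """
    expected = sum(e for e, _ in subgroups)
    observed = sum(e * smr for e, smr in subgroups)
    return observed / expected


# Hypothetical subgroup SMRs of 0.8 and 1.2, identical in both periods.
baseline_mix = [(50.0, 0.8), (50.0, 1.2)]  # equal shares: overall SMR about 1.00
shifted_mix = [(20.0, 0.8), (80.0, 1.2)]   # more of the second group: about 1.12

print(round(overall_smr(baseline_mix), 2))
print(round(overall_smr(shifted_mix), 2))
```

In this sketch the quality of care within each subgroup is unchanged, yet the overall ratio rises, which is the ambiguity that uniform fit across subgroups would avoid.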