Training in data definitions improves quality of intensive care data
© 2003 Arts et al., licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
Received: 15 October 2002
Accepted: 22 January 2003
Published: 18 February 2003
Our aim was to assess the contribution of training in data definitions and data extraction guidelines to improving quality of data for use in intensive care scoring systems such as the Acute Physiology and Chronic Health Evaluation (APACHE) II and Simplified Acute Physiology Score (SAPS) II in the Dutch National Intensive Care Evaluation (NICE) registry.
Before and after attending a central training programme, a training group of 31 intensive care physicians from Dutch hospitals who were newly participating in the NICE registry extracted data from three sample patient records. The 5-hour training programme provided participants with guidelines for data extraction and strict data definitions. A control group of 10 intensive care physicians, who had been trained according to the train-the-trainer principle at least 6 months before the study, extracted the data twice, without specific training in between.
In the training group the mean percentage of accurate data increased significantly after training for all NICE variables (+7%, 95% confidence interval 5%–10%), for APACHE II variables (+6%, 95% confidence interval 4%–9%) and for SAPS II variables (+4%, 95% confidence interval 1%–6%). The percentage data error due to nonadherence to data definitions decreased by 3.5% after training. Deviations from 'gold standard' SAPS II scores and predicted mortalities decreased significantly after training. Data accuracy in the control group did not change between the two data extractions and was equal to post-training data accuracy in the training group.
Training in data definitions and data extraction guidelines is an effective way to improve quality of intensive care scoring data.
Keywords: data extraction, data quality, medical registry, training program
A number of regional and (inter)national intensive care registries have been developed to enable quality assessment in intensive care, including the British Intensive Care National Audit & Research Centre (ICNARC) and the IMPACT project from the American Society of Critical Care Medicine. These registries enable evaluation of the effectiveness and efficiency of the care process. In 1996 the National Intensive Care Evaluation (NICE) registry was set up in The Netherlands for the same reason. A minimum data set is extracted from the patient record for every patient admitted to each of 21 intensive care units (ICUs) currently participating in the NICE registry. Scoring systems such as the Acute Physiology and Chronic Health Evaluation (APACHE) II [1] and the Simplified Acute Physiology Score (SAPS) II [2] are used to calculate a score for each patient based on the most abnormal data from the first 24 hours following intensive care admission; from this they quantify the severity of illness and calculate the corresponding probability of in-hospital mortality. As an indicator for quality assessment of intensive care, the observed mortality in the intensive care population is compared with the calculated case–mix corrected mortality in that population.
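The comparison of observed with predicted mortality can be sketched in a few lines of Python. This is a minimal illustration, not the NICE implementation: the logistic equation is the one published with SAPS II [2], the ratio of observed to expected deaths (often called a standardized mortality ratio) is a common way to express the comparison but is not named in this article, and the function names and sample data are hypothetical.

```python
import math


def saps2_mortality(score: float) -> float:
    """Predicted in-hospital mortality for a SAPS II score.

    Uses the logistic equation published with SAPS II [2]; a registry may
    apply a recalibrated model instead (assumption: none is applied here).
    """
    logit = -7.7631 + 0.0737 * score + 0.9971 * math.log(score + 1)
    return 1.0 / (1.0 + math.exp(-logit))


def standardized_mortality_ratio(died, predicted):
    """Observed deaths divided by expected deaths (sum of predicted risks)."""
    expected = sum(predicted)
    if expected == 0:
        raise ValueError("expected mortality is zero")
    return sum(died) / expected


# Hypothetical admissions: SAPS II score and hospital outcome (1 = died).
scores = [20, 35, 50, 65]
died = [0, 0, 1, 1]
predicted = [saps2_mortality(s) for s in scores]
print(round(standardized_mortality_ratio(died, predicted), 2))
```

Because the predicted risk rises steeply with the score, an extraction error of a few points in a single patient shifts the expected mortality, which is why data accuracy matters for this indicator.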
The use of such an intensive care registry for quality assessment strongly depends on the quality of registry data. Several studies have shown interobserver and intraobserver variability in calculation of severity-of-illness scores [3–10]. Fery-Lemonnier and coworkers [3] discussed some of the problems that cause inaccurate APACHE II data collection, including ambiguous definitions and complex calculations. Chen and coworkers [4] also cited lack of clear instructions concerning the timing of APACHE II data collection as a source of variability.
In order to improve quality of data, the NICE registry implemented a framework of quality assurance procedures [11]. As a part of this framework the NICE foundation has defined all variables. The NICE data definitions are present in a data dictionary that is available on the internet [12]. At least two physicians per ICU are obliged to attend a central training session organized by the NICE board, during which the data definitions are discussed. Physicians who have attended the central training session train their local staff. This is called the 'train-the-trainer' principle. The objective of the present study was to investigate whether this total training concept improves data quality and increases the validity of severity-of-illness scoring in the Dutch NICE registry.
Two intensive care physicians who were experienced in severity scoring and formerly involved in composing the NICE data definitions selected three sample patient cases. These cases were modified to include some potential pitfalls in data extraction (e.g. abnormal physiological values just before or 24 hours after admission to the ICU). The NICE dataset consists of 88 variables (37 categorical variables, 43 numerical variables, six date/time variables, and two strings). In order to reduce errors associated with identifying the worst value, NICE requires the lowest and the highest values recorded in the first 24 hours. Subsequently, a central computer algorithm selects the worst value. The standardized data definitions used in the NICE registry are in agreement with widely accepted data definitions used in the severity-of-illness scoring models (e.g. APACHE II and SAPS II) [1, 2, 13–15]. According to these definitions, the two physicians reached consensus on values for all data items for the three sample patient cases. These values were considered the 'gold standard'.
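The article does not describe the central algorithm, so the following Python sketch only illustrates the general idea of choosing the worst of the two recorded extremes. The point bands in `toy_points` are invented for the example and are not APACHE II or SAPS II bands; the function names are hypothetical.

```python
from typing import Callable


def select_worst(lowest: float, highest: float,
                 points: Callable[[float], int]) -> float:
    """Return whichever extreme earns more severity points.

    On a tie the lowest value is kept (an arbitrary choice for this sketch).
    """
    return highest if points(highest) > points(lowest) else lowest


def toy_points(value: float, low: float = 70, high: float = 109) -> int:
    """Hypothetical scoring: 0 inside the reference range, then one extra
    point for every started 20 units outside it."""
    if low <= value <= high:
        return 0
    distance = low - value if value < low else value - high
    return int(distance // 20) + 1


# Lowest and highest value recorded in the first 24 hours (made-up numbers).
print(select_worst(62, 150, toy_points))
```

Recording both extremes and leaving the choice to software removes one judgement from the extractor, who no longer has to decide which of the two values scores worse.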
Between February 1999 and May 2001, four central NICE training sessions took place, which were attended by a total of 31 participants. Training session participants were physicians from ICUs that intended to participate in the NICE registry. Each training session took approximately 5 hours. All training participants received a copy of the NICE data dictionary. During the sessions the definitions of all variables and the data extraction guidelines were discussed and practiced with some patient cases. These central training sessions were given by members of the NICE board who had been involved in the composition of the NICE data definitions and were highly experienced with severity-of-illness scoring systems. All training participants received photocopies of the records from the three selected sample patient cases. They were asked to extract the NICE data from these records into specially designed paper forms 1 week before attending the training session and within 1 month afterward.
In order to assess the effect on quality of data extracted for the same patient records twice (without training), photocopies of the records from the same three sample patient cases were issued to a control group. The control group consisted of 10 randomly selected physicians and registrars working in one of the ICUs that had been routinely extracting NICE data for several years. The control group had been locally instructed on data definitions and guidelines at least 6 months before the study, according to the 'train-the-trainer' principle, by one of the intensive care staff who had previously attended a central NICE training session. A copy of the NICE data dictionary was available to all control group members. The control group was asked to extract data from the sample patient records twice at an interval of 4–6 weeks without training in between.
After the first extraction both the training group and the control group were informed about the study design, implying that there would be a second data extraction 1 month after the first. Participants in the training group and in the control group did not receive their results for the first data extraction before they had completed the second.
Analysis of data quality
For both groups the recorded data were included only if a physician or registrar had extracted the data twice (before and after training, or for the first and the second data extractions). We analysed the accuracy of recorded data for three different data subsets: all variables in the NICE registry (n = 88), APACHE II variables (n = 35) and SAPS II variables (n = 26). Data accuracy for all three subsets was determined by comparing the extracted data with the gold standard data. The criteria used for analyzing the accuracy of APACHE II and SAPS II variables were different from those used for all NICE variables.
Criteria for all NICE variables
When assessing data quality for all NICE variables, categorical values, strings, dates and times were judged inaccurate when they were incomplete (an item left blank when, according to the gold standard, it was available) or not equal to the gold standard value. Numerical values were considered inaccurate when they were incomplete or deviated from the gold standard value by a degree greater than was considered acceptable. For example, a deviation in systolic blood pressure of more than 10 mmHg below or above the gold standard systolic blood pressure was considered inaccurate. Detailed criteria are available from the authors.
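These rules can be expressed as a small check. The sketch below is an illustration under stated assumptions: only the 10 mmHg tolerance for systolic blood pressure comes from the text, the function name and signature are hypothetical, and the function assumes the gold standard value is available for the item being judged.

```python
from typing import Optional, Union

Value = Union[str, float, int, None]


def is_accurate(recorded: Value, gold: Value,
                tolerance: Optional[float] = None) -> bool:
    """Judge one data item against the gold standard.

    recorded is None  -> item left blank, so incomplete and inaccurate
    tolerance is None -> categorical, string or date/time: exact match
    otherwise         -> numerical: accurate within +/- tolerance
    """
    if recorded is None:
        return False
    if tolerance is None:
        return recorded == gold
    return abs(recorded - gold) <= tolerance


# Systolic blood pressure, gold standard 120 mmHg, tolerance 10 mmHg.
print(is_accurate(128, 120, tolerance=10))
print(is_accurate(131, 120, tolerance=10))
```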
Criteria for APACHE II variables and SAPS II variables
APACHE II and SAPS II data were judged inaccurate if they caused a deviation from the gold standard score for that particular APACHE II or SAPS II variable. For instance: a recorded mean blood pressure of 130 mmHg instead of the gold standard value of 127 would be considered inaccurate because the first results in 3 APACHE II points for blood pressure and the latter in only 2 APACHE II points.
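The blood pressure example can be reproduced with the APACHE II point bands for mean arterial pressure as published with the score [1]. The sketch below is illustrative only; the function names are hypothetical and only this one variable is covered.

```python
def apache2_map_points(map_mmhg: float) -> int:
    """APACHE II points for mean arterial pressure (mmHg), per the bands in [1]."""
    if map_mmhg >= 160:
        return 4
    if map_mmhg >= 130:
        return 3
    if map_mmhg >= 110:
        return 2
    if map_mmhg >= 70:
        return 0
    if map_mmhg >= 50:
        return 2
    return 4


def is_score_accurate(recorded: float, gold: float) -> bool:
    """A value is accurate when it yields the same points as the gold standard."""
    return apache2_map_points(recorded) == apache2_map_points(gold)


print(apache2_map_points(130), apache2_map_points(127))
print(is_score_accurate(130, 127))
```

Under this criterion a recorded value of 125 mmHg would count as accurate against a gold standard of 127 mmHg, because both fall in the same band, whereas 130 mmHg would not.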
Values are presented as percentage accurate data and as absolute deviations from gold standard APACHE II and SAPS II scores and predicted mortalities. The percentage accurate data per participant per case was calculated by dividing the number of correctly recorded data items by the total number of data items that should have been recorded. A 95% confidence interval was calculated for all medians. For the training group and for the control group, differences in percentages of accurate data or in deviations from gold standard scores between the first and the second scoring were tested using the Wilcoxon signed rank sum test. P < 0.05 was considered statistically significant. We analyzed the type of data errors and compared their frequency of occurrence between the two data extractions by the training group. All data analyses were performed using SPSS software version 10.0 (SPSS Inc, Chicago, IL, USA).
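The two summary quantities described above can be sketched as follows. The numbers are invented for illustration, the function names are hypothetical, and the signed rank test itself (run in SPSS in this study) is not reimplemented here.

```python
from statistics import median


def percent_accurate(n_correct: int, n_expected: int) -> float:
    """Correctly recorded items as a percentage of the items that should
    have been recorded for one participant and one case."""
    if n_expected <= 0:
        raise ValueError("n_expected must be positive")
    return 100.0 * n_correct / n_expected


def median_paired_difference(first, second) -> float:
    """Median of the per-case differences (second minus first extraction)."""
    if len(first) != len(second):
        raise ValueError("extractions must be paired")
    return median([b - a for a, b in zip(first, second)])


# Hypothetical percentages for three cases, before and after training.
before = [70.0, 80.0, 75.0]
after = [78.0, 85.0, 80.0]
print(percent_accurate(66, 88))
print(median_paired_difference(before, after))
```

Note that the median of the paired differences need not equal the difference between the two medians, which is why both are worth reporting.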
Of all 31 training participants, 22 extracted the NICE data for one or more of the three sample patient cases before and after training. A total of 55 sample cases were evaluated.
Training group: intensive care physicians (22)
Control group: intensive care physicians (6), intensive care registrars (2)
Completed patient cases (total)
Prior experience in
Percentage complete and accurate data items
Percentage complete and accurately recorded data items for all NICE variables, APACHE II variables and SAPS II variables
Columns: all NICE variables (n = 88); APACHE II variables (n = 35); SAPS II variables (n = 26)
Rows: training group (60 cases) and control group (24 cases), each at the 1st and 2nd data extraction
Values: 3 (-10 to +9); 1 (-6 to +7); 1 (-10 to +5)
NICE data items with percentage accuracy below 75% before training in the training group (for all NICE variables)
Rows: alveolar–arterial oxygen difference; mean arterial blood pressure; APACHE II diagnosis; urine output (8 hours); Glasgow Coma Scale score; all data items
Types of data errors
Types of data extraction errors and their frequency of occurrence in the training group before and after training (for all NICE variables)
Data error types: nonadherence to data definitions; inclusion of values outside first 24 hours; other (errors that could not directly be accounted for); total inaccurate data
Severity of illness scores and predicted mortalities
Deviation from the gold standard severity-of-illness scores and mortality probabilities for training and control groups
Columns: SAPS II score; SAPS II probability of death; APACHE II score; APACHE II probability of death
Training group (60 cases)
1st data extraction: 7.5 (5 to 10); 0.17 (0.11 to 0.23); 4 (2 to 5); 0.14 (0.11 to 0.21)
2nd data extraction: 4 (3 to 6); 0.09 (0.06 to 0.12); 2.5 (2 to 4); 0.12 (0.07 to 0.17)
Difference: -2 (-4 to 0)*; -0.05 (-0.09 to 0)*; 0 (-1 to +1); 0 (-0.01 to +0.10)
Control group (24 cases)
1st data extraction: 5 (3 to 8); 0.11 (0.07 to 0.19); 3 (2 to 5); 0.11 (0.07 to 0.17)
2nd data extraction: 5.5 (2 to 9); 0.12 (0.04 to 0.19); 2 (1 to 3); 0.10 (0.04 to 0.17)
Difference: 0.5 (0 to 3); 0.008 (0 to 0.07); -1 (-3 to +1); -0.002 (-0.06 to +0.04)
Based on the results of the present study we may conclude that training in data definitions and data collection guidelines improves data quality in general. Before training, many variables were incorrect and incomplete; this was probably due to the fact that participants were unacquainted with most of the data definitions. After training, completeness and adherence to data definitions increased significantly.
It could be argued that the decrease in errors after training was not the result of training but simply the result of extracting the same data twice within 2 months. Therefore, a control group was included, members of which also extracted the data of the same three cases twice with an interval of 4–6 weeks but without any training (or other intervention) in between. In that group, no difference was observed in data accuracy between the first and second data extractions. This suggests that simply assessing the same cases twice did not influence data quality in the training group. However, it cannot be ruled out that data quality in the control group was already optimal at the first data extraction, making it impossible to improve further ('ceiling effect'). Data accuracy in the control group was equal to the data accuracy after training in the training group. We could conclude from this that central and local training sessions (in the control group) are equally effective and that the effect of training remains for a longer period. Alternatively, the difference in baseline data quality might have been a reflection of different characteristics of the two groups. For example, the number of participants in the two groups was not equal, and the control group consisted of physicians from one hospital whereas the participants in the training group were from different sites.
We only evaluated the short-term effects of training on data quality. Further studies are necessary to determine how long these beneficial effects will last.
In the present study we found that, after training, 14% of all data items were still incomplete or inaccurate. We recently examined the quality of data contained in the NICE registry and found a considerably lower error rate (6%) [16]. There are several possible reasons for the different findings in these two studies. First, in the present study the cases were specially selected for evaluation of data quality and contained many artificially incorporated pitfalls in data extraction. Second, in contrast to real data extraction for the NICE registry, no automatic data checks were run on the extracted data. Finally, in reality, data are extracted by the treating physician. For this study, the physicians had to extract data from copies of patient records that they were not familiar with and from patients they had never seen.
The NICE registry is primarily used to calculate severity-of-illness scores and predicted mortality based on these scores. The validity of SAPS II scores and mortality probabilities improved after training, whereas the validity of APACHE II scores and mortality probabilities did not. This difference is probably accounted for by the fact that, before training, only a few participants were experienced in SAPS II data collection and almost all participants were familiar with APACHE II data collection. The deviations from gold standard scores and probabilities found in the present study, even after training, are still considerable and may not support their use in clinical practice. However, the reliability of severity-of-illness scores was found to be sufficient by two other studies [7, 16]. These different findings can be explained by differences in study designs, such as the incorporation of artificial pitfalls in the present study.
Four APACHE II variables, namely temperature, alveolar–arterial oxygen difference, mean arterial blood pressure and APACHE II diagnostic category, exhibited a high percentage of incorrect data before and after training. These variables have similarly been reported by other researchers to have low accuracy and reliability rates [3, 4, 7]. In a study conducted by Chen and coworkers [4], variables involving calculations, such as the alveolar–arterial oxygen difference, were found to have the lowest agreement. Several studies suggest that the ambiguous definitions for some of the APACHE II medical terms are an important cause of the wide interobserver variations [3–5]. Physicians, researchers and decision makers should be aware of the variability in severity-of-illness scores and mortality probabilities, and take it into account. To increase data accuracy and reduce variability on an international level, there should be an international agreement on unambiguous definitions for all variables used in APACHE II and SAPS II models.
A study conducted by Polderman and coworkers [10] showed that training in data extraction reduced the interobserver variability in APACHE II scoring in a single university hospital setting. It is possible that the positive effect of training in their study was overestimated because their training programme focused on the errors observed in the first data extraction episode, before training. The content of our NICE training programme was determined before we started the present study and was not affected by the results of the first data extraction.
Many ICUs collect severity-of-illness scores. Data transcription from a patient record to a case record form is the most commonly used method for collection of these scores. Therefore, we believe that the positive effect of training found in the present study will also be found in other intensive care registries and clinical trials.
Although it is probably not possible to have an intensive care registry that is completely free of errors, this study shows that centrally organized training in data definitions, which is further diffused by the train-the-trainer principle, is an important basis for good data quality.
- Good definitions for all data items are a prerequisite for highly accurate data collection
- Training in data extraction and definitions appears to be effective in improving quality of intensive care data
- Within the scoring system data, the positive training effect was only proven for the SAPS II data, in which the study population was less experienced as compared with APACHE II
- The positive effect of central training may be diffused by the train-the-trainer principle
- Medical registries should implement quality assurance programmes in order to optimize their data quality. These programmes should include training sessions, in combination with other quality assurance procedures
APACHE = Acute Physiology and Chronic Health Evaluation
ICU = intensive care unit
NICE = National Intensive Care Evaluation
SAPS = Simplified Acute Physiology Score.
We thank Jeremy Wyatt for his valuable comments on this manuscript.
- Knaus WA, Draper EA, Wagner DP, Zimmerman JE: APACHE II: a severity of disease classification system. Crit Care Med 1985, 13: 818-829.
- Le Gall JR, Lemeshow S, Saulnier F: A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 1993, 270: 2957-2963. 10.1001/jama.270.24.2957
- Fery-Lemonnier E, Landais P, Loirat P, Kleinknecht D, Brivet F: Evaluation of severity scoring systems in ICUs: translation, conversion and definition ambiguities as a source of interobserver variability in Apache II, SAPS and OSF. Intensive Care Med 1995, 21: 356-360.
- Chen LM, Martin CM, Morrison TL, Sibbald WJ: Interobserver variability in data collection of the APACHE II score in teaching and community hospitals. Crit Care Med 1999, 27: 1999-2004. 10.1097/00003246-199909000-00046
- Holt AW, Bury LK, Bersten AD, Skowronski GA, Vedig AE: Prospective evaluation of residents and nurses as severity score data collectors. Crit Care Med 1992, 20: 1688-1691.
- Goldhill DR, Sumner A: APACHE II, data accuracy and outcome prediction. Anaesthesia 1998, 53: 937-943. 10.1046/j.1365-2044.1998.00534.x
- Damiano AM, Bergner M, Draper EA, Knaus WA, Wagner DP: Reliability of a measure of severity of illness: acute physiology of chronic health evaluation II. J Clin Epidemiol 1992, 45: 93-101.
- Polderman KH, Thijs LG, Girbes AR: Interobserver variability in the use of APACHE II scores. Lancet 1999, 353: 380.
- Polderman K, Christiaans H, Wester J, Spijkstra J, Girbes A: Intra-observer variability in APACHE II scoring. Intensive Care Med 2001, 27: 1550-1552. 10.1007/s001340101033
- Polderman K, Jorna E, Girbes A: Interobserver variability in APACHE II scoring: effect of strict guidelines and training. Intensive Care Med 2001, 27: 1365-1369. 10.1007/s001340101012
- Arts DGT, de Keizer NF, Scheffer GJ: Defining and improving data quality in medical registries: a literature review, case study, and generic framework. J Am Med Inform Assoc 2002, 9: 600-611. 10.1197/jamia.M1087
- National Intensive Care Evaluation (NICE). [http://www.stichtingnice.nl]
- Lemeshow S, Klar J, Teres D, Spitz Avrunin J, Gehlbach S, Rapoport J, Rué M: Mortality probability models for patients in the intensive care unit for 48 or 72 hours: a prospective, multicenter study. Crit Care Med 1994, 22: 1351-1358.
- Knaus WA, Wagner DP, Draper EA, Zimmerman JE, Bergner M, Bastos PG, Sirio CA, Murphy DJ, Lotring T, Damiano A: The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991, 100: 1619-1636.
- Le Gall JR, Klar J, Lemeshow S, Saulnier F, Alberti C, Artigas A, Teres D: The Logistic Organ Dysfunction system. A new way to assess organ dysfunction in the intensive care unit. ICU Scoring Group. JAMA 1996, 276: 802-810. 10.1001/jama.276.10.802
- Arts D, de Keizer N, Scheffer G, de Jonge E: Quality of data collected for severity of illness scores in the Dutch National Intensive Care Evaluation (NICE) registry. Intensive Care Med 2002, 28: 656-659. 10.1007/s00134-002-1272-z