Statistics review 1: Presenting and summarising data
© BioMed Central Ltd 2002
Published: 29 November 2001
The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care. The first step in any analysis is to describe and summarize the data. This step is a chance to become familiar with the data and also an opportunity to look for unusually high or low values (outliers), to check the assumptions required for statistical tests, and to decide the best way to categorize the data if this is necessary. In addition to tables and graphs, summary values are a convenient way to summarize large amounts of information. This review introduces some of these measures. It describes and gives examples of qualitative data (unordered and ordered) and quantitative data (discrete and continuous); how these types of data can be represented graphically; the two important features of a quantitative dataset (location and variability); the measures of location (mean, median and mode); the measures of variability (range, interquartile range, standard deviation and variance); common distributions of clinical data; and simple transformations of positively skewed data.
Tables and graphs provide a convenient simple picture of a set of data (dataset), but it is often necessary to further summarize quantitative data, for example for hypothesis testing. The two most important elements of a dataset are its location (where on average the data lie) and its variability (the extent to which individual data values deviate from the location). There are several different measures of location and variability that can be calculated, and the choice of which to use depends on individual circumstances.
The mean is the best-known average value. It is calculated by summing all of the values in a dataset and dividing the sum by the total number of values. The algebraic notation for the mean of a set of n values (X1, X2,...,Xn) is:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Of all the measures of location, the mean is the most commonly used because it is easily understood and has useful mathematical properties that make it convenient for use in many statistical contexts. It is strongly influenced by extreme values (outliers), however, and is most representative when the data are symmetrically distributed (see below).
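The calculation of the mean, and its sensitivity to a single extreme value, can be sketched in Python as follows. The values and the function name arithmetic_mean are assumptions introduced purely for illustration; they are not taken from the data in this review.

```python
# Hypothetical values, chosen only for illustration (not from Table 1).
values = [8.0, 9.0, 10.0, 11.0, 12.0]
with_outlier = values + [40.0]  # the same data plus one extreme value


def arithmetic_mean(xs):
    """Sum of all values divided by the number of values."""
    return sum(xs) / len(xs)


print(arithmetic_mean(values))        # 10.0
print(arithmetic_mean(with_outlier))  # 15.0, pulled upward by the outlier
```

In this made-up example a single outlying value moves the mean from 10.0 to 15.0, a value larger than five of the six observations.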
The median is the central value when all observations are sorted in order. If there is an odd number of observations then it is simply the middle value; if there is an even number of observations then it is the average of the middle two. The median does not have the beneficial mathematical properties of the mean. However, it is not generally influenced by extreme values (outliers), and as a result it is particularly useful in situations where there are unusually low or high values that would render the mean unrepresentative of the data.
The mode is simply the most commonly occurring value in the data. It is not generally used because it is often not representative of the data, particularly when the dataset is small.
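The median and mode can be sketched in the same way using Python's standard statistics module. The two small samples below are hypothetical and serve only to show the odd and even cases described above.

```python
from statistics import median, multimode

# Hypothetical values for illustration only (not the haemoglobin data).
odd_sample = [7.5, 9.5, 9.5, 10.5, 30.0]          # 5 observations
even_sample = [7.5, 9.5, 9.5, 10.5, 11.0, 30.0]   # 6 observations

# Odd count: the middle value of the sorted data.
median_odd = median(odd_sample)
# Even count: the average of the two middle values (9.5 and 10.5).
median_even = median(even_sample)
# multimode returns every value tied for most frequent, as a list.
modes = multimode(even_sample)

print(median_odd)   # 9.5
print(median_even)  # 10.0
print(modes)        # [9.5]
```

Note that the extreme value of 30.0 has no effect on either median, which is the property that makes the median useful when outliers are present.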
Example of calculating location
Haemoglobin (g/dl) from 48 intensive care patients
Mean, median and mode of haemoglobin measurements from 48 intensive care patients listed in Table 1
There are 48 observations in this dataset and so the median is the average of the 24th and 25th (i.e. the average of 9.7 and 9.9 = 9.8 g/dl)
Several values appear twice in this dataset; 9.9 appears three times and 9.4 appears four times. No value appears more than four times and so the mode is 9.4 g/dl
Notice that the mean and the median are similar. This is because the data are approximately symmetrical. In general, the mean, median and mode will be similar in a dataset that has a symmetrical distribution with a single peak, such as that shown in Fig. 2. However, the dataset presented here is rather small and so the mode is not such a good measure of location.
As with location, there are a number of different measures of variability. The simplest of these is probably the range, which is the difference between the largest and smallest observation in the dataset. The disadvantage of this measure is that it is based on only two of the observations and may not be representative of the whole dataset, particularly if there are outliers. In addition, it gives no information regarding how the data are distributed between the two extremes.
An alternative to the range is the interquartile range. Quartiles are calculated in a similar way to the median; the median splits a dataset into two equally sized groups, tertiles split the data into three (approximately) equally sized groups, quartiles into four, quintiles into five, and so on. The interquartile range is the range between the bottom and top quartiles, and indicates where the middle 50% of the data lie. Like the median, the interquartile range is not influenced by unusually high or low values and may be particularly useful when data are not symmetrically distributed. Ranges based on alternative subdivisions of the data can also be calculated; for example, if the data are split into deciles, 80% of the data will lie between the bottom and top deciles and so on.
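The contrast between the range and the interquartile range can be illustrated with a short Python sketch. The data are hypothetical, and the helper names value_range and interquartile_range are assumptions made for this example. The quartiles come from statistics.quantiles with its default method; other software may use slightly different quartile definitions and give slightly different values.

```python
from statistics import quantiles

# Hypothetical data; the second list replaces the largest value with an outlier.
data = [1, 2, 3, 4, 5, 6, 7]
data_with_outlier = [1, 2, 3, 4, 5, 6, 70]


def value_range(xs):
    """Difference between the largest and smallest observation."""
    return max(xs) - min(xs)


def interquartile_range(xs):
    """Distance between the lower and upper quartiles."""
    # quantiles(xs, n=4) returns three cut points:
    # lower quartile, median, upper quartile.
    lower, _median, upper = quantiles(xs, n=4)
    return upper - lower


print(value_range(data), value_range(data_with_outlier))
print(interquartile_range(data), interquartile_range(data_with_outlier))
```

In this example the outlier changes the range from 6 to 69, whereas the interquartile range is 4 in both cases.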
The standard deviation is a measure of the degree to which individual observations in a dataset deviate from the mean value. Broadly, it is the average deviation from the mean across all observations. It is calculated by squaring the difference of each individual observation from the mean (squared to remove any negative differences), adding them together, dividing by the total number of observations minus 1, and taking the square root of the result.
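Algebraically the standard deviation for a set of n values (X1, X2,...,Xn) is written as follows:
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
where $\bar{X}$ is the mean of the n values.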
Another measure of variability that may be encountered is the variance. This is simply the square of the standard deviation:
$$SD^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
The variance is not generally used in data description but is central to analysis of variance (covered in a subsequent review in this series).
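The steps described for the standard deviation (square each deviation from the mean, sum, divide by n minus 1, take the square root) can be written out directly. The following Python sketch uses hypothetical data and the assumed helper names sample_variance and sample_sd, and checks the result against the standard library.

```python
import math
from statistics import stdev, variance

# Hypothetical values, chosen so that the arithmetic is easy to follow.
data = [6, 6, 10, 14, 14]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = [(x - mean) ** 2 for x in xs]
    return sum(squared_deviations) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


print(sample_variance(data))  # 16.0 (deviations of -4, -4, 0, 4, 4)
print(sample_sd(data))        # 4.0

# The standard library uses the same n - 1 divisor.
assert math.isclose(sample_variance(data), variance(data))
assert math.isclose(sample_sd(data), stdev(data))
```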
Example of calculating variability
Range, interquartile range and standard deviation of haemoglobin measurements from 48 intensive care patients listed in Table 1
The values in this dataset range from 5.4 to 14.1 g/dl
The median calculated in Table 2 splits the data into two equally sized groups. The lower and upper quartiles split the data into four equally sized groups (4 × 12) and are therefore most easily defined as the average of the 12th and 13th observations for the lower quartile and of the 36th and 37th observations for the upper quartile. In other words, the lower and upper quartiles are 8.7 and 10.8 g/dl, respectively. (There are more complicated methods for calculating the interquartile range, but these will not generally give markedly different results.)
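The simple averaging rule used here can be sketched in Python. The function simple_quartiles is a hypothetical helper that assumes the number of observations is a multiple of four, as it is for 48 patients. Because Table 1 is not reproduced here, the placeholder data are the integers 1 to 48, so that each value equals its rank and the positions being averaged are easy to verify.

```python
def simple_quartiles(values):
    """Quartiles by the simple averaging rule.

    Assumes the number of observations is a multiple of 4.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a non-empty dataset "
                         "whose size is a multiple of 4")
    q = n // 4  # size of each quarter (12 when n = 48)
    # Lists are 0-indexed, so the q-th observation is ordered[q - 1].
    lower = (ordered[q - 1] + ordered[q]) / 2          # 12th and 13th
    med = (ordered[2 * q - 1] + ordered[2 * q]) / 2    # 24th and 25th
    upper = (ordered[3 * q - 1] + ordered[3 * q]) / 2  # 36th and 37th
    return lower, med, upper


# Placeholder data: the integers 1..48, so each value equals its rank.
ranks = list(range(1, 49))
lower, med, upper = simple_quartiles(ranks)
print(lower, med, upper)  # 12.5 24.5 36.5
```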
Standard deviation (SD)
Using the formula given above:
The interquartile range (which contains the central 50% of the data) gives a better indication of the general shape of the distribution, and indicates that 50% of all observations fall in a rather narrower range (from 8.7 to 10.8 g/dl). In addition, the median and mean both fall approximately in the centre of the interquartile range, which suggests that the distribution is reasonably symmetrical.
The standard deviation in isolation does not provide a great deal of information, although it is sometimes expressed as a percentage of the mean, known as the coefficient of variation. However, it is often used to calculate another extremely useful quantity known as the reference range; this will be covered in more detail in the next article.
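As a brief illustration, the coefficient of variation can be computed as follows. The data and the function name coefficient_of_variation are assumptions made for this sketch.

```python
from statistics import mean, stdev


def coefficient_of_variation(xs):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * stdev(xs) / mean(xs)


# Hypothetical values with mean 10 and standard deviation 4.
data = [6, 6, 10, 14, 14]
cv = coefficient_of_variation(data)
print(cv)  # 40.0, i.e. the SD is 40% of the mean
```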
Common distributions and simple transformations
Quantitative clinical data follow a wide variety of distributions, but the majority are unimodal, meaning that the data have a single (modal) peak with a tail on either side. The most common of these unimodal distributions are symmetrical, as shown in Fig. 2, with a peak in the centre of the data and evenly balanced tails on the right and left.
The mean of these data is 12.25 mmol/l (A) and the median is 9 mmol/l (B), as indicated in Fig. 3. In a positively skewed distribution the median will usually be smaller than the mean because the mean is strongly influenced by the extreme values in the right-hand tail, and may therefore be less representative of the data as a whole. However, it is possible to transform data of this type in order to obtain a more representative mean value. This type of transformation is also useful when statistical tests require data to be more symmetrically distributed (see subsequent reviews in this series for details). There is a wide range of transformations that can be used in this context, but the most commonly used with positively skewed data is the logarithmic transformation.
Raw and logarithmically transformed serum urea levels
Calculations and statistical tests can now be carried out on the transformed data before converting the results back to the original scale. For example, the mean of the transformed serum urea data is 2.19. To transform this value back to the original scale, the antilog (or exponential in the case of natural, base e logarithms) is applied. This gives a 'geometric mean' of 8.94 mmol/l on the original scale (C in Fig. 3), the term 'geometric' indicating that calculations have been carried out on the logarithmic scale. This is in contrast to the standard (arithmetic) mean value (calculated on the original scale) of 12.25 mmol/l (A in Fig. 3). Looking at Fig. 3, it is clear that the geometric mean is more representative of the data than the arithmetic mean.
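The transform, average and back-transform sequence can be sketched in Python. The skewed values below are hypothetical (they are not the serum urea data) and natural logarithms are assumed. The final lines simply check the back-transformation of the transformed mean of 2.19 quoted above.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical positively skewed values (not the serum urea data).
skewed = [2, 4, 8, 16, 32]

# Step 1: transform to the (natural) logarithmic scale.
logged = [math.log(x) for x in skewed]
# Step 2: calculate the mean on the transformed scale.
mean_log = mean(logged)
# Step 3: back-transform with the exponential to get the geometric mean.
geo_mean = math.exp(mean_log)

arith_mean = mean(skewed)
print(arith_mean)          # 12.4
print(round(geo_mean, 2))  # 8.0

# The standard library offers the same calculation directly.
assert math.isclose(geo_mean, geometric_mean(skewed))

# Back-transforming the transformed mean quoted in the text.
back_transformed = math.exp(2.19)
print(round(back_transformed, 2))  # 8.94
```

In this made-up example the geometric mean (8.0) coincides with the median, whereas the arithmetic mean (12.4) is pulled towards the right-hand tail.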
Finally, it is possible that data may arise with more than one (modal) peak. These data can be difficult to manage and it may be the case that neither the mean nor the median is a representative measure. However, such distributions are rare and may well be artefactual. For example, a (bimodal) distribution with two peaks may actually be a combination of two unimodal distributions (such as hormone levels in men and women). Alternatively, a (multimodal) distribution with multiple peaks may be due to digit preference (rounding observations up or down) during data collection, where peaks appear at round numbers, for example peaks in systolic blood pressure at 90, 100, 110, 120 mmHg, and so on. In such cases appropriate subdivision, categorization, or even re-collection of the data may be required to eliminate the problem.
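One simple way to look for digit preference is to tabulate the final digit of each recorded value. The Python sketch below is an informal screening check and not a formal test. The blood pressure readings and the function name terminal_digit_counts are hypothetical.

```python
from collections import Counter


def terminal_digit_counts(readings):
    """Count how often each final digit (0-9) occurs among integer readings."""
    return Counter(r % 10 for r in readings)


# Hypothetical systolic blood pressure readings (mmHg).
readings = [90, 100, 110, 120, 118, 122, 130, 100, 110, 95]

counts = terminal_digit_counts(readings)
share_zero = counts[0] / len(readings)

print(counts[0])   # 7 readings end in 0
print(share_zero)  # 0.7
```

If values were recorded without rounding, each final digit would be expected in roughly 10% of readings, so a share of 70% ending in zero would suggest rounding to the nearest 10 mmHg.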