Statistics review 7: Correlation and regression
- Viv Bewick^{1},
- Liz Cheek^{1} and
- Jonathan Ball^{2}
DOI: 10.1186/cc2401
© BioMed Central Ltd 2003
Published: 5 November 2003
The present review introduces methods of analyzing the relationship between two quantitative variables. The calculation and interpretation of the sample product moment correlation coefficient and the linear regression equation are discussed and illustrated. Common misuses of the techniques are considered. Tests and confidence intervals for the population parameters are described, and failures of the underlying assumptions are highlighted.
The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation. For example, in patients attending an accident and emergency unit (A&E), we could use correlation and regression to determine whether there is a relationship between age and urea level, and whether the level of urea can be predicted for a given age.
Age and ln urea for 20 patients attending an accident and emergency unit
Subject | Age (years) | ln urea |
---|---|---|
1 | 60 | 1.099 |
2 | 76 | 1.723 |
3 | 81 | 2.054 |
4 | 89 | 2.262 |
5 | 44 | 1.686 |
6 | 58 | 1.988 |
7 | 55 | 1.131 |
8 | 74 | 1.917 |
9 | 45 | 1.548 |
10 | 67 | 1.386 |
11 | 72 | 2.617 |
12 | 91 | 2.701 |
13 | 76 | 2.054 |
14 | 39 | 1.526 |
15 | 71 | 2.002 |
16 | 56 | 1.526 |
17 | 77 | 1.825 |
18 | 37 | 1.435 |
19 | 64 | 2.460 |
20 | 84 | 1.932 |
On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship between two variables. To quantify the strength of the relationship, we can calculate the correlation coefficient. In algebraic notation, if we have two variables x and y, and the data take the form of n pairs (i.e. [x_{1}, y_{1}], [x_{2}, y_{2}], [x_{3}, y_{3}] ... [x_{n}, y_{n}]), then the correlation coefficient is given by the following equation:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $\bar{x}$ is the mean of the x values, and $\bar{y}$ is the mean of the y values.
For the A&E data, the correlation coefficient is 0.62, indicating a moderate positive linear relationship between the two variables.
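As a minimal sketch, the coefficient can be computed directly from this formula in Python using the data in the table above (the function name `pearson_r` is our own):

```python
import math

# Age and ln urea for the 20 A&E patients (from the table above)
age = [60, 76, 81, 89, 44, 58, 55, 74, 45, 67, 72, 91, 76, 39, 71, 56, 77, 37, 64, 84]
ln_urea = [1.099, 1.723, 2.054, 2.262, 1.686, 1.988, 1.131, 1.917, 1.548, 1.386,
           2.617, 2.701, 2.054, 1.526, 2.002, 1.526, 1.825, 1.435, 2.460, 1.932]

def pearson_r(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(age, ln_urea)
print(round(r, 2))  # 0.62
```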
5% and 1% points for the distribution of the correlation coefficient under the null hypothesis that the population correlation is 0 in a two-tailed test
Sample size | P = 0.05 | P = 0.01 | Sample size | P = 0.05 | P = 0.01
---|---|---|---|---|---
3 | 1.00 | 1.00 | 23 | 0.41 | 0.53 |
4 | 0.95 | 0.99 | 24 | 0.40 | 0.52 |
5 | 0.88 | 0.96 | 25 | 0.40 | 0.51 |
6 | 0.81 | 0.92 | 26 | 0.39 | 0.50 |
7 | 0.75 | 0.87 | 27 | 0.38 | 0.49 |
8 | 0.71 | 0.83 | 28 | 0.37 | 0.48 |
9 | 0.67 | 0.80 | 29 | 0.37 | 0.47 |
10 | 0.63 | 0.76 | 30 | 0.36 | 0.46 |
11 | 0.60 | 0.73 | 40 | 0.31 | 0.40 |
12 | 0.58 | 0.71 | 50 | 0.28 | 0.36 |
13 | 0.55 | 0.68 | 60 | 0.25 | 0.33 |
14 | 0.53 | 0.66 | 70 | 0.24 | 0.31 |
15 | 0.51 | 0.64 | 80 | 0.22 | 0.29 |
16 | 0.50 | 0.62 | 90 | 0.21 | 0.27 |
17 | 0.48 | 0.61 | 100 | 0.20 | 0.26 |
18 | 0.47 | 0.59 | 110 | 0.19 | 0.24 |
19 | 0.46 | 0.58 | 120 | 0.18 | 0.23 |
20 | 0.44 | 0.56 | 130 | 0.17 | 0.23 |
21 | 0.43 | 0.55 | 140 | 0.17 | 0.22 |
22 | 0.42 | 0.54 | 150 | 0.16 | 0.21 |
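The tabulated critical values can be reproduced from the t distribution: under the null hypothesis of zero population correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ follows a t distribution with n - 2 degrees of freedom, and inverting this relationship gives the critical r. A minimal sketch in Python (the function name `critical_r` is our own; the t value 2.101 for 18 degrees of freedom and two-tailed P = 0.05 is taken from standard tables):

```python
import math

def critical_r(t_crit, n):
    """Invert t = r*sqrt(n-2)/sqrt(1-r^2) to find the critical correlation."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# For n = 20, the two-tailed 5% t value on 18 df is 2.101 (standard tables)
r_crit = critical_r(2.101, 20)
print(round(r_crit, 2))  # 0.44, matching the table row for sample size 20

# The observed r = 0.62 exceeds this critical value, so P < 0.05
t_obs = 0.62 * math.sqrt(18) / math.sqrt(1 - 0.62 ** 2)
print(round(t_obs, 2))  # 3.35
```

Note that 3.35 is the same t value as that obtained for the regression gradient below, illustrating the equivalence of the two tests.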
Although the hypothesis test indicates whether there is a linear relationship, it gives no indication of the strength of that relationship. This additional information can be obtained from a confidence interval for the population correlation coefficient.
To calculate a confidence interval, r must be transformed to give a Normal distribution making use of Fisher's z transformation [2]:

$$z_r = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$
The standard error [3] of z_{r} is approximately:

$$SE(z_r) = \frac{1}{\sqrt{n-3}}$$
and hence a 95% confidence interval for the true population value of the transformed correlation coefficient z_{r} is given by z_{r} - (1.96 × standard error) to z_{r} + (1.96 × standard error). Because z_{r} is Normally distributed, the interval extending 1.96 standard errors either side of the statistic covers the population value with 95% confidence.
For the A&E data the transformed correlation coefficient z_{r} between ln urea and age is:

$$z_r = \frac{1}{2}\ln\left(\frac{1+0.62}{1-0.62}\right) = 0.725$$
The standard error of z_{r} is:

$$SE(z_r) = \frac{1}{\sqrt{20-3}} = 0.242$$
The 95% confidence interval for z_{r} is therefore 0.725 - (1.96 × 0.242) to 0.725 + (1.96 × 0.242), giving 0.251 to 1.199.
We must use the inverse of Fisher's transformation on the lower and upper limits of this confidence interval to obtain the 95% confidence interval for the correlation coefficient. The lower limit is:

$$\frac{e^{2 \times 0.251} - 1}{e^{2 \times 0.251} + 1}$$

giving 0.25, and the upper limit is:

$$\frac{e^{2 \times 1.199} - 1}{e^{2 \times 1.199} + 1}$$

giving 0.83. Therefore, we are 95% confident that the population correlation coefficient is between 0.25 and 0.83.
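These steps can be sketched in Python; the inverse of Fisher's transformation is the hyperbolic tangent, and small differences from the published limits reflect rounding of intermediate values:

```python
import math

r, n = 0.62, 20
z_r = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's transformation, 0.725
se = 1 / math.sqrt(n - 3)                  # standard error, approximately 0.242
lo_r = math.tanh(z_r - 1.96 * se)          # back-transform lower limit
hi_r = math.tanh(z_r + 1.96 * se)          # back-transform upper limit
print(f"{lo_r:.2f} to {hi_r:.2f}")         # close to the published 0.25 to 0.83
```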
The width of the confidence interval clearly depends on the sample size, and therefore it is possible to calculate the sample size required for a given level of accuracy. For an example, see Bland [4].
There are a number of common situations in which the correlation coefficient can be misinterpreted.
One of the most common errors in interpreting the correlation coefficient is failure to consider that there may be a third variable related to both of the variables being investigated, which is responsible for the apparent correlation. Correlation does not imply causation. To strengthen the case for causality, consideration must be given to other possible underlying variables and to whether the relationship holds in other populations.
A nonlinear relationship may exist between two variables that would be inadequately described, or possibly even undetected, by the correlation coefficient.
It is important that the values of one variable are not determined in advance or restricted to a certain range. This may lead to an invalid estimate of the true correlation coefficient because the subjects are not a random sample.
Another situation in which a correlation coefficient is sometimes misinterpreted is when comparing two methods of measurement. A high correlation can be incorrectly taken to mean that there is agreement between the two methods. An analysis that investigates the differences between pairs of observations, such as that formulated by Bland and Altman [5], is more appropriate.
In the A&E example we are interested in the effect of age (the predictor or x variable) on ln urea (the response or y variable). We want to estimate the underlying linear relationship so that we can predict ln urea (and hence urea) for a given age. Regression can be used to find the equation of this line. This line is usually referred to as the regression line.
Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis.
The regression line is obtained using the method of least squares. Any line y = a + bx that we draw through the points gives a predicted or fitted value of y for each value of x in the data set. For a particular value of x the vertical difference between the observed and fitted value of y is known as the deviation, or residual (Fig. 8). The method of least squares finds the values of a and b that minimise the sum of the squares of all the deviations. This gives the following formulae for calculating a and b:

$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}$$
Usually, these values would be calculated using a statistical package or the statistical functions on a calculator.
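As a minimal sketch of these formulae in Python, applied to the A&E data (the function name `least_squares` is our own); the estimates agree with the published values of roughly 0.72 and 0.017 to within the rounding of the tabulated ln urea values:

```python
age = [60, 76, 81, 89, 44, 58, 55, 74, 45, 67, 72, 91, 76, 39, 71, 56, 77, 37, 64, 84]
ln_urea = [1.099, 1.723, 2.054, 2.262, 1.686, 1.988, 1.131, 1.917, 1.548, 1.386,
           2.617, 2.701, 2.054, 1.526, 2.002, 1.526, 1.825, 1.435, 2.460, 1.932]

def least_squares(x, y):
    """Return intercept a and gradient b minimising the sum of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

a, b = least_squares(age, ln_urea)
print(round(a, 2), round(b, 3))  # roughly 0.72 and 0.017
```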
We can test the null hypotheses that the population intercept and gradient are each equal to 0 using test statistics given by the estimate of the coefficient divided by its standard error.
The test statistics are compared with the t distribution on n - 2 (sample size - number of regression coefficients) degrees of freedom [4].
The 95% confidence interval for each of the population coefficients is calculated as follows: coefficient ± (t_{n-2} × the standard error), where t_{n-2} is the two-tailed 5% point for a t distribution with n - 2 degrees of freedom.
Regression parameter estimates, P values and confidence intervals for the accident and emergency unit data
Coefficient | Standard error of coefficient | t | P | Confidence interval | |
---|---|---|---|---|---|
Constant, or intercept | 0.72 | 0.346 | 2.07 | 0.054 | -0.01 to +1.45 |
Age | 0.017 | 0.005 | 3.35 | 0.004 | 0.006 to 0.028 |
Small data set with the fitted values from the regression, the deviations and their sums of squares
x (mean x = 16) | y (mean y = 38) | Fitted y = 6 + 2x | Unexplained deviation = y - fitted y | Explained deviation = fitted y - mean y | Total deviation = y - mean y |
---|---|---|---|---|---|
10 | 22 | 26 | -4 | -12 | -16 |
10 | 28 | 26 | 2 | -12 | -10 |
20 | 42 | 46 | -4 | 8 | 4 |
20 | 48 | 46 | 2 | 8 | 10 |
19 | 40 | 44 | -4 | 6 | 2 |
17 | 48 | 40 | 8 | 2 | 10 |
Sum of squares | | | 120 | 456 | 576 |
Analysis of variance for a small data set
Source of variation | Degrees of freedom | Sum of squares | Mean square | F | P |
---|---|---|---|---|---|
Regression | 1 | 456 | 456 | 15.2 | 0.018 |
Residual | 4 | 120 | 30 | ||
Total | 5 | 576 | | |
If there were no linear relationship between the variables then the regression mean squares would be approximately the same as the residual mean squares. We can test the null hypothesis that there is no linear relationship using an F test. The test statistic is calculated as the regression mean square divided by the residual mean square, and a P value may be obtained by comparison of the test statistic with the F distribution with 1 and n - 2 degrees of freedom [2]. Usually, this analysis is carried out using a statistical package that will produce an exact P value. In fact, the F test from the analysis of variance is equivalent to the t test of the gradient for regression with only one predictor. This is not the case with more than one predictor, but this will be the subject of a future review. As discussed above, the test for gradient is also equivalent to that for the correlation, giving three tests with identical P values. Therefore, when there is only one predictor variable it does not matter which of these tests is used.
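For the small data set above, the analysis of variance can be reproduced directly (a minimal sketch):

```python
# Small data set from the table above, with fitted line y = 6 + 2x
x = [10, 10, 20, 20, 19, 17]
y = [22, 28, 42, 48, 40, 48]
n = len(x)
my = sum(y) / n
fitted = [6 + 2 * xi for xi in x]

ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained (residual)
ss_reg = sum((fi - my) ** 2 for fi in fitted)              # explained (regression)
ss_tot = sum((yi - my) ** 2 for yi in y)                   # total

# F = regression mean square / residual mean square, on 1 and n - 2 df
F = (ss_reg / 1) / (ss_res / (n - 2))
print(ss_reg, ss_res, ss_tot, F)  # 456.0 120.0 576.0 15.2
```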
Analysis of variance for the accident and emergency unit data
Source of variation | Degrees of freedom | Sum of squares | Mean square | F | P |
---|---|---|---|---|---|
Regression | 1 | 1.462 | 1.462 | 11.24 | 0.004 |
Residual | 18 | 2.342 | 0.130 | ||
Total | 19 | 3.804 | | |
Another useful quantity that can be obtained from the analysis of variance is the coefficient of determination (R^{2}).
It is the proportion of the total variation in y accounted for by the regression model. Values of R^{2} close to 1 imply that most of the variability in y is explained by the regression model. R^{2} is the same as r^{2} in regression when there is only one predictor variable.
For the A&E data, R^{2} = 1.462/3.804 = 0.38 (i.e. the same as 0.62^{2}), and therefore age accounts for 38% of the total variation in ln urea. This means that 62% of the variation in ln urea is not accounted for by age differences. This may be due to inherent variability in ln urea or to other unknown factors that affect the level of ln urea.
The fitted value of y for a given value of x is an estimate of the population mean of y for that particular value of x. As such it can be used to provide a confidence interval for the population mean [3]. The fitted values change as x changes, and therefore the confidence intervals will also change.
The 95% confidence interval for the fitted value of y for a particular value of x, say x_{p}, is again calculated as fitted y ± (t_{n-2} × the standard error). The standard error is given by:

$$SE = s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

where s is the square root of the residual mean square.
The fitted value for y also provides a predicted value for an individual, and a prediction interval or reference range [3] can be obtained (Fig. 10). The prediction interval is calculated in the same way as the confidence interval but the standard error is given by:

$$SE = s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
For example, the 95% prediction interval for the ln urea for a patient aged 60 years is 0.97 to 2.52 units. This transforms to urea values of 2.64 to 12.43 mmol/l.
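This calculation can be sketched in Python for the A&E data; s is the residual standard deviation and 2.101 is the two-tailed 5% t value on 18 degrees of freedom (standard tables). The limits agree with the published 0.97 to 2.52 to within rounding:

```python
import math

age = [60, 76, 81, 89, 44, 58, 55, 74, 45, 67, 72, 91, 76, 39, 71, 56, 77, 37, 64, 84]
ln_urea = [1.099, 1.723, 2.054, 2.262, 1.686, 1.988, 1.131, 1.917, 1.548, 1.386,
           2.617, 2.701, 2.054, 1.526, 2.002, 1.526, 1.825, 1.435, 2.460, 1.932]

n = len(age)
mx, my = sum(age) / n, sum(ln_urea) / n
sxx = sum((xi - mx) ** 2 for xi in age)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(age, ln_urea)) / sxx
a = my - b * mx

# residual standard deviation s (square root of the residual mean square)
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(age, ln_urea))
s = math.sqrt(ss_res / (n - 2))

x_p = 60                                   # patient aged 60 years
fitted = a + b * x_p
se_pred = s * math.sqrt(1 + 1 / n + (x_p - mx) ** 2 / sxx)
t_crit = 2.101                             # t value for 18 df, two-tailed P = 0.05
lo, hi = fitted - t_crit * se_pred, fitted + t_crit * se_pred
print(f"ln urea: {lo:.2f} to {hi:.2f}")    # about 0.97 to 2.52
print(f"urea: {math.exp(lo):.2f} to {math.exp(hi):.2f} mmol/l")
```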
Both confidence intervals and prediction intervals become wider for values of the predictor variable further from the mean.
The use of correlation and regression depends on some underlying assumptions. The observations are assumed to be independent. For correlation both variables should be random variables, but for regression only the response variable y must be random. In carrying out hypothesis tests or calculating confidence intervals for the regression parameters, the response variable should have a Normal distribution and the variability of y should be the same for each value of the predictor variable. The same assumptions are needed in testing the null hypothesis that the correlation is 0, but in order to interpret confidence intervals for the correlation coefficient both variables must be Normally distributed. Both correlation and regression assume that the relationship between the two variables is linear.
In addition, a Normal plot of residuals can be produced. This is a plot of the residuals against the values they would be expected to take if they came from a standard Normal distribution (Normal scores). If the residuals are Normally distributed, then this plot will show a straight line. (A standard Normal distribution is a Normal distribution with mean = 0 and standard deviation = 1.) Normal plots are usually available in statistical packages.
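A sketch of how the Normal scores themselves can be computed; this uses Blom's rankit approximation, one common convention (statistical packages may use slightly different formulae), and the residuals shown are hypothetical values for illustration only:

```python
from statistics import NormalDist

# illustrative residuals from some fitted regression (hypothetical values)
residuals = sorted([-0.21, -0.14, -0.09, -0.03, 0.02, 0.05, 0.11, 0.18])
n = len(residuals)

# Blom's approximation to the expected Normal order statistics (Normal scores)
normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]

# plotting residuals against normal_scores should give a rough straight line
for res, score in zip(residuals, normal_scores):
    print(f"{score:6.2f}  {res:6.2f}")
```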
When using a regression equation for prediction, errors in prediction may not be purely random but may also reflect inadequacies in the model. In particular, extrapolating beyond the range of the data is very risky.
A phenomenon to be aware of that may arise with repeated measurements on individuals is regression to the mean. For example, if repeat measures of blood pressure are taken, then patients with higher than average values on their first reading will tend to have lower readings on their second measurement. Therefore, the difference between their second and first measurements will tend to be negative. The converse is true for patients with lower than average readings on their first measurement, resulting in an apparent rise in blood pressure. This could lead to misleading interpretations, for example that there may be an apparent negative correlation between change in blood pressure and initial blood pressure.
Both correlation and simple linear regression can be used to examine the presence of a linear relationship between two variables providing certain assumptions about the data are satisfied. The results of the analysis, however, need to be interpreted with care, particularly when looking for a causal relationship or when using the regression equation for prediction. Multiple and logistic regression will be the subject of future reviews.
A&E = accident and emergency unit; ln = natural logarithm (logarithm base e).