Clinical Chemistry 52: 1848-1850, 2006;
10.1373/clinchem.2006.068296
© 2006 American Association for Clinical Chemistry, Inc.
Information for Authors: Is the Advice Regarding the Reporting of Residuals in Regression Analysis Incomplete? Should Cook's Distance Be Included?
A. Ralph Henderson1
Department of Biochemistry, University of Western Ontario, London, Ontario, Canada
Introduction
If regression analysis is used for statistical evaluation of the data, authors must supply ... standard deviations of residuals (Sy|x, often called standard errors of estimates)... Residuals plots [e.g., Bland-Altman] are often useful.
Extract from "Information for Authors" (2006)
The Clinical Chemistry "Information for Authors" recommends that, when regression analysis is used, SDs of residuals must be supplied. (They are not always provided.) As Cook and Weisberg note (1), this conceptual approach dates back to the early 1960s, but by the late 1970s, attention was increasingly directed to assessing the influence of individual observations on the results of regression analysis.
The concept of influence (or leverage) can be illustrated by 2 simple examples. In Fig. 1A, the regression line is shown for 4 in-line cases. When case 5 is added, the new regression line is slightly leveraged toward it (Fig. 1C), but note that the case 5 residual is large (Fig. 1E) and the regression lines are nearly parallel. However, when case 5 (Fig. 1B) is added, the new regression line is much more influenced by its presence (Fig. 1D). This case forces the regression line close to it, and its residual is correspondingly small (Fig. 1F). What are the differences between these 2 cases? When an outlier is close to the mean value of x (as in case 5 in Fig. 1A), its influence is small (Fig. 1C), whereas when the outlier is a long way from the mean value of x and out of line of the initial regression line (as in case 5 in Fig. 1B), its influence is large (Fig. 1D). It will be noted that the respective values of the residuals do not reflect the effect of influence, because an influential case may decrease the magnitude of the residual. Supplemental Fig. 1 displays the same data with Deming regression (see Fig. 1 in the Data Supplement that accompanies the online version of this Opinion at http://www.clinchem.org/content/vol52/issue10). The Deming residuals mirror those shown in Fig. 1, although they are of greater magnitude. Again, no relationship exists between the residuals and leverage.

Figure 1. The effect of an outlier on the resulting least squares regression line.
Solid lines show the regression line with all 5 cases. (A), when case 5 outlier is near to the mean of the x values. (B), when case 5 outlier is distant from the mean of the x values. (C), leverage values for all 5 cases in A. (D), leverage values for all 5 cases in B. (E), crude residual values for all 5 cases in A. (F), crude residual values for all 5 cases in B.
Leverage
How is leverage calculated? Each individual case possesses leverage ($h_i$), which is determined as follows (1), where the subscripts (i or j) denote the ith or jth case of a sample.
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \qquad (1)$$
The minimum possible value of $h_i$ is $1/n$ when $x_i = \bar{x}$, and obviously, $h_i$ increases as $x_i$ moves away from $\bar{x}$. Leverage may take any positive value between 0 and 1 ($0 \le h_i \le 1$). Any $h_i$ value may be considered highly influential (2) if it exceeds certain limits, although opinions vary about which expression to use:
$$\text{(a)}\;\; h_i > \frac{2p'}{n} \qquad \text{(b)}\;\; h_i > \frac{3p'}{n} \qquad (2)$$
where $p' = p + 1$ when the intercept is included in the number of terms in the regression model (note that p = the number of terms in a regression equation excluding the intercept; here, in the case of linear regression, p = 1 and p' = 2). The leverage values (Eq. 2a) for these 2 case 5s (Fig. 1, C and D) are 0.23 and 0.9, respectively. The value of 0.9 indicates that the 2nd case 5 is highly influential, because the limiting value for Eq. 2a is 0.8.
The sum of all leverages is as follows:
$$\sum_{i=1}^{n} h_i = p' \qquad (3)$$
Thus, the mean leverage is $p'/n$, and the criteria shown in Eq. 2 suggest examination of any leverage values greater than 2 to 3 times the mean.
Leverage may also be expressed as a leverage measure:
$$\frac{h_i}{1 - h_i} \qquad (4)$$
I here make the assumption that these definitions of leverage also apply to the Deming regression model.
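As an illustration of Eqs. 1 to 3, the following minimal Python sketch computes the leverages for a simple linear regression and compares them with the limiting value of 0.8 quoted above for n = 5 and p' = 2. It assumes that NumPy is available. The function name `leverage` and the 5 x values are hypothetical; they are not the data plotted in Fig. 1.

```python
import numpy as np

def leverage(x):
    """Leverage of each case in a simple linear regression with an intercept."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()                       # distance of each x from the mean of x
    return 1.0 / x.size + dev**2 / np.sum(dev**2)

# Hypothetical x values: 4 clustered cases and 1 case far from the mean.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = leverage(x)

n = len(x)
p_prime = 2                                  # slope plus intercept
cutoff = 2 * p_prime / n                     # 0.8 when n = 5 and p' = 2
flagged = h > cutoff                         # True for cases exceeding the cutoff
total = h.sum()                              # equals p' (sum of all leverages)
```

In this sketch the case whose x value equals the mean (x = 4) has the minimum leverage of 1/n = 0.2, and only the remote case exceeds the cutoff.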
Residuals
For the ordinary linear regression equation, residuals (Fig. 1, E and F) are defined as:
$$e_i = y_i - \hat{y}_i \qquad (5)$$
where $\hat{y}_i$ indicates the fitted value of y. Note that e is measured on the ordinate (i.e., along the y axis).
Residuals can be standardized to correct for unequal variance caused by unequal leverages by dividing each residual by an estimate of its SD (3), which gives the standardized residual ($r_i$) as follows:
$$r_i = \frac{e_i}{S_{y|x}\sqrt{1 - h_i}} \qquad (6)$$
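The calculation of crude and standardized residuals for ordinary linear regression can be sketched in Python as follows. This is a minimal sketch under stated assumptions: NumPy is available, the SD of the residuals is estimated with n - p' degrees of freedom (an assumption of this sketch), and the function name and the 5 data points are hypothetical rather than the data of Fig. 1.

```python
import numpy as np

def standardized_residuals(x, y):
    """Return crude residuals, leverages, and standardized residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p_prime = x.size, 2                     # p' = slope plus intercept
    slope, intercept = np.polyfit(x, y, 1)     # ordinary least squares line
    e = y - (intercept + slope * x)            # crude residuals, observed minus fitted
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)      # leverage of each case
    s = np.sqrt(np.sum(e**2) / (n - p_prime))  # SD of residuals (assumed n - p' df)
    r = e / (s * np.sqrt(1.0 - h))             # residual divided by its estimated SD
    return e, h, r

# Hypothetical data: 4 in-line cases plus a remote, out-of-line case 5.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.0, 2.0, 3.0, 4.0, 4.0]
e, h, r = standardized_residuals(x, y)
```

With these hypothetical values, case 5 has a smaller crude residual than case 4 but a larger standardized residual, because its high leverage pulls the fitted line toward it.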
The calculation of Deming residuals (edem) is slightly more complicated:
 | (7) |
where the signs of $e_{\mathrm{dem},i}$ are derived from the value of $e_i$ (Eq. 5), $\hat{x}_i$ and $\hat{y}_i$ are fitted values, and $\lambda$ is the ratio of the error variances of the x and y values (4).
The Deming standardized residual (rdem) is as follows:
 | (8) |
Combining leverages and residuals
Clearly, the results of linear regression are influenced by 2 factors: leverage and residuals. In 1977, R.D. Cook (5) proposed a measure, since called Cook's distance, that combined these factors. Cook's distance ($D_i$) is defined as follows.
$$D_i = \frac{r_i^2}{p'} \cdot \frac{h_i}{1 - h_i} \qquad (9)$$
$D_i$ combines the standardized residual (Eq. 6) with the leverage measure (Eq. 4) and the number of terms in the regression equation. Because the value of p' is fixed, the magnitude of D for any case is determined by 2 elements: the size of $r_i$ (reflecting the lack of fit of the proposed model at the ith case) and the size of $h_i$ (reflecting the location of $x_i$ relative to $\bar{x}$). A large value of either or both elements will lead to a large value of $D_i$. It should be noted that although Cook's distance is not a statistical test, an approximate numerical cutoff (Eq. 10) has been suggested (6).
$$D_i > \frac{4}{n - p'} \qquad (10)$$
Several alternatives to Cook's distance have been proposed (3)(6)(7), although for various reasons they are less favored. Stuart et al. (3) recommend that one or another of these indices should always be examined.
The generation of Cook's distance values for a data set might seem daunting, but it should be realized that this capability is available in many statistical programs, such as SPSS, SAS, Minitab, S-Plus, R (8), and Arc (9)(10). The latter 2 programs are freely available. It is of interest that the 3 statistical programs with clinical chemistry applications (Analyze-it, MedCalc, and CBstat) do not (yet) provide this capability. The Deming Cook's distance equivalent is obtained by replacing $r_i$ by $r_{\mathrm{dem},i}$ (Eq. 8), although software for calculating these values is not currently available.
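For readers who wish to check the output of such programs, Cook's distance for ordinary linear regression can also be computed directly from the standardized residual, the leverage measure, and p', as described above. The following Python sketch is one possible way to do so. It assumes NumPy, an SD of residuals estimated with n - p' degrees of freedom, and hypothetical data and function names; it does not cover the Deming equivalent.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p_prime = x.size, 2                      # p' = slope plus intercept
    slope, intercept = np.polyfit(x, y, 1)      # ordinary least squares line
    e = y - (intercept + slope * x)             # crude residuals
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)       # leverages
    s2 = np.sum(e**2) / (n - p_prime)           # residual variance (assumed n - p' df)
    r2 = e**2 / (s2 * (1.0 - h))                # squared standardized residuals
    return (r2 / p_prime) * (h / (1.0 - h))     # residual term times leverage measure

# Hypothetical data: 4 in-line cases plus a remote, out-of-line case 5.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.0, 2.0, 3.0, 4.0, 4.0]
d = cooks_distance(x, y)
most_influential = int(np.argmax(d)) + 1        # 1-based case number
```

With these hypothetical values, case 5 has by far the largest distance even though its crude residual is not the largest, which is the pattern illustrated in Fig. 1, B, D, and F.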
An example
I have chosen to use a medical data set (11), used by Altman in his textbook (12), to illustrate the factors that affect the values of Cook's distance (see Figs. 2 and 3 in the online Data Supplement).
Conclusion and a proposal
Reporting only residual values (which the journal requires) does not always identify cases that influence the resulting regression line. The more thorough approach, with the use of Cook's distance (or its equivalents), provides much more insight into the regression model. Many current texts illustrate the use and value of estimating Cook's distance in linear regression (9)(13)(14)(15).
Accordingly, I propose that the provision of Cook's distance values, or of another similar measure, should be encouraged when regression analysis is reported. Such information is particularly valuable when the sample size is small.
Acknowledgments
I thank Drs. Sanford Weisberg and Dennis Cook (both at the University of Minnesota) and the technical support staff at Insightful Corp. for helpful advice.
Footnotes
1 Dr. Henderson died on June 29, 2006. 
References
1. Cook RD, Weisberg S. Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
2. Crawley MJ. Statistical Computing: An Introduction to Data Analysis using S-Plus. Chichester, United Kingdom: John Wiley & Sons, Ltd., 2002.
3. Stuart A, Ord JK, Arnold S, eds. Kendall's Advanced Theory of Statistics. Volume 2A: Classical Inference & the Linear Model, 6th ed. London: Arnold, 1999.
4. Linnet K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat Med 1990;9:1463-1473.
5. Cook RD. Detection of influential observations in linear regression. Technometrics 1977;19:15-18.
6. Fox J. Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks, CA: SAGE Publications, Inc., 1997.
7. Cook RD, Weisberg S. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 1980;22:495-508.
8. R Development Core Team. The R Project for Statistical Computing. http://www.R-project.org (accessed November 2, 2004).
9. Cook RD, Weisberg S. Applied Regression Including Computing and Graphics. New York: John Wiley & Sons, Inc., 1999.
10. Cook RD, Weisberg S. Arc Software. http://www.stat.umn.edu/arc/software.html (accessed August 16, 2006).
11. Thuesen L, Christiansen JS, Falstie-Jensen N, Christensen CK, Hermansen K, Mogensen CE, et al. Increased myocardial contractility in short-term type 1 diabetic patients: an echocardiographic study. Diabetologia 1985;28:822-826.
12. Altman D. Practical Statistics for Medical Research. London: Chapman and Hall, 1991.
13. Strike PW. Statistical Methods in Laboratory Medicine. Oxford: Butterworth-Heinemann Ltd., 1991.
14. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research, 4th ed. Oxford: Blackwell Science Ltd., 2002.
15. Weisberg S. Applied Linear Regression, 3rd ed. Hoboken, NJ: Wiley-Interscience, 2005.