1 Department of Clinical Chemistry, Vejle Sygehus, DK-7100 Vejle, Denmark.
a Author for correspondence. Fax + 45 65 41 19 11; e-mail phy{at}imbmed ou.dk.
| Abstract |
|---|
| Introduction |
|---|
These principles of comparing analytical performance with performance criteria, however, have not been universally accepted, and recent publications have criticized the misuse of correlation coefficients (4) and overinterpretation of regression lines in method comparison (5)(6)(7). Bland and Altman (4) recommended the difference plot (or bias plot or residual plot) as an alternative approach for method comparison. On the abscissa they used the mean value of the methods to be compared, to avoid regression towards the mean, and on the ordinate they plotted the calculated difference between measurements by the two methods. They further estimated the mean and standard deviation of differences and displayed horizontal lines for the mean and for ±2 × the standard deviation. However, they did not include an objective criterion for acceptability. Recently, Hollis (5) recommended difference plots as the only acceptable method for method comparison studies for publication in Annals of Clinical Biochemistry, but without specifying criteria for acceptability.
However, a few difference plots with evaluation of acceptability according to defined criteria have been published, e.g., in evaluation of estimated biological variation compared with analytical imprecision (8), and in external quality assessment of plasma proteins for the possibilities of sharing common reference intervals (9).
The scarcity of such publications may reflect how plotted data are interpreted rather than a strict choice between scatter-plot and difference plot, as Stöckl (10) discussed recently. Investigators seem to rely too much on regression lines and r-values, without doing the equally important interpretation of the data points of the plot. This is becoming more and more disadvantageous with the increasing number of Reference Methods available for comparison with field methods, because in these cases, it is not a question of finding some relationships, but simply of judging the field method to be acceptable or not.
NCCLS has recently published guidelines for method comparison and bias estimation by using patients' samples (11), where both scatter-plots and bias plots are advised. The document also recommends plotting of single determinations as mean values and stresses the need of visual inspection of data. Further, comparison with performance criteria is recommended, but these criteria are not specified and they are not used in the graphical interpretation. Recently, Houbouyan et al. (12) used ratio plots in their validation protocol of analytical hemostasis systems, where they used a preset, but arbitrarily chosen, acceptance limit of inaccuracy of 15%.
In the following, we will use the difference plot (or bias plot) in combination with simple statistics for the principal judgment of the identity or acceptability of a field method. The difference plot makes it easier to apply the concept; in principle, however, the same evaluations could be performed for a scatter-plot in relation to the line of identity (y = x).
The aim of this contribution is to pay attention to the hypothesis of identity and the concept of acceptable analytical quality in method comparison, especially when one of the methods is a Reference Method.
| basic considerations |
|---|
Further characteristics of the methods are the costs, the complexity, the equipment, the time used for production of the results, and so forth; the field method is generally cheap and well suited for routine work, whereas the Reference Method is usually applicable only in a few competent and specially equipped laboratories.
The two approaches to the comparison are:
1. Identity. The results from the field method do not deviate from the Reference Method results by more than the inherent imprecision of both methods.
2. Analytical quality specifications. The results from the field method do not deviate from the Reference Method results by more than the acceptance specifications do from the analytical goals.
In both cases the null hypothesis is that the measured differences for all samples are zero. In the first case, the acceptance limits are defined by the inherent analytical imprecisions, and in the latter case the acceptance limits are defined by the analytical quality specifications.
By assuming the ideal situation, the theoretical limits for testing the null hypothesis can be drawn in a difference plot, i.e., a mean difference [$\mathrm{mean}(\Delta)$] equal to 0 and a standard deviation of differences [$\sigma(\Delta)$] calculated from the imprecisions of the two methods. The ultimate hypothesis, then, is that the points are distributed within these bounds. If they fall outside of these bounds, the hypothesis is rejected. Alternatively, the performance characteristics are tested against defined analytical quality specifications.
| acceptance limits defined by inherent analytical imprecision |
|---|
$\sigma^2(\Delta) = \sigma_F^2 + \sigma_R^2$ (the theoretical $\sigma$ values for the two methods being estimated independently for the two methods of the comparison, or estimated from replicate measurements during the comparison). When means of duplicates are used, then $\sigma$ should be divided by $\sqrt{2}$.
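The variance addition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original evaluation: the function name `sd_of_differences` and the `replicates` parameter are assumptions introduced here, and the inputs 3.0 and 4.0 are arbitrary values chosen so that the result is easy to check by hand.

```python
import math


def sd_of_differences(sd_field, sd_reference, replicates=1):
    """Theoretical SD of differences between two independent methods.

    sd_field and sd_reference are the analytical SDs of single results.
    replicates is the number of results averaged per method
    (1 = single determinations, 2 = means of duplicates).
    """
    # Variances of independent methods add.
    sd_single = math.sqrt(sd_field ** 2 + sd_reference ** 2)
    # Averaging n replicates in both methods divides the SD by sqrt(n).
    return sd_single / math.sqrt(replicates)


# Arbitrary illustration values (not from the article).
sd_single = sd_of_differences(3.0, 4.0)
sd_duplicates = sd_of_differences(3.0, 4.0, replicates=2)
```

With these inputs the single-determination value is 5.0 and the value for means of duplicates is 5.0 divided by the square root of 2.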
When the two methods are identical, the expectation is that ~68% of differences will be distributed symmetrically around 0 within $0 \pm 1\,\sigma(\Delta)$, and 95% will be within $0 \pm 1.96\,\sigma(\Delta)$. If the distribution of differences fits these criteria, then it is not possible to find any difference between results from the two methods within the inherent imprecision. If this is not the case, then the methods are not identical.
Before the measured points are plotted on any figure, the hypothesis can be illustrated in a difference plot, as shown in Fig. 1A for the example of S-creatinine. The horizontal line y = 0 illustrates the hypothesis of $\mathrm{mean}(\Delta) = 0$; the other lines indicate within $0 \pm 1\,\sigma(\Delta)$ for 68% and within $0 \pm 1.96\,\sigma(\Delta)$ for 95% of the differences, respectively. The outer lines indicate the 95% prediction interval for the expected distribution of points. In this example, $\sigma_F = 3.10$ and $\sigma_R = 0.50$ and, therefore, $\sigma(\Delta) = 3.15$ µmol/L.
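As a rough sketch of how the hypothesis lines of Fig. 1A could be computed before any data are plotted, the snippet below takes the standard deviation of differences quoted in the text (3.15 µmol/L) and returns the horizontal lines for the 68% and 95% bands. The function name `hypothesis_lines` is hypothetical and only serves this illustration.

```python
def hypothesis_lines(sd_diff, z_values=(1.0, 1.96)):
    """Return {z: (lower, upper)} limits around the line y = 0.

    sd_diff is the theoretical SD of differences; each z gives one
    pair of horizontal lines for the difference plot.
    """
    return {z: (-z * sd_diff, z * sd_diff) for z in z_values}


# SD of differences as quoted in the text for the S-creatinine example.
lines = hypothesis_lines(3.15)
lower_95, upper_95 = lines[1.96]
```

Under these assumptions the outer (95%) lines lie at about ±6.2 µmol/L and the inner (68%) lines at ±3.15 µmol/L.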
It is educational to describe the hypothesis before plotting the points, because most investigators will start interpretation of the points in the form of functional relationships and thereby forget about the hypothesis.
With the hypothesis in mind, one can now plot the data points as shown in Fig. 1B. The data points are generated for Reference Method values between 50 and 150 µmol/L and computer-simulated gaussian-distributed differences based on $\mathrm{mean}(\Delta) = -0.5$ µmol/L and $\sigma(\Delta) = 3.00$ µmol/L; from these simulated data the calculated mean(d) was -0.84 µmol/L and s(d) was 3.27 µmol/L. The data points are distributed roughly according to the hypothesis, with 16 points (70%) within $0 \pm 1\,\sigma(\Delta)$ and 21 points (91%) within $0 \pm 2\,\sigma(\Delta)$, leaving 2 points (9%) outside the limits; because these two points seem not to deviate too much from the general distribution, the conclusion could be that the finding of just 2 points outside (and close to) the 95% prediction interval is expected and acceptable and, therefore, that the field method is indistinguishable from the Reference Method within the analytical imprecision, so the evaluation can stop. Statistically, the mean difference can be evaluated by a t-test and the distribution of differences by an F-test.
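The counting and the two test statistics mentioned above can be sketched as follows. This is an illustration under stated assumptions: `summarize_differences` is a hypothetical helper, the five differences are made-up numbers, and the snippet only returns the statistics (a one-sample t statistic against a mean of 0 and the ratio of observed to theoretical variance); looking up the corresponding critical values is left to the reader.

```python
import math
import statistics


def summarize_differences(differences, sd_theory):
    """Compare observed differences with the theoretical SD of differences."""
    n = len(differences)
    mean_d = statistics.mean(differences)
    s_d = statistics.stdev(differences)  # sample SD, n - 1 in the denominator
    return {
        "n": n,
        "mean_d": mean_d,
        "s_d": s_d,
        # Number of points inside 0 +/- 1 SD and 0 +/- 2 SD (theoretical).
        "within_1sd": sum(abs(d) <= sd_theory for d in differences),
        "within_2sd": sum(abs(d) <= 2 * sd_theory for d in differences),
        # One-sample t statistic for H0: mean difference = 0 (n - 1 df).
        "t": mean_d / (s_d / math.sqrt(n)),
        # Observed variance relative to the theoretical variance.
        "variance_ratio": s_d ** 2 / sd_theory ** 2,
    }


# Made-up differences, for illustration only.
summary = summarize_differences([-3.0, -1.0, 0.0, 1.0, 3.0], sd_theory=2.0)
```

Applied by hand to the values quoted in the text (mean(d) = -0.84, s(d) = 3.27, n = 23, theoretical SD 3.15), the same formulas give a t statistic of about -1.23 and a variance ratio of about 1.08.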
This approach is very narrow, and some uncertainty related to unknown factors may be taken into account. Such variations could be the variation between the two tubes/vials with serum from the same individual or the underestimation of $\sigma_F$. Therefore, possible additional sources of uncertainty always should be taken into account when appropriate. However, the design of a method comparison has to be carefully planned so as to exclude additional uncertainties. Further, any addition of "acceptable" uncertainty should be well thought through and handled with caution.
For the experienced scientist, interpreting the difference plot is easy. A more objective criterion, however, for graphical validation of the distribution of points, and especially the more extreme points, is to apply the concept of tolerance intervals, where the standard deviate (z or c) is replaced by a tolerance factor, k, with a value dependent on the percentage of points (here 95%) and the confidence with which this percentage should be obtained. The k-value is determined by the assumptions about the new distribution, whether the mean or the standard deviation is unknown, or both. The k-value further depends on the number of points, n (15)(16). Although we have a hypothesis about mean difference and standard deviation, these figures are unknown in practice, so the k7 for unknown mean and standard deviation may be the most relevant to use. For n = 23 and 95% confidence for 95% of the points, the tolerance factor, k7, is 2.67 and the tolerance interval is $0 \pm 2.67\,\sigma(\Delta)$. This is illustrated in Fig. 1C, where all points are distributed within the chosen tolerance interval.
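A minimal sketch of the tolerance-interval calculation is given below. The k7 factors are copied from the article's footnote, plus the n = 23 value quoted in the text; the dictionary name `K7` and the function `tolerance_limits` are assumptions made for this illustration, and no interpolation is attempted for sample sizes that are not tabulated.

```python
# k7 factors (95% confidence, 95% of points) as listed in the footnote,
# plus the n = 23 value quoted in the text.
K7 = {10: 3.38, 20: 2.75, 23: 2.67, 30: 2.55, 50: 2.38, 70: 2.30, 100: 2.23}


def tolerance_limits(sd_diff, n, center=0.0):
    """Return (lower, upper) tolerance limits for n points.

    Raises KeyError when n is not in the small table above.
    """
    k = K7[n]
    return center - k * sd_diff, center + k * sd_diff


# S-creatinine example: theoretical SD of differences 3.15 µmol/L, n = 23.
lower, upper = tolerance_limits(3.15, 23)
```

For the example this gives limits of roughly ±8.4 µmol/L, wider than the 95% prediction interval because the factor allows for the uncertainty of a small sample.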
The present approach of theoretical expected distribution is compared with the approach of Bland and Altman (4) in Fig. 1D by inserting the new lines determined by mean(d) ± 2 s(d) estimated from the data points.
The difference between the present concept and the Bland and Altman concept is clear from Fig. 1D. Accordingly, we illustrate the 95% prediction interval before any data points are applied, in contrast to Bland and Altman, who simply illustrate the statistics of the points. In this example the mean values are clearly different, whereas the standard deviations are rather close to each other.
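The contrast between the two sets of lines can be made concrete with the numbers quoted in the text. The sketch below is only an illustration; the function names are hypothetical, and the inputs are the simulated-data statistics (mean(d) = -0.84, s(d) = 3.27 µmol/L) and the theoretical standard deviation of differences (3.15 µmol/L).

```python
def bland_altman_limits(mean_d, s_d, factor=2.0):
    """Limits estimated from the data: mean(d) +/- factor * s(d)."""
    return mean_d - factor * s_d, mean_d + factor * s_d


def prediction_limits(sd_theory, z=1.96):
    """Limits fixed in advance by the hypothesis: 0 +/- z * theoretical SD."""
    return -z * sd_theory, z * sd_theory


data_based = bland_altman_limits(-0.84, 3.27)
hypothesis_based = prediction_limits(3.15)
```

The data-based lines come out at about -7.4 and +5.7 µmol/L, whereas the hypothesis-based lines are symmetric at about ±6.2 µmol/L: the widths are similar, but only the second pair is centered on 0.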
To illustrate the relations to xy plots, we first add to Fig. 1B 11 points (triangles), as shown in Fig. 2 (top). The specimens producing these points are assumed to contain some "nonspecific" components [in the S-creatinine example, perhaps the specimens are from diabetics, where (e.g.) glucose could result in nonspecific reactions by the Jaffe methods]. In the difference plot they separate clearly from the other points, and the mean difference and standard deviation change to +0.16 and 4.08 µmol/L, respectively. In the xy plot (Fig. 2, bottom), where the points should be related to the line of identity (y = x), it is difficult to see the difference between the two sets of points. If we turn to calculation of r-values, r decreases from 0.993 to 0.991, the slope of the regression line changes from 1.020 to 1.025, and the intercept changes from -2.54 to -1.17 µmol/L.
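The insensitivity of r compared with the standard deviation of differences can be reproduced with a small synthetic data set. The data below are invented for illustration and are not the data of Fig. 2: eleven "clean" points with differences alternating between +2 and -2 µmol/L, and four added points with a constant positive bias of 8 µmol/L standing in for specimens with nonspecific components. The helper names are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sd_of_paired_differences(x, y):
    """Sample SD of the differences y - x."""
    return statistics.stdev([b - a for a, b in zip(x, y)])


# Invented data: reference values x and field-method values y (µmol/L).
x_clean = list(range(50, 151, 10))
y_clean = [x + (2 if i % 2 == 0 else -2) for i, x in enumerate(x_clean)]
x_extra = [55, 85, 115, 145]
y_extra = [x + 8 for x in x_extra]  # biased "nonspecific" specimens

r_before = pearson_r(x_clean, y_clean)
r_after = pearson_r(x_clean + x_extra, y_clean + y_extra)
sd_before = sd_of_paired_differences(x_clean, y_clean)
sd_after = sd_of_paired_differences(x_clean + x_extra, y_clean + y_extra)
```

For these invented numbers r stays above 0.99 after the biased points are added, while the standard deviation of differences nearly doubles, which is the pattern the text describes.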
The information from difference plots and xy plots (when using the line of identity) is the same, but it is easier to expand the differences in the difference plot and the calculations of variances are simpler, whereas the 45° angle in the xy plot makes comparable calculations more difficult. This is also seen from Westgard et al. (2)(3), where simple calculations of (e.g.) bias are easier to interpret in combination with figures.
| acceptance limits defined by goals for analytical quality |
|---|
$\le 0.5\,CV_{\text{within-subject}}$ variation as proposed by Cotlove et al. (21), and that for analytical bias is $|B_{\text{analytical}}| \le 0.25\,CV_{\text{total biological}}$ variation (19). The EGE-Lab concept accepts both a maximum bias and a maximum imprecision simultaneously; the EQA-Organizer concept describes a functional relationship between the two in the form of a maximum allowable combination of imprecision and bias.
The latter is close to the original concept of Gowans et al. (19), defining the acceptable analytical percentage bias for S-creatinine as 2.8% (25) when imprecision is negligible. According to the EGE-Lab concept (23)(24), however, both a bias of 2.8% and an imprecision of 2.2% are acceptable simultaneously. This means that for single determinations 0 ± (bias + 1.65 × imprecision) is acceptable (26); i.e., 95% of the single points must lie within the limits of 0 ± (2.8% + 1.65 × 2.2%) = 0 ± 6.4%, as illustrated in Fig. 3 (left). This criterion is fulfilled as shown in Fig. 3 (middle). The concept, however, is one-sided, i.e., is valid only in one direction. Therefore, the standard deviation of differences should be judged against the imprecision criterion separately. The standard deviation of the field method is 3.1 µmol/L and the CV = 3.9%, which exceeds the imprecision specification of 2.2%.
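The arithmetic of the combined criterion, and the separate check of imprecision, can be sketched as follows. The function names are hypothetical; the numbers (bias 2.8%, imprecision 2.2%, observed CV 3.9%) are those quoted in the text.

```python
def single_point_limit(bias_pct, cv_pct, z=1.65):
    """Half-width (in %) of the acceptance band for single determinations."""
    return bias_pct + z * cv_pct


def meets_imprecision_spec(observed_cv_pct, spec_cv_pct):
    """Separate check of the imprecision against its own specification."""
    return observed_cv_pct <= spec_cv_pct


limit = single_point_limit(2.8, 2.2)              # about 6.4 %
imprecision_ok = meets_imprecision_spec(3.9, 2.2)  # False for the example
```

Keeping the two checks separate mirrors the point made above: the points can lie inside the ±6.4% band while the imprecision alone still fails its specification.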
Figure 3 (right) illustrates an example of acceptable analytical imprecision and bias. The CV is 2.1% and the estimated bias (mean difference) is +1.3 µmol/L (95% confidence interval, 0.6 to 2.0 µmol/L). Note that the mean difference is different from 0 but is acceptable according to the criterion.
For the purpose of method comparison, the value for maximum allowable bias might be expanded because of the uncertainty (confidence interval) of the Reference Method. This cannot be seen from the actual comparison, but because the Reference Method is allowed to have some uncertainty, then this must be allowed also for the field method. We propose using the factor 1.2, in light of a recent concept that requires for Reference Methods a total error of <0.2 times that of the routine method (27). In the present case, the acceptance limits for bias should thus be 0 ± 3.36%, as also used in the example below.
| two examples utilizing comparison data |
|---|
Practical example 1.
Based on the duplicate analyses performed, analytical imprecision is calculated within the interval 50 to 150 µmol/L for both the Reference Method (HPLC) and the field method, giving $\sigma_R = 0.568$ µmol/L and $\sigma_F = 0.791$ µmol/L. When means of duplicates are used for the comparison, the calculated theoretical $\sigma(\Delta)$ should be divided by $\sqrt{2}$, giving an expected distribution of 95% (the 95% prediction interval) of the points within $0 \pm 2\,\sigma(\Delta)$ = 0 ± 1.4 µmol/L. Further, the expanded allowable bias of 3.36% is used for the analytical quality specifications according to both EGE-Lab and EQA-Organizers concepts.
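The 0 ± 1.4 µmol/L interval can be checked with a short calculation. This is a sketch using the imprecision values quoted in the text; the function name `prediction_half_width` is an assumption made for the illustration.

```python
import math


def prediction_half_width(sd_field, sd_reference, replicates=2, z=2.0):
    """Half-width of the expected band for differences of replicate means."""
    sd_single = math.sqrt(sd_field ** 2 + sd_reference ** 2)
    sd_means = sd_single / math.sqrt(replicates)  # means of duplicates
    return z * sd_means


half_width = prediction_half_width(0.791, 0.568)
```

The intermediate values are about 0.97 µmol/L for single determinations and about 0.69 µmol/L for means of duplicates, so the band is about ±1.4 µmol/L as stated.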
The graphical evaluations are shown in Fig. 4 (top). Here, the calculated distribution is considerably broader than the one assumed from the analytical point of view, whereas the measured mean bias is within the limits of acceptance of the bias because the mean (-1.6) and 95% confidence interval (-0.5 to -2.7 µmol/L) are within the 3.36% allowable bias. Further, the points are distributed within the total EGE-Lab criteria with only one real outlier. The difference between the present concept of 95% prediction interval and the Bland and Altman description of the actual points is clear from the Figure.
At first glance, the field method should be acceptable, with CV = 1% and bias <3.36%, but the distribution of points (Fig. 4, top) reveals a much broader distribution (CV = 5%), which emphasizes an unknown uncertainty. The problem is further underlined by the adding of single determinations to the measurements in the Figure, illustrating the reproducibility of the individual differences. This uncertainty is close to 5% and may originate not only from vial-to-vial variation but also from aberrant-sample bias, whether from nonspecific reactions or interference in the field method. Thus the difference plot and calculation of the standard deviation of differences are tools to disclose aberrant-sample bias in field methods.
Fuentes-Arderiu and Fraser (30) have proposed that the combined effects of imprecision and interference should be used in the concept of specifications for imprecision, but the problem has not been dealt with in either the EGE-Lab or EQA-Organizer concepts.
Practical example 2.
In the other example (Fig. 4, bottom), all points are displaced from the acceptance area, showing a considerable mean bias (~20 µmol/L) and also considerable uncertainty from aberrant-sample bias. The calculated imprecision, based on assays of duplicates, was ~1%, but the difference plot reveals the errors clearly.
The CLIA criteria are much wider than the European recommendations but, as total error specifications, are easy to apply in the plot. For concentrations <175 µmol/L the acceptable deviation is ±26.5 µmol/L (±0.3 mg/dL), and ±15% at higher concentrations. The upper line of the CLIA criterion is shown in Fig. 4 (bottom). Because 7 of the 54 points are outside the line, the method is also unacceptable from the standpoint of proficiency testing.
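Because the CLIA criterion is a total error limit, it translates directly into a line in the difference plot. The sketch below encodes the limit as described in the text (±26.5 µmol/L below 175 µmol/L, ±15% at higher concentrations) and counts points outside it; the function names and the three example points are hypothetical.

```python
def clia_limit_umol_l(concentration, threshold=175.0, absolute=26.5, relative=0.15):
    """Allowable deviation (µmol/L) for S-creatinine at a given concentration."""
    if concentration < threshold:
        return absolute
    return relative * concentration


def count_outside(reference_values, differences):
    """Number of points whose absolute difference exceeds the limit."""
    return sum(
        abs(d) > clia_limit_umol_l(x)
        for x, d in zip(reference_values, differences)
    )


# Invented points: (reference value, difference) in µmol/L.
n_outside = count_outside([100.0, 100.0, 200.0], [20.0, 30.0, -31.0])
```

In the article's second example the same kind of count gives 7 of 54 points outside the limit.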
| general discussion |
|---|
As long as the purpose is to find the best functional relationship between two methods in order to correct one with the other, then an xy plot and calculation of the regression line with an estimation of the scatter via $s_{y|x}$ may be most relevant, with visual inspection of the scatter and a residual plot. The present task, however, is not to find a functional relationship between two methods, but to judge a field method in relation to a Reference Method.
When a field method is compared with a Reference Method for acceptability according to certain criteria, whether analytical or biological, then the visual inspection of all data is essential. Whether one uses a simple xy plot or a difference plot is not critical, as long as the area of interest is expanded and the single points are assessed according to the hypothesis of identity between the two methods. The hypothesis is that the measured values are identical (and not that they are unrelated, which is the basic hypothesis for correlation studies), which means that the hypothesis is described in an xy plot as the line of identity (y = x) and in the difference plot as the line y = 0. When a ratio plot is more appropriate than a difference plot, e.g., when analytical CV is close to being constant, the same evaluations can be performed.
The advantages of difference plots (and ratio plots) are keeping the hypothesis of identity in mind and the ease of expanding the difference ordinate according to the investigator's purpose. The power of the graphical illustration in Figs. 1A and 3 (left) lies in the simplicity and the clear definition of the hypotheses.
For the experienced interpreter, most situations can be evaluated by visual inspection of the plots, whether a difference plot or an xy plot. If more objective criteria are wanted, calculations of mean difference with confidence intervals are a powerful tool, as is a table of k7-values1 for estimation of tolerance intervals.
Most important, however, is the inspection of the distribution of the difference points, especially when samples expected to have matrix effects are marked, and calculation of the standard deviation of differences. When this exceeds the estimated analytical imprecision, it is an indication of aberrant-sample bias. In principle, the r-value from correlation between x and y reflects this, but in practice the r-value is insensitive, as illustrated in Fig. 2. Further, calculation of the standard deviation of differences gives an estimate of aberrant-sample bias compared with the theoretical imprecision. The same information can be obtained from the $s_{y|x}$ estimate from regression analysis.
The plotting of single determinations can give information about imprecision and aberrant-sample bias as well and, if needed, a functional curve can be calculated and drawn. In this context we mention that replicate measurements always should be performed in this type of comparison, and that specimens should be stored for evaluation of possible outliers.
Krouwer and Monti (29) presented a graphical method for evaluation of laboratory assays (a mountain plot). They computed the percentile for each ranked difference between the two methods, and by "turning" at the 50th percentile produced a histogram-like function (the mountain). This method is relevant for detecting large infrequent errors (differences) but lacks the aspect of concentration relationship. These investigators, therefore, recommend use of their plot together with difference plots. Introduction of analytical quality specifications in the mountain plots may be useful in method evaluations.
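The folding step of the mountain plot can be sketched in a few lines. This is an illustration only: the percentile convention used here, 100 × (rank - 0.5)/n, is an assumption and may differ from the one used by Krouwer and Monti, and the function name `mountain_plot_points` is hypothetical.

```python
def mountain_plot_points(differences):
    """Return (difference, folded percentile) pairs for a mountain plot.

    Percentiles above 50 are "turned" (100 - percentile), so the
    curve rises to a peak at the median difference and falls again.
    """
    ordered = sorted(differences)
    n = len(ordered)
    points = []
    for rank, d in enumerate(ordered, start=1):
        percentile = 100.0 * (rank - 0.5) / n  # assumed plotting position
        folded = percentile if percentile <= 50.0 else 100.0 - percentile
        points.append((d, folded))
    return points


# Invented differences, for illustration only.
points = mountain_plot_points([3.0, -1.0, 0.0, 1.0, -3.0])
```

Plotting the folded percentile against the difference gives the histogram-like shape; adding vertical lines for the analytical quality specifications would be one way to bring acceptance limits into such a plot.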
Bland and Altman have pointed to the presentation of difference plots, where they recommend mean values of both methods on the abscissa (4); however, the risk of regression towards the mean is negligible in studies where field methods are compared with Reference Methods, because the results from the Reference Method (used on the abscissa) are assumed to have negligible error. They further calculate and present the standard deviation of the measured data, which is relevant information but not related to the hypothesis of identity.
This form of graphical testing of the hypothesis of identity has been used for biological data (8), where the standard deviation of measured differences was compared with the analytical imprecision. A more stringent method of testing measured values from field methods against target values has been applied for plasma proteins in The Nordic Protein Project (9). Here, the target values of serum pools were assigned from the European Community Bureau of Certified Reference Material, CRM 470 (31), by the method recommended by IFCC (32), and were used for the abscissa in a plot with acceptance lines according to acceptable bias (9) and with the measurements illustrated by the mean difference with a 90% confidence interval.
Another graphical method of prediction of single measurements of clinical data has been published for serial measurements of International Normalized Ratios of prothrombin times (33). Here, the differences between consecutive measurements in patients were compared with the expected variation estimated from patients under steady-state conditions. The abscissa was used for the latest result, resulting in a regression towards the mean, which was considered acceptable for the purpose (i.e., not to investigate correlations). The usefulness of the nomogram was improved by adding vertical lines for the therapeutic interval.
The goals used for acceptance are relevant for the use of common reference intervals (19)(20) and have been recommended by two European groups (23)(24)(25), with different consequences for the acceptance. This problem is related to the phenomenon of aberrant-sample bias (matrix effects) and has not been fully clarified by the proposed goal from either EGE-Lab or EQA-Organizers; Fuentes-Arderiu and Fraser (30), however, have proposed that the aberrant-sample bias be included in the precision goal. Other analytical goals have been postulated (17)(18)(34) and may be relevant for other evaluations. Goals based on biological data, however, are general in nature (related to common reference intervals and monitoring of patients) and are not restricted to a specific clinical application.
The CLIA criteria for S-creatinine are ±0.3 mg/dL (3 mg/L, or 26.5 µmol/L) or ±15% for concentrations >175 µmol/L (28). These are criteria for total error, but they are very wide compared with the European recommendations. The CLIA criteria are intended for use with proficiency testing (and therefore need to be wide), whereas the European recommendations are so-called educational criteria, which relate directly to the desirable performance criteria for optimum monitoring of patients and the sharing of common reference intervals within geographical areas with populations that are homogeneous for the quantity of analyte. The criteria proposed by Ehrmeyer et al. (35) for minimum intralaboratory performance characteristics to pass CLIA (CV <33% and bias <20%, proficiency testing criteria) can be applied as well as the European criteria, but for the latter the total error will be determining. The CLIA criteria may be applied even better in difference plots, given the total error concept.
The validity of the analytical conclusions of this type of evaluation relies on the Reference Method chosen. It must be correct and specific and so forth, with negligible imprecision; otherwise, the conclusions of the comparison will be weakened according to any possible flaws of the Reference Method.
In conclusion, we find the visual inspection of plots to be essential for method comparison. When a field method is compared with a Reference Method, the hypothesis of identity within analytical imprecision or within stated analytical quality specifications should be applied. Both xy plots and difference plots are useful, but we find the difference plot is easier to handle and interpret and facilitates the calculations of uncertainty.
| Footnotes |
|---|
2 Laboratorium voor Analytische Chemie, Faculteit Farmaceutische Wetenschappen, Universiteit Gent, Harelbekestraat 72, B-9000 Gent, Belgium.
1 Some k7-values for 95% tolerance factors for the 95% interval: 3.38 for n = 10; 2.75 for n = 20; 2.55 for n = 30; 2.38 for n = 50; 2.30 for n = 70; and 2.23 for n = 100.