Clinical Chemistry 43: 608-614, 1997;
© 1997 American Association for Clinical Chemistry, Inc.
A pragmatic approach to estimating total analytical error of immunoassays
William A. Sadler^a, Murray H. Smith^1, Lynda M. Murray, and John G. Turner
Department of Nuclear Medicine, Christchurch Hospital, Private Bag, Christchurch, New Zealand.
^1 Department of Mathematics & Statistics, University of Canterbury, Christchurch, New Zealand.
^a Author for correspondence. Fax 64 3 364 0869.
Abstract
Several groups have recently commented on the need for more realistic
information on analytical performance of laboratory tests. The term
"total analytical error" is sometimes used in this context.
However, differing opinions have been expressed on how best to obtain
estimates of clinical assay error, as it would be perceived by
clinicians. We suggest a pragmatic definition of total analytical error
for immunoassays and describe our attempts to estimate it by simple
designs in the internal quality-control process. We use results over 29
months from a total serum thyroxine RIA. The most important error
sources were those related to calibration materials and operator
effects, errors not usually captured by short-term or snapshot
experiments.
Indexing terms: quality control; thyroxine; RIA
Introduction
Krouwer (1) showed that classical method
evaluations give an optimistic impression of assay performance because
estimates of error from various individual sources (e.g., drift, bias,
etc.) are generally not combined to give a picture of total error
(i.e., the error that might be observed by a clinician). His paper
tabulates several multifactor evaluation designs and describes models
and techniques for combining results of these experiments into an
estimate of total error. The emphasis on realism is especially relevant
in view of interest in setting clinical performance goals for
laboratory tests (e.g., ref. 2).
Unfortunately, the experimental approach described by Krouwer has some
complications when viewed from a clinical laboratory perspective.
First, some laboratories would be stretched to find resources to
perform multifactor experiments at suitably regular intervals for each
of many analytes. Resources in this context refer not only to reagents
and other consumables but also to staff time. Second, snapshot
experiments do not readily capture errors arising from reagent lot
changes, and especially changes of calibration materials. Neither do
they capture longitudinal effects that sometimes arise from aging of
reagents or lack of stability of calibration materials. Third, the bias
component of Krouwer's experimental designs cannot be properly applied
to most immunoassays because of a lack of reference methods.
An additional complication is that Krouwer's total error models
require quantification of nonspecific interferences. Unfortunately,
many substances well known to interfere in immunoassays
(autoantibodies, heterophile antibodies, drugs, etc.) cannot be easily
related to assay error in a quantitative way. The effects are usually
seen as a large departure from an expected result in a small proportion
of specimens, but the amount of interferent is usually
unknown. Lesser interference probably occurs frequently but is not
detected. It seems reasonable to divide error sources into two
components: those that systematically operate on all
specimens (usually measurable) and those that relate only to
particular specimens (interferences). Although unsatisfying,
it must be recognized that some errors can be realistically estimated
and some cannot. Thus, for pragmatic reasons, we restrict our
perception of total analytical error to that operating on
all specimens. Estimates must be immediately qualified by a
statement such as: "some results may have (markedly) different error
properties because of the presence of interfering substances in the
specimen." The qualifying statement can be augmented from knowledge
of which substances are known to interfere, the frequency of
occurrence, and the direction of the error. Preanalytical factors might
also be included in a qualifying statement.
The importance of Krouwer's message (1) cannot be
overstated, but we concluded that practical adjustments must be made to
the way total error is perceived and estimated (at least in the case of
immunoassays). In mid-November 1993 we modified the internal
quality-control (QC) scheme for a serum thyroxine (T4) RIA
in an attempt to directly monitor total analytical error on a
continuous basis (see Footnote 1).
We summarize here the QC results accumulated over
29 months, describe some methods of extracting information from the
data, and discuss the advantages and disadvantages of this particular
approach to obtaining a realistic picture of assay performance.
Materials and Methods
T4 RIA
Assay calibration specimens are prepared in a serum pool depleted
of endogenous thyroid hormone by six to seven 45-min incubations with
Amberlite IRA 400 (Cl-) ion-exchange resin (Sigma,
St. Louis, MO) at 45–50 °C. Free acid L-T4
(Sigma) is used to prepare a 350 nmol/L serum calibrator that is
serially diluted with the zero serum to yield additional calibrators,
280, 210, 140, 70, and 35 nmol/L. The assay reaction mixture consists
of (a) 25 µL of calibrator, QC, or clinical serum;
(b) 100 µL of 0.05 mol/L Tris-HCl buffer (pH 7.4)
containing 100 µg/100 µL of RIA-grade bovine serum albumin (Sigma),
300 µg/100 µL of 8-anilino-1-naphthalene sulfonic acid (Sigma), and
150 pg/100 µL of [125I]T4 (product NEX-111;
Du Pont, Wilmington, DE); and (c) 500 µL of magnetized
T4 antibody (product 472299; Ciba Corning, Medfield, MA).
The mixture is incubated at room temperature for 45 min before
separation of antibody-bound and free
[125I]T4 in a magnetic field. All
measurements are made in duplicate with the exception that four
replicates are used for the zero calibrator. Assay calibrators are
stored at -30 °C and replaced at ~6-monthly intervals at a time
that does not coincide with a change in any other reagent or QC
specimen.
QC specimens
Four QC specimens labeled H (high), N (normal), LN (low normal),
and L (low), with T4 concentrations of ~180, ~115,
~70, and ~30 nmol/L, respectively, are each prepared from one to
five clinical specimens. Before introduction, preliminary target values
are established for a new lot of QC specimens by assaying in six to 10
assay batches. QC specimens are stored at -30 °C in 400-µL
aliquots. Each aliquot is used for five working days, with daily
thaw/freeze cycles. QC specimens are replaced at ~6-monthly
intervals, at a time that does not coincide with a change in any other
reagent. QC specimens are assayed daily in the following order in
cycles of four assay batches:
1. Start H, N, LN, L
2. End H, N, LN, L
3. Start N, LN, H, L
4. End N, LN, H, L
where Start or End means they are positioned immediately before, or
immediately after, the clinical specimens (average batch size: 56
specimens during months 1–3 of this QC scheme, increasing to 69
specimens during months 27–29). The QC scheme includes a specimen
carryover factor for the low specimen because specimens are dispensed
for assay with a fixed glass capillary pipette (Model 325; Drummond
Scientific, Broomall, PA) with three distilled water flushes between
specimens. However, the LN:L and H:L ratios in the T4
RIA are only ~2:1 and ~6:1, respectively, and independent analysis
showed the carryover effect to be negligible (best estimate <1:1300).
Carryover effects are further reduced in this assay because they act
only on the first of adjacent duplicates. The carryover factor was
included in the QC design to establish it as standard practice in this
laboratory, but it is also consistent with the philosophy of attempting
to capture all error sources, even those of a trivial nature. It may
have greater importance in other assay systems.
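The four-batch rotation described above can be written down as a small lookup, which also makes it easy to see which specimen is dispensed immediately before the low QC specimen in each batch. This is only an illustrative sketch in Python: the names `QC_CYCLE`, `qc_layout`, and `preceding_specimen` are hypothetical, and the sketch assumes the cycle simply repeats from batch 1 onwards.

```python
# Four-batch QC rotation described in the text.
# Each entry: (position relative to clinical specimens, order of QC specimens).
QC_CYCLE = [
    ("start", ["H", "N", "LN", "L"]),
    ("end",   ["H", "N", "LN", "L"]),
    ("start", ["N", "LN", "H", "L"]),
    ("end",   ["N", "LN", "H", "L"]),
]


def qc_layout(batch_number):
    """Return (position, order) for a 1-based batch number.

    Assumes the cycle repeats every four batches, starting at batch 1.
    """
    if batch_number < 1:
        raise ValueError("batch_number must be >= 1")
    return QC_CYCLE[(batch_number - 1) % len(QC_CYCLE)]


def preceding_specimen(batch_number, specimen="L"):
    """Return the QC specimen dispensed immediately before `specimen`."""
    _, order = qc_layout(batch_number)
    idx = order.index(specimen)
    return order[idx - 1] if idx > 0 else None


for batch in range(1, 6):
    position, order = qc_layout(batch)
    print(batch, position, order, "L follows", preceding_specimen(batch))
```

In batches 1 and 2 the low specimen follows LN, and in batches 3 and 4 it follows H, which is how the design exposes the low specimen to the two carryover ratios mentioned in the text.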
Data analysis
Errors calculated from various subsets of data were summarized via
the three-parameter variance function (3)(4)
$\sigma^2(U) = (\beta_1 + \beta_2 U)^J$, where $\sigma^2(U)$ denotes
variance, $U$ denotes concentration, and $\beta_1$, $\beta_2$, and $J$
are the parameters. In evaluating error sources
at each QC level, the linear model used (which incorporates additive
effects) can be represented as

$$Y_{ijklmn} = \mu + Q_i + C_j + D_k + A_l + L_m + O_n + \beta_{1i} x_i + \beta_{2i} x_i^2 + \varepsilon_{ijklmn} \qquad (1)$$

where $Y_{ijklmn}$ is the concentration variable
(T4); $\mu$ is the overall mean parameter;
$Q_i$, $i$ = 1–5, are the effects of five lots of QC specimens;
$C_j$, $j$ = 1–5, are the effects of five lots of calibration specimens;
$D_k$, $k$ = 1–2, are the effects of the within-batch QC specimen
positioning (start vs end drift);
$A_l$, $l$ = 1–12, are the effects of 12 lots of antibody;
$L_m$, $m$ = 1–28, are the effects of 28 lots of [125I]T4;
$O_n$, $n$ = 1–7, are the effects of seven different operators;
$\beta_{1i}$, $i$ = 1–5, are linear regression coefficients for
batch-to-batch (longitudinal) drift specific to each QC specimen;
$\beta_{2i}$, $i$ = 1–5, are
corresponding quadratic regression coefficients;
$x_i$, $i$ = 1–5, are elapsed time
(days) since the introduction of each new QC lot; and
$\varepsilon_{ijklmn}$ is the
unmodeled error variable. All but the final levels of each
parameter were entered into the data with 0/1 dummy variables that
convert the model to a multiple regression (it is necessary to omit the
final level in each case to make these parameters identifiable).
Explanatory variables were subsequently orthogonalized by a standard
method (5). Regression parameters were estimated with
Statistica 5.0 for Windows (Statsoft, Tulsa, OK), although any multiple
regression program would suffice. A detailed example with a small
subset of the results and an MS-DOS computer program that
orthogonalizes up to 80 explanatory variables can be obtained on
request from the first author.
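As an illustration of how the three-parameter variance function at the start of this subsection translates into an imprecision profile, the following Python sketch evaluates variance as (ß1 + ß2U) raised to the power J and converts it to a CV at a given concentration. The function names and, in particular, the parameter values are hypothetical; they are chosen only to produce a plausible curve and are not the fitted values from this study. Fitting the parameters (3) is not attempted here.

```python
def variance_function(u, beta1, beta2, j):
    """Three-parameter variance function: variance = (beta1 + beta2 * u) ** j."""
    return (beta1 + beta2 * u) ** j


def cv_percent(u, beta1, beta2, j):
    """Imprecision profile value: CV (%) at concentration u."""
    return 100.0 * variance_function(u, beta1, beta2, j) ** 0.5 / u


# Hypothetical parameter values, NOT the fitted values from the study.
params = dict(beta1=1.0, beta2=0.02, j=2.0)

for conc in (30, 70, 115, 180):  # approximate QC concentrations, nmol/L
    print(conc, round(cv_percent(conc, **params), 2))
```

With these illustrative parameters the CV falls as concentration rises, which is the general shape expected of an imprecision profile over this range.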
Results
During a 29-month period, starting mid-November 1993, 19 of 610
T4 RIA batches (3.2%) failed internal QC. In 17
cases, at least three of the four QC results fell outside 95%
confidence intervals in the same direction. In two cases the operator
inadvertently prepared assay label with
[125I]triiodothyronine (T3) instead of
[125I]T4. These 19 batches of measurements
were repeated. The rejected batches have no clinical relevance and are
not considered further. Assuming a batch satisfies internal QC
requirements, some individual results are rejected on the basis of poor
replication of the duplicates (a 1% significance level is set during
assay response error analysis). QC and clinical specimen results are
treated identically and on this basis 36 of 2364 QC results (1.5%)
were excluded from further analysis (corresponding poorly replicated
clinical measurements were repeated). The important point is not the
particular QC procedures used in this laboratory, but collection of QC
data that accurately reflect whatever conditions and reporting rules
are applied to clinical specimen results, i.e., QC results that
realistically sample the clinical outputs of the assay. Fig. 1
shows the 2328 "in-control" QC results (means of
duplicates) from 591 "in-control" batches. Changes of QC specimen
lots are indicated by vertical lines. Closed arrows indicate four
changes of assay calibrators that occurred during the 29 months.

Figure 1. QC specimen results (means of duplicates) at four
concentrations H (high), N (normal), LN (low normal), and L (low) from
591 consecutive in-control T4 RIA batches over 29 months.
Changes of QC specimens are indicated by vertical lines.
Horizontal lines indicate means and 95% confidence
intervals for each QC lot. Closed arrows indicate where four
changes of calibrators occurred. Two open arrows indicate
statistically significant effects possibly associated with reagent lot
changes (see text).
A relatively large number of factors potentially contribute to the
observed variability. The list includes specimen and reagent pipetting
errors, bound/free separation errors, signal measurement errors (this
group is often summarized as within-assay variability), data reduction
factors, four changes of calibration materials, 11 lot changes of
antibody, 27 lot changes of [125I]T4,
within-batch drift, specimen carryover (known to be trivial), effects
arising from seven different operators who rotate through the
laboratory and imaging sections of this department, plus any
longitudinal effects that might arise from reagent aging. All these
error sources are captured in the data in Fig. 1.
Figure 2 shows worst-case and best-case estimates of error. Worst-case
error was estimated by calculating the mean and variance of each of the
last four lots of QC results at each of the four QC concentrations
(totaling 16 data points; see inset of Fig. 2). We also estimated the
relation between variance and concentration
(3)(4) to plot an imprecision profile. The
initial QC lot (batches 1–93) was excluded from worst-case estimation
because these data did not incorporate error from a calibration change.
To estimate best-case error, each lot of QC results was subdivided into
smaller runs, each run being terminated by any calibration or reagent
lot change. Runs were further subdivided according to whether the QC
specimen was assayed at the start or end of the batch. This yielded 296
means and variances, each based on relatively few results (see inset of
Fig. 2). The effects of calibration and reagent lot changes,
within-batch drift, and most of any longitudinal drift have been
effectively factored out by the data subdivision. The CV difference
between the imprecision profiles in Fig. 2 is ~0.6% (a relative
difference of ~15%) and reflects the average effect of
changes of calibration materials, changes of
[125I]T4 and antibody, within-batch drift,
and most of any longitudinal drift.
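The difference between the two subdivisions can be sketched in Python as follows. The data are invented for illustration, the names `summarize` and `pooled_cv` are hypothetical, and pooling variances by degrees of freedom at a single concentration is a simplification used here in place of fitting the variance function (3)(4) across concentrations.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical QC results at one concentration (nmol/L), each tagged with the
# QC lot and a "run" that ends at any calibration or reagent lot change.
# The numbers are illustrative only and are not taken from the study.
results = [
    ("lot2", "run1", 112.0), ("lot2", "run1", 115.0), ("lot2", "run1", 113.0),
    ("lot2", "run2", 119.0), ("lot2", "run2", 121.0), ("lot2", "run2", 118.0),
    ("lot2", "run3", 116.0), ("lot2", "run3", 114.0), ("lot2", "run3", 117.0),
]


def summarize(records, key):
    """Group values by key(record) and return {group: (n, mean, variance)}."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {g: (len(v), mean(v), variance(v)) for g, v in groups.items()}


def pooled_cv(summary):
    """Pool group variances by degrees of freedom; return CV (%) at the grand mean."""
    df_total = sum(n - 1 for n, _, _ in summary.values())
    pooled_var = sum((n - 1) * var for n, _, var in summary.values()) / df_total
    n_total = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * m for n, m, _ in summary.values()) / n_total
    return 100.0 * pooled_var ** 0.5 / grand_mean


worst_case = summarize(results, key=lambda r: r[0])          # whole QC lot
best_case = summarize(results, key=lambda r: (r[0], r[1]))   # within each run

print("worst-case CV %:", round(pooled_cv(worst_case), 2))
print("best-case CV %:", round(pooled_cv(best_case), 2))
```

Because shifts between runs contribute to the whole-lot variance but not to the within-run variances, the worst-case figure is the larger of the two in this example.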

Figure 2. Between-assay imprecision profiles for a T4
RIA representing average worst-case conditions (16
means and variances estimated from a total of 1955 QC results; 1939
degrees-of-freedom) and average best-case conditions (296
means and variances estimated from 2289 QC results; 1993
degrees-of-freedom) over 29 months.
The effects of calibration and reagent lot changes, within-batch drift,
and most of any longitudinal drift have been factored out of the
best-case profile. The profiles reflect imprecision for duplicate
measurement. The shaded areas are approximate 95%
confidence intervals and the arrows indicate the assay
reference range (55–140 nmol/L). The inset shows the data
points and corresponding variance functions.
Internal QC has consistently shown systematic precision differences
when comparing experienced and new staff. Fig. 3 is a further
breakdown by operator: The best-case error for the
best-performed operator is compared with the worst-case error for the
least well-performed operator. The confidence intervals are somewhat
wider in Fig. 3 because of fewer data. The best-case imprecision
profile in Fig. 3 is probably an accurate representation of a careful
short-term evaluation performed by a highly skilled operator using a
single lot of all reagents. The worst-case profile is an example of
what can happen thereafter. The CV difference in this case is ~2.5%,
or approximately a twofold relative difference. We make three
additional points about the data and summaries in Fig. 3. First, the
operator represented in the worst-case profile may not be the "least
well-performed" operator in a literal sense, because operators who
happen to work through many calibration or reagent lot changes are
potentially disadvantaged relative to those who do not. Legitimate
operator comparisons require either within-assay data or best-case data
(extraneous factors eliminated). Second, the worst-case variance
function in the inset of Fig. 3 does not appear to fit the data
particularly well; in particular, it appears biased upwards. This
operator appeared in only the second, fourth, and fifth QC lots and the
corresponding means and variances were derived from 32, 5, and 15
replicated measurements, respectively, i.e., the three data points at
each QC concentration have markedly unequal "importance." The
estimation method (3) weights the analysis appropriately,
with the result that the fit to unequal data occasionally appears poor
when judged by eye. Third, these profiles are averages over all work
performed by two particular operators. Further data subdivisions (e.g.,
the worst of the "least well-performed" operator) might yield more
extreme estimates, but this would be at the cost of increasing
uncertainty (i.e., increasingly wider confidence intervals). Any
longitudinal estimate of error is necessarily an "average" of some
sort. The profiles in Fig. 3 are an attempt at a reasonable compromise
between realism and reliability of the estimates.

Figure 3. Between-assay imprecision profiles for a T4
RIA representing worst-case conditions for the "least
well-performed" operator (12 means and variances estimated from a
total of 208 QC results; 196 degrees-of-freedom) and best-case
conditions for the best-performed operator (86 means and variances
estimated from 287 QC results; 201 degrees-of-freedom).
The effects of calibration and reagent lot changes, within-batch drift,
and most of any longitudinal drift have been factored out of the
best-case profile. The profiles represent an average
for these two operators, over all work performed by them, and reflect
imprecision for duplicate measurement. The shaded areas are
approximate 95% confidence intervals and the arrows
indicate the assay reference range (55–140 nmol/L). The
inset shows the data points and corresponding variance
functions.
In addition to simply summarizing error (Figs. 2 and 3), analysis ought
to provide information on the contributions of individual error
sources. A 64-parameter multiple regression, which incorporated changes
of QC specimens, changes of calibrators, lot changes of antibody and
[125I]T4, within-batch drift, operator
effects, and linear and quadratic terms for longitudinal drift within
each QC lot, was successful in explaining 58.5% (high QC) to 85.1%
(low QC) of the variability shown in Fig. 1. Forward selection of
parameters (F-to-enter = 2.0) reduced the model to between 13 and
18 parameters but still explained 53.8% (high QC) to 83.7% (low QC)
of the variability. Unfortunately, many parameter combinations were
highly correlated and factor effects were therefore almost impossible
to interpret. Because the objective was simply to gain insights into
error contributions, we orthogonalized the explanatory variables. This
has the effect of rendering each factor independent of all others, i.e.,
each new parameter entered into the model addresses only the
variability left unexplained by preceding parameters. This approach
directly extracts the size of factor effects. The main disadvantage is
that factor estimates are dependent on the order of
appearance in the model; judgment is required on the part of the
investigator. With reference to Fig. 1, a logical sequence seemed to
be: (a) changes in QC specimens (i.e., initially extract
effects caused by a change from one QC lot to the next); (b)
calibration changes; and (c) within-batch drift. The
remaining factors are intrinsically highly correlated. After first
establishing that no operator contrasts were statistically significant
(i.e., different operators did not appear to exert any
systematic effects), we added parameters to account for lot
changes of antibody, then lot changes of
[125I]T4, then linear and quadratic terms
within each QC lot to account for any longitudinal drift left
unexplained. We repeated the analysis, reversing the antibody and
[125I]T4 parameter ordering. Serial
correlations of the residuals were not significantly different from
zero (P >0.5 at all four QC concentrations) and normal
probability plots showed little evidence of residual nonnormality.
Statistically significant parameters (factor effects) are shown in
Table 1.
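The order dependence described above can be illustrated with a small Python sketch. It orthogonalizes 0/1 explanatory variables in the order given and reports the sum of squares each one explains. The orthogonalization here is plain Gram-Schmidt, which is an assumption; the "standard method" of ref. 5 is not described in this text and may differ. The data and the name `sequential_sums_of_squares` are hypothetical.

```python
def center(column):
    """Subtract the column mean from every value."""
    m = sum(column) / len(column)
    return [v - m for v in column]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sequential_sums_of_squares(y, columns):
    """Orthogonalize columns in the given order (Gram-Schmidt) and return the
    sum of squares of y explained by each successive column."""
    y_c = center(y)
    basis, explained = [], []
    for column in columns:
        q = center(column)
        for b in basis:
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        norm2 = dot(q, q)
        if norm2 < 1e-12:          # column fully explained by earlier ones
            explained.append(0.0)
            continue
        basis.append(q)
        explained.append(dot(y_c, q) ** 2 / norm2)
    return explained


# Hypothetical QC results (nmol/L) for eight consecutive batches, with a
# label lot change one batch before an antibody lot change.
y = [100, 101, 103, 106, 105, 107, 106, 105]
label_change = [0, 0, 1, 1, 1, 1, 1, 1]
antibody_change = [0, 0, 0, 1, 1, 1, 1, 1]

antibody_first = sequential_sums_of_squares(y, [antibody_change, label_change])
label_first = sequential_sums_of_squares(y, [label_change, antibody_change])
print("antibody entered first:", antibody_first)
print("label entered first:   ", label_first)
```

Whichever factor is entered first absorbs most of the shared effect, while the total explained sum of squares is the same for both orderings. This mirrors the situation reported for the nearly coincident antibody and label lot changes.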
Table 1. Estimated multiple regression parameters that were
statistically significant (P <0.01) at one or more of
the four T4 QC concentrations
(nmol/L).1
The QC specimen contrasts are of no particular interest. They merely
indicate that concentrations changed significantly from lot to lot. The
significant effects attributable to calibration changes are the
important results of the analysis. The estimate of average within-batch
drift was equivalent to ~1.7% at the low QC concentration but was
<0.7% at the higher concentrations. When antibody parameters preceded
[125I]T4 parameters, the analysis
suggested a significant effect arising from the third lot change of
antibody (Ciba Corning lots 24874 and 25944), indicated by open arrow 1
in Fig. 1. An opposite effect appeared to be associated with the ninth
lot change of [125I]T4 (Du Pont lots AT81940
and AT92340), indicated by open arrow 2 in Fig. 1. Unfortunately,
despite conscious efforts to stagger all reagent lot changes, the
eighth lot change of [125I]T4 (Du Pont lots
AT72240 and AT81940) occurred in the assay batch immediately before
open arrow 1 in Fig. 1. When the analysis was repeated with
[125I]T4 parameters entered first, the effect
was now "explained" by the eighth
[125I]T4 lot change and the antibody contrast
was no longer significant. Note the similarity between parameter values
for A3 and L8 in Table 1. This effect might have been due to a lot
change of either antibody or [125I]T4, but we
were unable to ascertain which.
The effects associated with changes of calibration materials are the
difference between the mean of all results preceding the change and the
mean of all results following the change. The first calibration change
produced the largest effects (see parameter C1 in Table 1), but at
least some of this can be attributed to subsequent
effects summarized by parameters A3, L8, and L9 in Table 1 and
indicated by the open arrows in Fig. 1. Interpretation of results
(Table 1) requires cognizance of the order in which parameters appeared
in the model. Inspection of Fig. 1 might suggest, for example, that
longitudinal drift effects were operating in some sequences of results.
None of the linear or quadratic drift parameters was statistically
significant, but this almost certainly reflects the low importance
attached to them in the parameter ordering, i.e., we chose
to attach greater importance to reagent lot effects as explanatory
variables. This approach to analyzing QC data probably merits further
inquiry.
Finally, a potentially important factor that is sometimes overlooked is
the suitability of the specimens. Table 2 contains data from an
evaluation of a thyrotropin (TSH) assay
(6). The authors acknowledged that the evaluation was
conducted with a single lot of all reagents, and qualified their
results by stating, "observed interassay precisions might increase if
multiple lots of reagents are used." Equally important is the
systematic precision difference exhibited by two "modified" control
materials that were analyzed under identical conditions. Both
materials are no doubt perfectly satisfactory for the primary role of
QC, i.e., are today's results in control? However, in this case, one
could assert that one or the other (and possibly both) fails to address
the question: What is the day-to-day reproducibility of
clinical specimen measurement?
Discussion
Many published estimates of assay performance have effectively
captured best-case conditions, largely attributable to the constraints
inherent in a preliminary evaluation. An estimate of the best that can
be expected from an assay is a perfectly legitimate and useful
indicator. However, as Krouwer (1) implied, corresponding
information on worst-case performance is somewhat scarcer. Spencer et
al. (7) recently echoed this general theme by calling for
greater realism in evaluating TSH assays. They suggested repetitive
analysis of several clinical specimens (or pools) over a clinically
relevant period (6–8 weeks in the case of TSH), using at least two
reagent lots, with random ordering of the specimens. We concur, but
also suggest that the 6–8 week "experiment" be continued
indefinitely.
Our experience over 2½ years indicates that total error can be
markedly variable. The specimen randomization suggested by Spencer et
al. (7) should directly estimate the average total error
over the time frame of the data. A case could be made, however, for
aiming to capture the worst case of total error by using (QC) specimens
in a more systematic way. This requires some thought about the error
sources that apply to a particular assay system, then deployment of
(QC) specimens in a way that deliberately captures the worst case (or
approximately the worst case) of these errors. We considered the QC
design given in Materials and Methods to be
appropriate for a manual T4 RIA but we are certainly not
averse to criticism or suggestions. Quite different designs might be
appropriate for automated instruments to provide for alternative
calibration strategies and random access measurement. One immediate
advantage of systematic designs is that they are likely to reveal
problems more quickly than a randomization, and particularly when the
problem is one that develops during routine operation (e.g.,
a within-batch or within-day drift problem). They also provide data
that are better suited to the summaries given in Figs. 2 and 3. These
data express the variable nature of total error. The size of the
difference between best- and worst-case estimates is itself useful
information on the robustness of a method.
Our presentation of results may give the impression that considerable
statistical analysis is unavoidable. In fact, the worst-case data in
the inset of Fig. 2 involved only the calculation of 16 means and
variances, hardly excessive given the time frame of the data. In
practice, regular updates over shorter time frames (e.g., a moving
6-month window) would involve the calculation of perhaps 3–10 means
and variances, depending on the number of specimens used. Results could
be equally well displayed in either a table (e.g., Table 2) or
graphically (e.g., the inset of Fig. 2). In most cases, existing
methods of summarizing and presenting QC results would be perfectly
adequate. More detailed statistical analyses, such as those attempted
here, have the potential to be useful, but they are certainly not
essential. The most important factor is the data: Do
they realistically characterize the error sources that affect clinical
results?
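A moving-window update of the kind mentioned above needs very little code. The Python sketch below computes the number of results, mean, and variance for one QC specimen over a window ending on a chosen date. The function name `window_summary`, the 182-day window length, and the synthetic records are all assumptions made for illustration. In practice a window that spans a change of QC lot would need to be split by lot first.

```python
from datetime import date, timedelta
from statistics import mean, variance


def window_summary(records, end, days=182):
    """Summarize QC results dated within `days` before `end` (inclusive of end).

    records: iterable of (date, value) pairs for one QC specimen.
    Returns (n, mean, variance), or None if fewer than two results fall
    inside the window.
    """
    start = end - timedelta(days=days)
    values = [v for d, v in records if start < d <= end]
    if len(values) < 2:
        return None
    return len(values), mean(values), variance(values)


# Synthetic daily results (nmol/L) for one year; not data from the study.
records = [
    (date(1994, 1, 1) + timedelta(days=i), 100.0 + (i % 3)) for i in range(365)
]

print(window_summary(records, end=date(1994, 12, 31)))
```

Calling the function once per QC specimen at each update date gives the handful of means and variances referred to in the text.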
A comparison of Figs. 2 and 3 shows that operator effects have an
important influence on total error of our T4 RIA. It
would be easy to surmise that this is largely (or solely) a feature of
manual assays. However, automated instruments may also give rise to
genuine operator effects if some staff are less fastidious than others,
e.g., failing to warm reagents to recommended temperatures, failing to
properly follow priming, calibration, or maintenance procedures, etc.
Spencer et al. (8) recently published strong evidence of
an "environmental" factor in the performance of some automated
instruments. In particular, performance by manufacturers was shown to
be consistently better than performance in clinical laboratories.
In addition, instruments (fully automated and otherwise) have been
known to fail from time to time. Total error in the period leading up
to a breakdown may differ systematically from the total error
immediately after corrective action.
The most disconcerting finding from this longitudinal view of a
T4 RIA was the poor performance in calibration. New
calibrators are tested by assaying them in a routine clinical assay,
then recalculating the assay with the new calibrators. This process is
repeated over at least three batches to accumulate >200 paired
clinical specimen results (i.e., new calibrator results vs current
calibrator results). The paired results are subjected to regression
analysis (9) with the pass/fail criteria that 95%
confidence intervals of slope and intercept must enclose 1 and 0
respectively. All four regressions predicted the direction of the
changes indicated in Table 1 but underestimated the magnitude. These
events occurred 6–7 months apart and, at the time, small lot-to-lot
differences, apparently within sampling error of the relation
y = x, did not seem to be a serious problem.
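The pass/fail rule for new calibrators can be sketched in Python as below. The regression method of ref. 9 is not described in this text, so ordinary least squares with large-sample (normal approximation) confidence intervals is used here purely as a stand-in. That choice ignores measurement error in the x variable and is an assumption. The paired results are simulated, and the names `ols_with_ci` and `calibrators_pass` are hypothetical.

```python
import random
from statistics import NormalDist, mean


def ols_with_ci(x, y, level=0.95):
    """Ordinary least-squares fit of y on x with large-sample confidence
    intervals for slope and intercept (normal approximation)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_var = sum(
        (yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)
    ) / (n - 2)
    se_slope = (resid_var / sxx) ** 0.5
    se_intercept = (resid_var * (1.0 / n + mx ** 2 / sxx)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return {
        "slope": (slope - z * se_slope, slope + z * se_slope),
        "intercept": (intercept - z * se_intercept, intercept + z * se_intercept),
    }


def calibrators_pass(ci):
    """Pass if the slope interval encloses 1 and the intercept interval encloses 0."""
    lo_s, hi_s = ci["slope"]
    lo_i, hi_i = ci["intercept"]
    return lo_s <= 1.0 <= hi_s and lo_i <= 0.0 <= hi_i


# Simulated paired clinical results (nmol/L); not data from the study.
rng = random.Random(1)
current = [rng.uniform(20, 250) for _ in range(210)]
new = [1.06 * c + rng.gauss(0, 4) for c in current]   # deliberate 6% shift

ci = ols_with_ci(current, new)
print(ci, calibrators_pass(ci))
```

A proportional shift as large as the simulated one is rejected, but smaller shifts can fall inside the intervals, and several such changes in the same direction can accumulate, which is consistent with the experience described in the text.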
A calculation comparing the means of the first 20 and last 20 results
in each QC lot, then summing up the differences, suggested that assay
results have drifted upwards by ~6% at the high QC concentration,
increasing to ~18% at the low QC concentration. The assay is
probably not out of control in the usual sense of the term, but the
failure to maintain tighter internal consistency was greatly
disappointing. A reevaluation of our methods of preparing and testing
calibration specimens is clearly required. The summaries in Figs. 2 and 3
capture errors during ~6-monthly QC runs, and we believe them to be
realistic estimates of error from a clinical monitoring
perspective. It is not yet clear to us how to calculate overall error
(i.e., incorporating the long-term upward drift) in a way that
expresses its effect on diagnostic utility. We received no
queries or complaints from clinicians during the last 2½ years
and it is therefore unclear whether the long-term drift has
improved or worsened the relation between assay results and the stated
clinical reference range. Estimation of a new reference range is also
clearly required. The lack of a reference method continues to be a
problem for many immunoassays.
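The long-term drift calculation described at the start of this paragraph (first 20 vs last 20 results in each QC lot, summed over lots) might be sketched in Python as follows. The text does not say how differences from lots with different concentrations were combined, so expressing each within-lot difference as a percentage of that lot's starting mean before summing is an assumption. The function names and the synthetic lots are hypothetical.

```python
from statistics import mean


def lot_drift(results, window=20):
    """Difference between the mean of the last and first `window` results."""
    if len(results) < 2 * window:
        raise ValueError("need at least 2 * window results per lot")
    return mean(results[-window:]) - mean(results[:window])


def cumulative_drift_percent(lots, window=20):
    """Sum within-lot drifts, each expressed as % of that lot's starting mean.

    Converting to percentages before summing is an assumption made for this
    sketch, not a detail taken from the text.
    """
    total = 0.0
    for results in lots:
        start = mean(results[:window])
        total += 100.0 * lot_drift(results, window) / start
    return total


# Synthetic lots (nmol/L) with a slow upward trend; not data from the study.
lot_a = [100.0 + 0.02 * i for i in range(120)]
lot_b = [110.0 + 0.01 * i for i in range(120)]

print(round(lot_drift(lot_a), 3), round(cumulative_drift_percent([lot_a, lot_b]), 2))
```

Because each lot is compared only with itself, the calculation is unaffected by the change in concentration from one QC lot to the next.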
Overall, the total analytical error (as defined in this work) for
most clinical specimens falls within the limits defined
in Fig. 3. Alternatively, the worst-case imprecision profile in Fig. 2
is a reliable estimate of average total error for most
specimens. These statements apply only to clinical monitoring
situations and must be qualified by the long-term drift factor
mentioned in the previous paragraph. A small number of specimens (<1
in 1500) produced clearly discordant overestimation of T4,
invariably shown to be due to the presence of T4 binding
substances (presumably autoantibodies). Other interfering substances,
or lesser amounts of binding substances, are almost certainly exerting
effects, but of insufficient magnitude to be detected (this may not be
a problem from a monitoring perspective if the amount of interferent
remains roughly constant over the clinical surveillance period).
Current opinion (2) suggests that total analytical error
should not exceed one-half of the intraindividual biological variation.
This limits the increase in biological variation due to analytical
errors to ~10%. On this basis, maximum total error for
T4 assays has been estimated as 3.8% (10),
5.5% (11), 2.4% (12), 2.5%
(13), and 1.8% (14). Taking 3% as a
reasonable consensus value, it is clear from Fig. 2 that our
T4 RIA fails to achieve this particular clinical goal on
average, although it may occasionally come close (Fig. 3). Total error
≤3% is a difficult target for current T4 assays.
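The arithmetic behind the "one-half of biological variation" goal can be checked with a short Python sketch. It assumes that analytical and biological components are independent and add as variances. The function names and the 6% biological CV are hypothetical and are used only to show the calculation.

```python
def inflation_percent(ratio):
    """Percentage increase in observed within-subject variation when the
    analytical CV equals `ratio` times the biological CV, assuming the two
    components are independent and add as variances."""
    return 100.0 * ((1.0 + ratio ** 2) ** 0.5 - 1.0)


def max_analytical_cv(biological_cv, ratio=0.5):
    """Largest analytical CV allowed under the 'one-half of biological' rule."""
    return ratio * biological_cv


print(round(inflation_percent(0.5), 1))   # about 11.8
print(max_analytical_cv(6.0))             # hypothetical biological CV of 6%
```

Under these assumptions a ratio of one-half inflates the observed variation by about 12%, in line with the approximate figure quoted above, and a biological CV of 6% would correspond to the 3% consensus target.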
The multifactor experiments detailed by Krouwer (1) are a
very efficient means of detecting and measuring factor effects and seem
to us to be ideally suited to either short-term preliminary
evaluations, or for investigating problems that may arise during
routine operation. Unfortunately, estimates of total error that might
be constructed from such data are likely to be optimistic in some assay
systems. For example, operator and calibrator effects, important
sources of error in our T4 RIA, require more extensive data
than can be reasonably expected from a snapshot experiment. There are
likely to be many different opinions or ideas on how realistic
estimates of assay performance are best obtained. We favor internal QC
because it is already standard practice for all analytes in all
laboratories. The data flow is continuous and reflects genuine clinical
working conditions. In most cases, existing methods of summarizing QC
data would be perfectly adequate, but this need not exclude additional
statistical analysis. We believe that there is scope for innovative QC
that not only directly captures realistic performance data but also
allows factor effects to be assessed on a continuous basis (albeit with
less efficiency than designed experiments). Simplicity is also
important; bench staff are usually not statisticians and often work
under considerable pressure. Our staff are currently required only to
glance at a wall chart to ascertain the position and ordering of QC
specimens. Provided they are recognized, occasional position/ordering
mistakes need have no adverse effect on any subsequent analyses.
Although our current QC methods can doubtless be improved, any design
that requires more than a glance at a wall chart (or similar) is
probably unnecessarily complicated.
We used extensive QC data here simply because they happened to be
available. Excellent estimates of total analytical error could be
obtained from considerably fewer data. Publication of such estimates
would be a very useful follow-up to preliminary reports that may have
demonstrated the best of an assay method.
 |
Footnotes
|
|---|
1 Nonstandard abbreviations: QC, quality control; T4, thyroxine; and TSH, thyrotropin. 
 |
References
|
|---|
-
Krouwer JS. Estimating total analytical error and its sources. Techniques to improve method evaluation. Arch Pathol Lab Med 1992;116:726-731.
-
Fraser CG, Petersen PH. Desirable standards for laboratory tests if they are to fulfill medical needs. Clin Chem 1993;39:1447-1455.
-
Sadler WA, Smith MH. A reliable method of estimating the variance function in immunoassay. Comput Stat Data Anal 1986;3:227-239.
-
Sadler WA, Smith MH. Use and abuse of imprecision profiles: some pitfalls illustrated by computing and plotting confidence intervals. Clin Chem 1990;36:1346-1350.
-
Draper NR, Smith H. Applied regression analysis. New York: John Wiley and Sons, 1966:156-158.
-
Ward G, White M, Hickman PE. Simple procedures can markedly enhance automated immunoassay performance. Am J Clin Pathol 1994;102:3-6.
-
Spencer CA, Takeuchi M, Kazarosyan M. Current status and performance goals for serum thyrotropin (TSH) assays. Clin Chem 1996;42:140-145.
-
Spencer CA, Takeuchi M, Kazarosyan M, MacKenzie F, Beckett GJ, Wilkinson E. Interlaboratory/intermethod differences in functional sensitivity of immunometric assays of thyrotropin (TSH) and impact on reliability of measurement of subnormal concentrations of TSH. Clin Chem 1995;41:367-374.
-
Bablok W, Passing H, Bender R, Schneider B. A general regression procedure for method transformation. J Clin Chem Clin Biochem 1988;26:783-790.
-
Statland BE, Winkel P, Killingsworth LM. Factors contributing to intraindividual variation of serum constituents: 6. Physiological day-to-day variation in concentrations of 10 specific proteins in sera of healthy subjects. Clin Chem 1976;22:1635-1638.
-
Williams GZ, Widdowson GM, Penton J. Individual character of variation in time-series studies of healthy people. 11. Differences in values for clinical chemical analytes in serum among demographic groups by age and sex. Clin Chem 1978;24:313-320.
-
Costongs GMPJ, Janson PCW, Bas BM, Hermans J, van Wersch JWJ, Brombacher PJ. Short-term and long-term intra-individual variations and critical differences of clinical chemical laboratory parameters. J Clin Chem Clin Biochem 1985;23:7-16.
-
Browning MCK, Ford RP, Callaghan SJ, Fraser CG. Intra- and interindividual biological variation of five analytes used in assessing thyroid function: implications for necessary standards of performance and the interpretation of results. Clin Chem 1986;32:962-966.
-
Ricós C, Arbós MA. Quality goals for hormone testing. Ann Clin Biochem 1990;27:353-358.