Clinical Chemistry


     


Clinical Chemistry 43: 608-614, 1997;
(Clinical Chemistry. 1997;43:608-614.)
© 1997 American Association for Clinical Chemistry, Inc.


Articles

A pragmatic approach to estimating total analytical error of immunoassays

William A. Sadler^a, Murray H. Smith^1, Lynda M. Murray and John G. Turner

Department of Nuclear Medicine, Christchurch Hospital, Private Bag, Christchurch, New Zealand.
1 Department of Mathematics & Statistics, University of Canterbury, Christchurch, New Zealand.
a Author for correspondence. Fax 64 3 364 0869.


   Abstract
Several groups have recently commented on the need for more realistic information on analytical performance of laboratory tests. The term "total analytical error" is sometimes used in this context. However, differing opinions have been expressed on how best to obtain estimates of clinical assay error, as it would be perceived by clinicians. We suggest a pragmatic definition of total analytical error for immunoassays and describe our attempts to estimate it by simple designs in the internal quality-control process. We use results over 29 months from a total serum thyroxine RIA. The most important error sources were those related to calibration materials and operator effects, errors not usually captured by short-term or snapshot experiments.


Indexing terms: quality control • thyroxine RIA


   Introduction
Krouwer (1) showed that classical method evaluations give an optimistic impression of assay performance because estimates of error from various individual sources (e.g., drift, bias, etc.) are generally not combined to give a picture of total error (i.e., the error that might be observed by a clinician). His paper tabulates several multifactor evaluation designs and describes models and techniques for combining results of these experiments into an estimate of total error. The emphasis on realism is especially relevant in view of interest in setting clinical performance goals for laboratory tests (e.g., ref. 2).

Unfortunately, the experimental approach described by Krouwer has some complications when viewed from a clinical laboratory perspective. First, some laboratories would be stretched to find resources to perform multifactor experiments at suitably regular intervals for each of many analytes. Resources in this context refer not only to reagents and other consumables but also to staff time. Second, snapshot experiments do not readily capture errors arising from reagent lot changes, and especially changes of calibration materials. Neither do they capture longitudinal effects that sometimes arise from aging of reagents or lack of stability of calibration materials. Third, the bias component of Krouwer's experimental designs cannot be properly applied to most immunoassays because of a lack of reference methods.

An additional complication is that Krouwer's total error models require quantification of nonspecific interferences. Unfortunately, many substances well known to interfere in immunoassays (autoantibodies, heterophile antibodies, drugs, etc.) cannot be easily related to assay error in a quantitative way. The effects are usually seen as a large departure from an expected result in a small proportion of specimens, but the amount of interferent is usually unknown. Lesser interference probably occurs frequently but is not detected. It seems reasonable to divide error sources into two components: those that systematically operate on all specimens (usually measurable) and those that relate only to particular specimens (interferences). Although unsatisfying, it must be recognized that some errors can be realistically estimated and some cannot. Thus, for pragmatic reasons, we restrict our perception of total analytical error to that operating on all specimens. Estimates must be immediately qualified by a statement such as: "some results may have (markedly) different error properties because of the presence of interfering substances in the specimen." The qualifying statement can be augmented from knowledge of which substances are known to interfere, the frequency of occurrence, and the direction of the error. Preanalytical factors might also be included in a qualifying statement.

The importance of Krouwer's message (1) cannot be overstated, but we concluded that practical adjustments must be made to the way total error is perceived and estimated (at least in the case of immunoassays). In mid-November 1993 we modified the internal quality-control (QC) scheme for a serum thyroxine (T4) RIA in an attempt to directly monitor total analytical error on a continuous basis.1 We summarize here the QC results accumulated over 29 months, describe some methods of extracting information from the data, and discuss the advantages and disadvantages of this particular approach to obtaining a realistic picture of assay performance.


   Materials and Methods
T4 RIA
Assay calibration specimens are prepared in a serum pool depleted of endogenous thyroid hormone by six to seven 45-min incubations with Amberlite IRA 400 (Cl-) ion-exchange resin (Sigma, St. Louis, MO) at 45–50 °C. Free acid l-T4 (Sigma) is used to prepare a 350 nmol/L serum calibrator that is serially diluted with the zero serum to yield additional calibrators at 280, 210, 140, 70, and 35 nmol/L. The assay reaction mixture consists of (a) 25 µL of calibrator, QC, or clinical sera; (b) 100 µL of 0.05 mol/L Tris-HCl buffer (pH 7.4) containing 100 µg/100 µL of RIA-grade bovine serum albumin (Sigma), 300 µg/100 µL of 8-anilino-1-naphthalene sulfonic acid (Sigma), and 150 pg/100 µL of [125I]T4 (product NEX-111; Du Pont, Wilmington, DE); and (c) 500 µL of magnetized T4 antibody (product 472299; Ciba Corning, Medfield, MA). The mixture is incubated at room temperature for 45 min before separation of antibody-bound and free [125I]T4 in a magnetic field. All measurements are made in duplicate with the exception that four replicates are used for the zero calibrator. Assay calibrators are stored at -30 °C and replaced at ~6-monthly intervals at a time that does not coincide with a change in any other reagent or QC specimen.

QC specimens
Four QC specimens labeled H (high), N (normal), LN (low normal), and L (low), with T4 concentrations of ~180, ~115, ~70, and ~30 nmol/L, respectively, are each prepared from one to five clinical specimens. Before introduction, preliminary target values are established for a new lot of QC specimens by assaying in six to 10 assay batches. QC specimens are stored at -30 °C in 400-µL aliquots. Each aliquot is used for five working days, with daily thaw/freeze cycles. QC specimens are replaced at ~6-monthly intervals, at a time that does not coincide with a change in any other reagent. QC specimens are assayed daily in the following order in cycles of four assay batches:

1. Start H, N, LN, L

2. End H, N, LN, L

3. Start N, LN, H, L

4. End N, LN, H, L

where Start or End means they are positioned immediately before, or immediately after, the clinical specimens (average batch size: 56 specimens during months 1–3 of this QC scheme, increasing to 69 specimens during months 27–29). The QC scheme includes a specimen carryover factor for the low specimen because specimens are dispensed for assay with a fixed glass capillary pipette (Model 325; Drummond Scientific, Broomall, PA) with three distilled water flushes between specimens. However, the LN:L and H:L ratios in the T4 RIA are only ~2:1 and ~6:1, respectively, and independent analysis of the carryover effect was negligible (best estimate <1:1300). Carryover effects are further reduced in this assay because they act only on the first of adjacent duplicates. The carryover factor was included in the QC design to establish it as standard practice in this laboratory, but it is also consistent with the philosophy of attempting to capture all error sources, even those of a trivial nature. It may have greater importance in other assay systems.
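The four-batch cycle above can be written down as a small lookup, which is roughly what a wall chart encodes. This is only an illustrative sketch: the names `CYCLE` and `qc_layout` are hypothetical and are not part of the laboratory's procedure.

```python
# Hypothetical encoding of the four-batch QC cycle described above.
# Each entry is (position relative to the clinical specimens, dispensing order).
CYCLE = [
    ("start", ["H", "N", "LN", "L"]),   # batch 1: LN immediately precedes L
    ("end",   ["H", "N", "LN", "L"]),   # batch 2
    ("start", ["N", "LN", "H", "L"]),   # batch 3: H immediately precedes L
    ("end",   ["N", "LN", "H", "L"]),   # batch 4
]


def qc_layout(batch_number):
    """Return (position, order) for a 1-based consecutive batch number."""
    return CYCLE[(batch_number - 1) % len(CYCLE)]


position, order = qc_layout(6)  # the sixth batch repeats the second pattern
```

Alternating which specimen precedes L (LN in the first two batches, H in the last two) is what lets the design expose any carryover into the low specimen, while the start/end alternation exposes within-batch drift.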

Data analysis
Errors calculated from various subsets of data were summarized via the three-parameter variance function (3)(4) $\sigma^2(U) = (\beta_1 + \beta_2 U)^J$, where $\sigma^2(U)$ denotes variance, $U$ denotes concentration, and $\beta_1$, $\beta_2$, and $J$ are the parameters. In evaluating error sources at each QC level, the linear model used (which incorporates additive effects) can be represented as

$$Y_{ijklmn} = \mu + \alpha_i^Q + \alpha_j^C + \alpha_k^D + \alpha_l^A + \alpha_m^L + \alpha_n^O + \beta_{1i} x_i + \beta_{2i} x_i^2 + \varepsilon \qquad (1)$$

where $Y_{ijklmn}$ is the concentration variable (T4); $\mu$ is the overall mean parameter; $\alpha_i^Q$, i = 1–5, are the effects of five lots of QC specimens; $\alpha_j^C$, j = 1–5, are the effects of five lots of calibration specimens; $\alpha_k^D$, k = 1–2, are the effects of the within-batch QC specimen positioning (start vs end drift); $\alpha_l^A$, l = 1–12, are the effects of 12 lots of antibody; $\alpha_m^L$, m = 1–28, are the effects of 28 lots of [125I]T4; $\alpha_n^O$, n = 1–7, are the effects of seven different operators; $\beta_{1i}$, i = 1–5, are linear regression coefficients for batch-to-batch (longitudinal) drift specific to each QC specimen; $\beta_{2i}$, i = 1–5, are corresponding quadratic regression coefficients; $x_i$, i = 1–5, are elapsed time (days) since the introduction of each new QC lot; and $\varepsilon$ is the unmodeled error variable. All but the final levels of each $\alpha$ parameter were entered into the data with 0/1 dummy variables that convert the model to a multiple regression (it is necessary to omit the final level in each case to make these parameters identifiable). Explanatory variables were subsequently orthogonalized by a standard method (5). Regression parameters were estimated with Statistica 5.0 for Windows (Statsoft, Tulsa, OK), although any multiple regression program would suffice. A detailed example with a small subset of the results and an MS-DOS computer program that orthogonalizes up to 80 explanatory variables can be obtained on request from the first author.
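To make the dummy-variable coding and the orthogonalization step concrete, the following is a minimal sketch in plain Python. It assumes that sequential Gram-Schmidt orthogonalization is an acceptable stand-in for the "standard method" cited above; that is an assumption, since the cited method is not reproduced here. All function names and data values are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dummy_columns(labels):
    """0/1 dummy columns for all but the final level (levels in sorted order)."""
    levels = sorted(set(labels))
    return [[1.0 if lab == lev else 0.0 for lab in labels] for lev in levels[:-1]]


def orthogonalize(columns):
    """Sequential Gram-Schmidt: each column keeps only what earlier columns leave."""
    basis = []
    for col in columns:
        resid = list(col)
        for q in basis:
            coef = dot(resid, q) / dot(q, q)
            resid = [r - coef * qi for r, qi in zip(resid, q)]
        basis.append(resid)
    return basis


def sequential_ss(y, columns):
    """Sum of squares explained by each column after those entered before it."""
    return [dot(y, q) ** 2 / dot(q, q) for q in orthogonalize(columns)]


# Made-up results for one QC level: two calibrator lots, start/end positions.
calibrator = ["C1", "C1", "C1", "C2", "C2", "C2"]
position = ["start", "end", "start", "end", "start", "end"]
t4 = [114.0, 116.0, 115.0, 119.0, 118.0, 121.0]

intercept = [1.0] * len(t4)
design = [intercept] + dummy_columns(calibrator) + dummy_columns(position)
basis = orthogonalize(design)
ss = sequential_ss(t4, design)
```

Because the orthogonalized columns are mutually uncorrelated, the entries of `ss` add up without overlap, but each one depends on the order in which the columns were supplied.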


   Results
During a 29-month period, starting mid-November 1993, 19 of 610 T4 RIA batches (3.2%) failed internal QC. In 17 cases, at least three of the four QC results fell outside 95% confidence intervals in the same direction. In two cases the operator inadvertently prepared assay label with [125I]triiodothyronine (T3) instead of [125I]T4. These 19 batches of measurements were repeated. The rejected batches have no clinical relevance and are not considered further. Assuming a batch satisfies internal QC requirements, some individual results are rejected on the basis of poor replication of the duplicates (a 1% significance level is set during assay response error analysis). QC and clinical specimen results are treated identically and on this basis 36 of 2364 QC results (1.5%) were excluded from further analysis (corresponding poorly replicated clinical measurements were repeated). The important point is not the particular QC procedures used in this laboratory, but collection of QC data that accurately reflect whatever conditions and reporting rules are applied to clinical specimen results, i.e., QC results that realistically sample the clinical outputs of the assay. Fig. 1 shows the 2328 "in-control" QC results (means of duplicates) from 591 "in-control" batches. Changes of QC specimen lots are indicated by vertical lines. Closed arrows indicate four changes of assay calibrators that occurred during the 29 months.
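The batch rejection rule described above (at least three of the four QC results outside their 95% limits in the same direction) can be sketched as follows. The function name, the limits, and the example values are hypothetical; the laboratory's actual software is not described in the text.

```python
def batch_fails_qc(results, limits, min_outside=3):
    """Return True if at least `min_outside` QC results fall outside their
    95% limits on the same side.

    results: {level: measured value}
    limits:  {level: (lower limit, upper limit)}
    """
    above = sum(1 for lvl, v in results.items() if v > limits[lvl][1])
    below = sum(1 for lvl, v in results.items() if v < limits[lvl][0])
    return above >= min_outside or below >= min_outside


# Made-up 95% limits (nmol/L) for the four QC levels.
limits = {"H": (168.0, 192.0), "N": (107.0, 123.0), "LN": (64.0, 76.0), "L": (26.0, 34.0)}

high_batch = {"H": 195.0, "N": 125.0, "LN": 77.0, "L": 31.0}   # three results high
mixed_batch = {"H": 195.0, "N": 105.0, "LN": 77.0, "L": 31.0}  # two high, one low
```

Here `high_batch` would be rejected, whereas `mixed_batch` would not, because its out-of-limit results do not share a direction.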



Figure 1. QC specimen results (means of duplicates) at four concentrations H (high), N (normal), LN (low normal), and L (low) from 591 consecutive in-control T4 RIA batches over 29 months.

Changes of QC specimens are indicated by vertical lines. Horizontal lines indicate means and 95% confidence intervals for each QC lot. Closed arrows indicate where four changes of calibrators occurred. Two open arrows indicate statistically significant effects possibly associated with reagent lot changes (see text).

A relatively large number of factors potentially contribute to the observed variability. The list includes specimen and reagent pipetting errors, bound/free separation errors, signal measurement errors (this group is often summarized as within-assay variability), data reduction factors, four changes of calibration materials, 11 lot changes of antibody, 27 lot changes of [125I]T4, within-batch drift, specimen carryover (known to be trivial), effects arising from seven different operators who rotate through the laboratory and imaging sections of this department, plus any longitudinal effects that might arise from reagent aging. All these error sources are captured in the data in Fig. 1.

Figure 2 shows worst-case and best-case estimates of error. Worst-case error was estimated by calculating the mean and variance of each of the last four lots of QC results at each of the four QC concentrations (totaling 16 data points; see inset of Fig. 2). We also estimated the relation between variance and concentration (3)(4) to plot an imprecision profile. The initial QC lot (batches 1–93) was excluded from worst-case estimation because these data did not incorporate error from a calibration change. To estimate best-case error, each lot of QC results was subdivided into smaller runs, each run being terminated by any calibration or reagent lot change. Runs were further subdivided according to whether the QC specimen was assayed at the start or end of the batch. This yielded 296 means and variances, each based on relatively few results (see inset of Fig. 2). The effects of calibration and reagent lot changes, within-batch drift, and most of any longitudinal drift have been effectively factored out by the data subdivision. The CV difference between the imprecision profiles in Fig. 2 is ~0.6% (a relative difference of ~15%) and reflects the average effect of changes of calibration materials, changes of [125I]T4 and antibody, within-batch drift, and most of any longitudinal drift.
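The distinction between worst-case and best-case error can be illustrated with a small sketch. It simplifies the procedure above: instead of fitting a variance function to many means and variances, it compares the CV of a whole QC lot with a CV pooled within runs that are split at a calibrator change. The data and function names are made up.

```python
from math import sqrt
from statistics import mean, variance


def cv_percent(values):
    """CV of all results taken together (a worst-case view of one QC lot)."""
    return 100.0 * sqrt(variance(values)) / mean(values)


def pooled_cv_percent(runs):
    """CV from variances pooled within runs, weighted by degrees of freedom
    (a best-case view: shifts between runs are factored out)."""
    df = sum(len(r) - 1 for r in runs)
    pooled_var = sum((len(r) - 1) * variance(r) for r in runs) / df
    grand_mean = mean([v for r in runs for v in r])
    return 100.0 * sqrt(pooled_var) / grand_mean


# Made-up results (nmol/L) for one QC lot, before and after a calibrator change.
run_before = [113.0, 116.0, 114.0, 117.0]
run_after = [119.0, 122.0, 120.0, 123.0]

worst = cv_percent(run_before + run_after)        # includes the calibrator shift
best = pooled_cv_percent([run_before, run_after])  # shift removed by subdivision
```

In this toy example the within-run scatter is identical in both runs, so the entire gap between `worst` and `best` comes from the shift at the calibrator change.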



Figure 2. Between-assay imprecision profiles for a T4 RIA representing average worst-case conditions (16 means and variances estimated from a total of 1955 QC results; 1939 degrees-of-freedom) and average best-case conditions (296 means and variances estimated from 2289 QC results; 1993 degrees-of-freedom) over 29 months.

The effects of calibration and reagent lot changes, within-batch drift, and most of any longitudinal drift have been factored out of the best-case profile. The profiles reflect imprecision for duplicate measurement. The shaded areas are approximate 95% confidence intervals and the arrows indicate the assay reference range (55–140 nmol/L). The inset shows the data points and corresponding variance functions.
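An imprecision profile of this kind is simply the fitted variance function converted to a CV at each concentration. The sketch below assumes the three-parameter form given under Data analysis, with variance equal to (β1 + β2·U) raised to the power J. The parameter values are invented for illustration and are not the fitted values behind Fig. 2.

```python
def variance_function(u, b1, b2, j):
    """Three-parameter variance function: variance at concentration u."""
    return (b1 + b2 * u) ** j


def profile_cv_percent(u, b1, b2, j):
    """Imprecision profile: CV (%) at concentration u."""
    return 100.0 * variance_function(u, b1, b2, j) ** 0.5 / u


# Hypothetical parameters; with J = 2 the SD is linear in concentration.
B1, B2, J = 1.0, 0.03, 2.0

# Profile across the approximate QC concentrations (nmol/L).
profile = {u: profile_cv_percent(u, B1, B2, J) for u in (30, 70, 115, 180)}
```

With these illustrative parameters the CV falls as concentration rises, which is the typical shape of an immunoassay imprecision profile over its working range.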

Internal QC has consistently shown systematic precision differences when comparing experienced and new staff. Fig. 3 is a further breakdown by operator: The best-case error for the best-performed operator is compared with the worst-case error for the least well-performed operator. The confidence intervals are somewhat wider in Fig. 3 because of fewer data. The best-case imprecision profile in Fig. 3 is probably an accurate representation of a careful short-term evaluation performed by a highly skilled operator using a single lot of all reagents. The worst-case profile is an example of what can happen thereafter. The CV difference in this case is ~2.5%, or approximately a twofold relative difference. We make three additional points about the data and summaries in Fig. 3. First, the operator represented in the worst-case profile may not be the "least well-performed" operator in a literal sense, because operators who happen to work through many calibration or reagent lot changes are potentially disadvantaged relative to those who do not. Legitimate operator comparisons require either within-assay data or best-case data (extraneous factors eliminated). Second, the worst-case variance function in the inset of Fig. 3 does not appear to fit the data particularly well; in particular, it appears biased upwards. This operator appeared in only the second, fourth, and fifth QC lots and the corresponding means and variances were derived from 32, 5, and 15 replicated measurements respectively, i.e., the three data points at each QC concentration have markedly unequal "importance." The estimation method (3) weights the analysis appropriately, with the result that the fit to unequal data occasionally appears poor when judged by eye. Third, these profiles are averages over all work performed by two particular operators. Further data subdivisions (e.g., the worst of the "least well-performed" operator) might yield more extreme estimates, but this would be at the cost of increasing uncertainty (i.e., increasingly wider confidence intervals). Any longitudinal estimate of error is necessarily an "average" of some sort. The profiles in Fig. 3 are an attempt at a reasonable compromise between realism and reliability of the estimates.



Figure 3. Between-assay imprecision profiles for a T4 RIA representing worst-case conditions for the "least well-performed" operator (12 means and variances estimated from a total of 208 QC results; 196 degrees-of-freedom) and best-case conditions for the best-performed operator (86 means and variances estimated from 287 QC results; 201 degrees-of-freedom).

The effects of calibration and reagent lot changes, within-batch drift, and most of any longitudinal drift have been factored out of the best-case profile. The profiles represent an average for these two operators, over all work performed by them, and reflect imprecision for duplicate measurement. The shaded areas are approximate 95% confidence intervals and the arrows indicate the assay reference range (55–140 nmol/L). The inset shows the data points and corresponding variance functions.

In addition to simply summarizing error (Figs. 2 and 3), analysis ought to provide information on the contributions of individual error sources. A 64-parameter multiple regression, which incorporated changes of QC specimens, changes of calibrators, lot changes of antibody and [125I]T4, within-batch drift, operator effects, and linear and quadratic terms for longitudinal drift within each QC lot, was successful in explaining 58.5% (high QC) to 85.1% (low QC) of the variability shown in Fig. 1. Forward selection of parameters (F-to-enter = 2.0) reduced the model to between 13 and 18 parameters but still explained 53.8% (high QC) to 83.7% (low QC) of the variability. Unfortunately, many parameter combinations were highly correlated and factor effects were therefore almost impossible to interpret. Because the objective was simply to gain insights into error contributions, we orthogonalized the explanatory variables. This has the effect of rendering each factor independent of all others, i.e., each new parameter entered into the model addresses only the variability left unexplained by preceding parameters. This approach directly extracts the size of factor effects. The main disadvantage is that factor estimates are dependent on the order of appearance in the model; judgment is required on the part of the investigator. With reference to Fig. 1, a logical sequence seemed to be: (a) changes in QC specimens (i.e., initially extract effects caused by a change from one QC lot to the next); (b) calibration changes; and (c) within-batch drift. The remaining factors are intrinsically highly correlated. After first establishing that no operator contrasts were statistically significant (i.e., different operators did not appear to exert any systematic effects), we added parameters to account for lot changes of antibody, then lot changes of [125I]T4, then linear and quadratic terms within each QC lot to account for any longitudinal drift left unexplained. We repeated the analysis, reversing the antibody and [125I]T4 parameter ordering. Serial correlations of the residuals were not significantly different from zero (P >0.5 at all four QC concentrations) and normal probability plots showed little evidence of residual nonnormality. Statistically significant parameters (factor effects) are shown in Table 1.
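The order dependence of orthogonalized factor estimates is easy to demonstrate on a toy example. The sketch below uses two nearly coincident 0/1 indicators, one lot change occurring a single batch before the other, and shows that whichever factor is entered first takes most of the explained sum of squares. The data are made up and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def sequential_ss(y, first, second):
    """Sums of squares attributed to two factors entered in the given order.
    The second factor is orthogonalized against the first before use."""
    y, f, s = centered(y), centered(first), centered(second)
    ss_first = dot(y, f) ** 2 / dot(f, f)
    coef = dot(s, f) / dot(f, f)
    s_resid = [si - coef * fi for si, fi in zip(s, f)]
    ss_second = dot(y, s_resid) ** 2 / dot(s_resid, s_resid)
    return ss_first, ss_second


# Eight consecutive batches: tracer lot changes at batch 4, antibody at batch 5.
tracer = [0, 0, 0, 1, 1, 1, 1, 1]
antibody = [0, 0, 0, 0, 1, 1, 1, 1]
qc = [100.0, 101.0, 99.0, 104.0, 105.0, 106.0, 104.0, 105.0]

antibody_first = sequential_ss(qc, antibody, tracer)
tracer_first = sequential_ss(qc, tracer, antibody)
```

The total explained by the pair is the same in both orderings; only its attribution changes, which is why the ordering has to be chosen and reported by the investigator.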


Table 1. Estimated multiple regression parameters that were statistically significant (P <0.01) at one or more of the four T4 QC concentrations (nmol/L).1

The QC specimen contrasts are of no particular interest. They merely indicate that concentrations changed significantly from lot to lot. The significant effects attributable to calibration changes are the important results of the analysis. The estimate of average within-batch drift was equivalent to ~1.7% at the low QC concentration but was <0.7% at the higher concentrations. When antibody parameters preceded [125I]T4 parameters, the analysis suggested a significant effect arising from the third lot change of antibody (Ciba Corning lots 24874 and 25944), indicated by open arrow 1 in Fig. 1. An opposite effect appeared to be associated with the ninth lot change of [125I]T4 (Du Pont lots AT81940 and AT92340), indicated by open arrow 2 in Fig. 1. Unfortunately, despite conscious efforts to stagger all reagent lot changes, the eighth lot change of [125I]T4 (Du Pont lots AT72240 and AT81940) occurred in the assay batch immediately before open arrow 1 in Fig. 1. When the analysis was repeated with [125I]T4 parameters entered first, the effect was now "explained" by the eighth [125I]T4 lot change and the antibody contrast was no longer significant. Note the similarity between parameter values for A3 and L8 in Table 1. This effect might have been due to a lot change of either antibody or [125I]T4, but we were unable to ascertain which.

The effects associated with changes of calibration materials are the difference between the mean of all results preceding the change and the mean of all results following the change. The first calibration change produced the largest effects (see parameter C1 in Table 1), but at least some of this can be attributed to subsequent effects summarized by parameters A3, L8, and L9 in Table 1 and indicated by the open arrows in Fig. 1. Interpretation of results (Table 1) requires cognizance of the order in which parameters appeared in the model. Inspection of Fig. 1 might suggest, for example, that longitudinal drift effects were operating in some sequences of results. None of the linear or quadratic drift parameters was statistically significant, but this almost certainly reflects the low importance attached to them in the parameter ordering, i.e., we chose to attach greater importance to reagent lot effects as explanatory variables. This approach to analyzing QC data probably merits further inquiry.

Finally, a potentially important factor that is sometimes overlooked is the suitability of the specimens. Table 2 contains data from an evaluation of a thyrotropin (TSH) assay (6). The authors acknowledged that the evaluation was conducted with a single lot of all reagents, and qualified their results by stating, "observed interassay precisions might increase if multiple lots of reagents are used." Equally important is the systematic precision difference exhibited by two "modified" control materials that were analyzed under identical conditions. Both materials are no doubt perfectly satisfactory for the primary role of QC, i.e., are today's results in control? However, in this case, one could assert that one or the other (and possibly both) fails to address the question: What is the day-to-day reproducibility of clinical specimen measurement?


Table 2. Interassay precision (n = 76) for TSH measured over 19 working days.1


   Discussion
Many published estimates of assay performance have effectively captured best-case conditions, largely attributable to the constraints inherent in a preliminary evaluation. An estimate of the best that can be expected from an assay is a perfectly legitimate and useful indicator. However, as Krouwer (1) implied, corresponding information on worst-case performance is somewhat scarcer. Spencer et al. (7) recently echoed this general theme by calling for greater realism in evaluating TSH assays. They suggested repetitive analysis of several clinical specimens (or pools) over a clinically relevant period (6–8 weeks in the case of TSH), using at least two reagent lots, with random ordering of the specimens. We concur, but also suggest that the 6–8 week "experiment" be continued indefinitely.

Our experience over 2½ years indicates that total error can be markedly variable. The specimen randomization suggested by Spencer et al. (7) should directly estimate the average total error over the time frame of the data. A case could be made, however, for aiming to capture the worst case of total error by using (QC) specimens in a more systematic way. This requires some thought about the error sources that apply to a particular assay system, then deployment of (QC) specimens in a way that deliberately captures the worst case (or approximately the worst case) of these errors. We considered the QC design given in Materials and Methods to be appropriate for a manual T4 RIA but we are certainly not averse to criticism or suggestions. Quite different designs might be appropriate for automated instruments to provide for alternative calibration strategies and random access measurement. One immediate advantage of systematic designs is that they are likely to reveal problems more quickly than a randomization, and particularly when the problem is one that develops during routine operation (e.g., a within-batch or within-day drift problem). They also provide data that are better suited to the summaries given in Figs. 2 and 3. These data express the variable nature of total error. The size of the difference between best- and worst-case estimates is itself useful information on the robustness of a method.

Our presentation of results may give the impression that considerable statistical analysis is unavoidable. In fact, the worst-case data in the inset of Fig. 2 involved only the calculation of 16 means and variances, hardly excessive given the time frame of the data. In practice, regular updates over shorter time frames (e.g., a moving 6-month window) would involve the calculation of perhaps 3–10 means and variances, depending on the number of specimens used. Results could be equally well displayed in either a table (e.g., Table 2) or graphically (e.g., the inset of Fig. 2). In most cases, existing methods of summarizing and presenting QC results would be perfectly adequate. More detailed statistical analyses, such as those attempted here, have the potential to be useful, but they are certainly not essential. The most important factor is the data: Do they realistically characterize the error sources that affect clinical results?

A comparison of Figs. 2 and 3 shows that operator effects have an important influence on total error of our T4 RIA. It would be easy to surmise that this is largely (or solely) a feature of manual assays. However, automated instruments may also give rise to genuine operator effects if some staff are less fastidious than others, e.g., failing to warm reagents to recommended temperatures, failing to properly follow priming, calibration, or maintenance procedures, etc. Spencer et al. (8) recently published strong evidence of an "environmental" factor in the performance of some automated instruments. In particular, performance by manufacturers was shown to be consistently better than performance in clinical laboratories. In addition, instruments (fully automated and otherwise) have been known to fail from time to time. Total error in the period leading up to a breakdown may differ systematically from the total error immediately after corrective action.

The most disconcerting finding from this longitudinal view of a T4 RIA was the poor performance in calibration. New calibrators are tested by assaying them in a routine clinical assay, then recalculating the assay with the new calibrators. This process is repeated over at least three batches to accumulate >200 paired clinical specimen results (i.e., new calibrator results vs current calibrator results). The paired results are subjected to regression analysis (9) with the pass/fail criteria that 95% confidence intervals of slope and intercept must enclose 1 and 0 respectively. All four regressions predicted the direction of the changes indicated in Table 1 but underestimated the magnitude. These events occurred 6–7 months apart and, at the time, small lot-to-lot differences, apparently within sampling error of the relation y = x, did not seem to be a serious problem. A calculation comparing the means of the first 20 and last 20 results in each QC lot, then summing up the differences, suggested that assay results have drifted upwards by ~6% at the high QC concentration, increasing to ~18% at the low QC concentration. The assay is probably not out of control in the usual sense of the term, but the failure to maintain tighter internal consistency was greatly disappointing. A reevaluation of our methods of preparing and testing calibration specimens is clearly required. The summaries in Figs. 2 and 3 capture errors during ~6-monthly QC runs, and we believe them to be realistic estimates of error from a clinical monitoring perspective. It is not yet clear to us how to calculate overall error (i.e., incorporating the long-term upward drift) in a way that expresses its effect on diagnostic utility. We received no queries or complaints from clinicians during the last 2½ years and it is therefore unclear whether the long-term drift has improved or worsened the relation between assay results and the stated clinical reference range. Estimation of a new reference range is also clearly required. The lack of a reference method continues to be a problem for many immunoassays.
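The long-term drift calculation mentioned above (first 20 vs last 20 results in each QC lot, summed over lots) might look like the following sketch. How the per-lot differences were converted to percentages is not stated in the text, so expressing each difference relative to the mean of the first results in that lot is an assumption, as are the function names and the data.

```python
from statistics import mean


def lot_drift_percent(results, n=20):
    """Percentage change from the mean of the first n to the last n results."""
    first = mean(results[:n])
    last = mean(results[-n:])
    return 100.0 * (last - first) / first


def cumulative_drift_percent(lots, n=20):
    """Sum of within-lot drifts over consecutive QC lots at one concentration."""
    return sum(lot_drift_percent(results, n) for results in lots)


# Made-up example with tiny lots, so n is reduced to 2 for illustration.
lots = [
    [100.0, 100.0, 102.0, 102.0],  # lot 1 drifts upwards by 2%
    [50.0, 50.0, 51.0, 51.0],      # lot 2 (new target value) drifts by 2%
]
total = cumulative_drift_percent(lots, n=2)
```

Summing within-lot drifts sidesteps the fact that each new QC lot has a different target concentration, so results cannot simply be compared across lots.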

Overall, the total analytical error (as defined in this work) for most clinical specimens falls within the limits defined in Fig. 3. Alternatively, the worst-case imprecision profile in Fig. 2 is a reliable estimate of average total error for most specimens. These statements apply only to clinical monitoring situations and must be qualified by the long-term drift factor mentioned in the previous paragraph. A small number of specimens (<1 in 1500) produced clearly discordant overestimation of T4, invariably shown to be due to the presence of T4 binding substances (presumably autoantibodies). Other interfering substances, or lesser amounts of binding substances, are almost certainly exerting effects, but of insufficient magnitude to be detected (this may not be a problem from a monitoring perspective if the amount of interferent remains roughly constant over the clinical surveillance period). Current opinion (2) suggests that total analytical error should not exceed one-half of the intraindividual biological variation. This limits the increase in biological variation due to analytical errors to ~10%. On this basis, maximum total error for T4 assays has been estimated as 3.8% (10), 5.5% (11), 2.4% (12), 2.5% (13), and 1.8% (14). Taking 3% as a reasonable consensus value, it is clear from Fig. 2 that our T4 RIA fails to achieve this particular clinical goal on average, although it may occasionally come close (Fig. 3). Total error ≤3% is a difficult target for current T4 assays.
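The arithmetic behind the "one-half of biological variation" goal can be checked with a short sketch. It assumes that analytical and biological variation are independent and therefore add in quadrature; that assumption, and the function names, are not taken from the cited references.

```python
from math import sqrt


def inflation(ratio):
    """Fractional increase in observed CV when the analytical CV equals
    `ratio` times the intraindividual biological CV (independent errors)."""
    return sqrt(1.0 + ratio ** 2) - 1.0


def max_analytical_cv(biological_cv, ratio=0.5):
    """Analytical CV goal implied by a given biological CV."""
    return ratio * biological_cv


increase = inflation(0.5)      # about 0.12, i.e. roughly a 12% increase
goal = max_analytical_cv(6.0)  # a 6% biological CV implies a 3% goal
```

Under this simple model the increase is about 12%, of the same order as the ~10% quoted above. A consensus goal of 3% corresponds to an intraindividual biological CV of about 6%.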

The multifactor experiments detailed by Krouwer (1) are a very efficient means of detecting and measuring factor effects and seem to us to be ideally suited either to short-term preliminary evaluations or to investigating problems that may arise during routine operation. Unfortunately, estimates of total error that might be constructed from such data are likely to be optimistic in some assay systems. For example, operator and calibrator effects, important sources of error in our T4 RIA, require more extensive data than can be reasonably expected from a snapshot experiment. There are likely to be many different opinions or ideas on how realistic estimates of assay performance are best obtained. We favor internal QC because it is already standard practice for all analytes in all laboratories. The data flow is continuous and reflects genuine clinical working conditions. In most cases, existing methods of summarizing QC data would be perfectly adequate, but this need not exclude additional statistical analysis. We believe that there is scope for innovative QC that not only directly captures realistic performance data but also allows factor effects to be assessed on a continuous basis (albeit with less efficiency than designed experiments). Simplicity is also important; bench staff are usually not statisticians and often work under considerable pressure. Our staff are currently required only to glance at a wall chart to ascertain the position and ordering of QC specimens. Provided they are recognized, occasional position/ordering mistakes need have no adverse effect on any subsequent analyses. Although our current QC methods can doubtless be improved, any design that requires more than a glance at a wall chart (or similar) is probably unnecessarily complicated.

We used extensive QC data here simply because they happened to be available. Excellent estimates of total analytical error could be obtained from considerably fewer data. Publication of such estimates would be a very useful follow-up to preliminary reports that may have demonstrated only the best of an assay method.


   Footnotes
 
1 Nonstandard abbreviations: QC, quality control; T4, thyroxine; and TSH, thyrotropin.


   References

  1. Krouwer JS. Estimating total analytical error and its sources. Techniques to improve method evaluation. Arch Pathol Lab Med 1992;116:726-731.
  2. Fraser CG, Petersen PH. Desirable standards for laboratory tests if they are to fulfill medical needs. Clin Chem 1993;39:1447-1455.
  3. Sadler WA, Smith MH. A reliable method of estimating the variance function in immunoassay. Comput Stat Data Anal 1986;3:227-239.
  4. Sadler WA, Smith MH. Use and abuse of imprecision profiles: some pitfalls illustrated by computing and plotting confidence intervals. Clin Chem 1990;36:1346-1350.
  5. Draper NR, Smith H. Applied regression analysis. New York: John Wiley and Sons, 1966:156-158.
  6. Ward G, White M, Hickman PE. Simple procedures can markedly enhance automated immunoassay performance. Am J Clin Pathol 1994;102:3-6.
  7. Spencer CA, Takeuchi M, Kazarosyan M. Current status and performance goals for serum thyrotropin (TSH) assays. Clin Chem 1996;42:140-145.
  8. Spencer CA, Takeuchi M, Kazarosyan M, MacKenzie F, Beckett GJ, Wilkinson E. Interlaboratory/intermethod differences in functional sensitivity of immunometric assays of thyrotropin (TSH) and impact on reliability of measurement of subnormal concentrations of TSH. Clin Chem 1995;41:367-374.
  9. Bablok W, Passing H, Bender R, Schneider B. A general regression procedure for method transformation. J Clin Chem Clin Biochem 1988;26:783-790.
  10. Statland BE, Winkel P, Killingsworth LM. Factors contributing to intraindividual variation of serum constituents: 6. Physiological day-to-day variation in concentrations of 10 specific proteins in sera of healthy subjects. Clin Chem 1976;22:1635-1638.
  11. Williams GZ, Widdowson GM, Penton J. Individual character of variation in time-series studies of healthy people. 11. Differences in values for clinical chemical analytes in serum among demographic groups by age and sex. Clin Chem 1978;24:313-320.
  12. Costongs GMPJ, Janson PCW, Bas BM, Hermans J, van Wersch JWJ, Brombacher PJ. Short-term and long-term intra-individual variations and critical differences of clinical chemical laboratory parameters. J Clin Chem Clin Biochem 1985;23:7-16.
  13. Browning MCK, Ford RP, Callaghan SJ, Fraser CG. Intra- and interindividual biological variation of five analytes used in assessing thyroid function: implications for necessary standards of performance and the interpretation of results. Clin Chem 1986;32:962-966.
  14. Ricós C, Arbós MA. Quality goals for hormone testing. Ann Clin Biochem 1990;27:353-358.



