Clinical Chemistry 45: 882-894, 1999;
© 1999 American Association for Clinical Chemistry, Inc.


Articles

Necessary Sample Size for Method Comparison Studies Based on Regression Analysis

Kristian Linnet

Laboratory of Clinical Biochemistry, Psychiatric University Hospital, Skovagervej 2, DK-8240 Risskov, Denmark. Fax 45 86170778; e-mail linnet{at}post7.tele.dk


   Abstract
Background: In method comparison studies, it is of importance to assure that the presence of a difference of medical importance is detected. For a given difference, the necessary number of samples depends on the range of values and the analytical standard deviations of the methods involved. For typical examples, the present study evaluates the statistical power of least-squares and Deming regression analyses applied to the method comparison data.

Methods: Theoretical calculations and simulations were used to consider the statistical power for detection of slope deviations from unity and intercept deviations from zero. For situations with proportional analytical standard deviations, weighted forms of regression analysis were evaluated.

Results: In general, sample sizes of 40–100 samples conventionally used in method comparison studies often must be reconsidered. A main factor is the range of values, which should be as wide as possible for the given analyte. For a range ratio (maximum value divided by minimum value) of 2, 544 samples are required to detect one standardized slope deviation; the number of required samples decreases to 64 at a range ratio of 10 (proportional analytical error). For electrolytes having very narrow ranges of values, very large sample sizes usually are necessary. In case of proportional analytical error, application of a weighted approach is important to assure an efficient analysis; e.g., for a range ratio of 10, the weighted approach reduces the requirement of samples by >50%.

Conclusions: Estimation of the necessary sample size for a method comparison study assures a valid result; either no difference is found or the existence of a relevant difference is confirmed.


   Introduction
A common task in the laboratory is to compare a new method with an established one to assess whether the new measurements are comparable with the existing ones. The question may be whether the new measurements are interchangeable with the existing ones from a clinical point of view, or whether proficiency testing demands are likely to be fulfilled. Various sources may be consulted to evaluate the differences that are considered of importance. Medically significant differences have been assessed on the basis of clinicians' points of view (1), biological variation (2), or combinations of these principles (3). For some analytes, e.g., cholesterol and other lipids, organizations have recommended specific analytical goals that take into account the medical use of these analytes (4)(5). In the context of external quality assessment, regulatory authorities have decided on analytical tolerance limits that should be achieved, e.g., CLIA 88 demands (6). On the basis of these kinds of sources, one may reach a conclusion concerning relevant critical differences that should be detected at one or more decision levels. A typical decision level might be the upper edge of the 95% reference interval. Other levels might be dictated by medical intervention limits, e.g., in the context of serum cholesterol concentrations.

The rational design of a method comparison study should take into account the relevant critical differences that should be detected at selected decision levels. Commonly, the measurements of two analytical methods are compared by a regression analysis procedure, which allows the detection of a possible constant systematic difference (intercept deviation from zero) and a proportional systematic difference (slope deviation from unity). The investigator should then consider whether the study design is likely to disclose these critical differences. Important factors in this context are the range of measurements, the analytical standard deviations (SDas) of the involved methods, and the number of samples. These factors determine the statistical power of a method comparison study, i.e., the ability of the data analysis procedure to verify the presence of a given systematic difference. In this study, some prototype situations in clinical chemistry are evaluated, and guidelines concerning necessary sample sizes are tabulated for typical cases. In situations with constant SDas, unweighted regression procedures are considered, i.e., ordinary least-squares regression analysis (OLR) and Deming regression analysis. For cases involving SDas that are proportional to the measurement level, the corresponding weighted regression procedures are primarily taken into account. In the Appendix, specific formulas are presented for the relationship between statistical power and sample size in regression analysis.


   Materials and Methods
method comparison model
Taking into account that an analytical method measures analyte concentrations with some uncertainty, one must distinguish between the measured value (xi) and the target value (Xi) of a sample subjected to analysis by a given method. The latter is the mean result we would obtain if the given sample were measured an infinite number of times. The measured value is likely to deviate from the target value by some small random amount (ε or δ). For a given sample measured by two analytical methods, we have:

$$x_i = X_i + \varepsilon_i, \qquad y_i = Y_i + \delta_i$$

The dispersion of measured values around the target value depends on the SDa of the method. A linear relationship between the target values of the two methods is assumed:

$$Y_i = \alpha_0 + \beta X_i$$

To estimate α0 and β correctly, a regression procedure taking errors in both x and y into account is preferable, e.g., the Deming method (7)(8)(9)(10). In this procedure, the sum of squared distances from measured sets of values (xi, yi) to the regression line is minimized at an angle determined by the ratio between the SDas of x and y. In Fig. 1A, the symmetric case is illustrated with a regression slope of one and equal SDas for x and y. The most widely used regression procedure in method comparison studies, OLR, does not take errors in x into account and thus provides a downward biased slope estimate (7). In situations with a wide range of x values, this bias may be negligible and OLR may be used for estimation of slope and intercept. In OLR, the sum of squared distances from (xi, yi) to the line is minimized in the vertical direction (Fig. 1B).



Figure 1. Schematic outline of distances from data points to the line in Deming regression (A) and outline of vertical distances from data points to the line in OLR (B).

In panel A, SDax and SDay are equal, and the slope is one.
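
The two fitting principles can be sketched in a few lines of Python. This is a minimal illustration, not the CBstat implementation: it uses the standard textbook closed form for unweighted Deming regression, and the names `olr`, `deming`, and `var_ratio` (the assumed known ratio SDay²/SDax²) are hypothetical choices made here.

```python
import math

def _centered_sums(x, y):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return mx, my, sxx, syy, sxy

def olr(x, y):
    """Ordinary least squares: only vertical distances are minimized."""
    mx, my, sxx, _, sxy = _centered_sums(x, y)
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)

def deming(x, y, var_ratio=1.0):
    """Unweighted Deming regression (errors in both x and y).

    var_ratio = SDay**2 / SDax**2 is assumed known, e.g. from QC data;
    1.0 corresponds to the symmetric case of Fig. 1A.
    """
    mx, my, sxx, syy, sxy = _centered_sums(x, y)
    d = syy - var_ratio * sxx
    b = (d + math.sqrt(d * d + 4.0 * var_ratio * sxy * sxy)) / (2.0 * sxy)
    return my - b * mx, b  # (intercept, slope)
```

For error-free data both functions return the same line; with random error in x, the Deming slope is at least as steep as the OLR slope, which is the attenuation effect described above.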

Another methodological problem concerns the question of whether the SDas are constant. For most clinical chemical compounds, the SDas vary with the measurement level (11). In cases with a considerable range, i.e., a decade or more, this phenomenon should also be taken into account in the regression analysis. The Deming method should then be carried out as a weighted analysis, e.g., assuming proportional SDas (12). In the weighted modification of the Deming procedure, distances from (xi, yi) to the line are inversely weighted according to the squared SDas at a given level (Fig. 2A). In the same way, least-squares regression analysis may also be carried out in a weighted modification (WLR), in which the distances from (xi, yi) to the line in the vertical direction are inversely weighted according to the squared SDa value (Fig. 2B) (9)(13). The regression procedures, which are outlined in the Appendix, were performed using the program CBstat developed by the author. The program runs under Windows 95/98/NT.



Figure 2. Distances from data points to the line in weighted Deming regression assuming proportional SDas (A) and vertical distances from data points in WLR (B).

In panel A, SDax and SDay are assumed equal.
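
The weighting idea behind WLR can be illustrated with a short sketch. It assumes that the SDa is proportional to the level and, as a simplification made here, uses the measured x value as a stand-in for the level; the function names and the `cv` parameter are hypothetical. The iterative weighted Deming procedure is not reproduced.

```python
def wlr(x, y, weights):
    """Weighted least squares: vertical distances weighted by `weights`."""
    sw = sum(weights)
    mx = sum(w * xi for w, xi in zip(weights, x)) / sw
    my = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxx = sum(w * (xi - mx) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mx) * (yi - my) for w, xi, yi in zip(weights, x, y))
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)

def proportional_weights(x, cv):
    """Inverse squared SDa, assuming SDa = cv * x (constant CVa)."""
    return [1.0 / (cv * xi) ** 2 for xi in x]
```

With these weights, a deviating point at a high concentration pulls the line less than it does in an unweighted fit, because its expected SDa is larger.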

detection of systematic differences between methods
A systematic difference between two methods is identified if the estimated intercept differs significantly from zero or if the slope deviates significantly from 1. This is decided on the basis of t-tests:

$$t = \frac{a_0 - 0}{\mathrm{SE}(a_0)}, \qquad t = \frac{b - 1}{\mathrm{SE}(b)}$$

SE(a0) and SE(b) are the standard errors of the estimated intercept a0 and slope b, respectively. For OLR and WLR, the standard errors are calculated from formulas; for the Deming and weighted Deming procedures, one may apply a computerized resampling principle called the jackknife procedure (Appendix) (12)(13)(14)(15). Notice that Latin letter symbols, e.g., a0 and b, denote sample estimates, whereas Greek letters are used for true, population values (α0 and β in this case).
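
A delete-one jackknife for the slope test might be sketched as follows. This is an illustration under stated assumptions rather than the Appendix procedure itself: the pseudo-value form of the jackknife is the standard one, the Deming slope uses the textbook closed form with equal SDas, and all function names are hypothetical.

```python
import math

def jackknife(estimator, x, y):
    """Delete-one jackknife; returns (estimate, standard error).

    x and y are lists; estimator(x, y) returns a single number.
    """
    n = len(x)
    full = estimator(x, y)
    pseudo = []
    for i in range(n):
        reduced = estimator(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        pseudo.append(n * full - (n - 1) * reduced)  # pseudo-value i
    est = sum(pseudo) / n
    var = sum((p - est) ** 2 for p in pseudo) / (n - 1)
    return est, math.sqrt(var / n)

def deming_slope(x, y):
    """Unweighted Deming slope, assuming equal SDas for x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    d = syy - sxx
    return (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)

def t_statistic_slope(x, y):
    """t = (b - 1) / SE(b), with SE(b) from the jackknife."""
    b, se = jackknife(deming_slope, x, y)
    return (b - 1.0) / se
```

The resulting statistic would be compared with the critical t value for N - 2 degrees of freedom; the intercept can be tested in the same way by passing an intercept estimator to `jackknife`.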

Having estimated a0 and b, it is possible to estimate the systematic difference between the methods, Dc, at a selected level Xc (Fig. 3):

$$D_c = Y_{\mathrm{est},c} - X_c = a_0 + (b - 1) X_c$$

Yestc is the estimated Y value at Xc. Notice that Dc refers to the systematic difference, i.e., the difference between target values, and so it is not a total error including random measurement errors.



Figure 3. Illustration of the systematic difference Δc between two methods at a given level Xc according to the regression line.

The difference is a result of a constant systematic difference (intercept deviation from zero) and a proportional systematic difference (slope deviation from unity). The dotted line represents the diagonal Y = X.

In the planning phase of a method comparison study, one should consider the size of the medically significant difference or critical difference, Δc, that should be detected at a given level Xc. One may then derive the needs for detecting critical sizes of α0 and slope deviation from unity (β - 1):

$$\Delta_c = \alpha_0 + (\beta - 1) X_c$$

Evaluations of slope and intercept deviations are presented in the following sections.
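
As a small illustration, the relation Δc = α0 + (β - 1)Xc can be evaluated for the two pure cases in which the whole critical difference is ascribed either to the intercept or to the slope. The function names below are hypothetical, and the numbers in the usage note are the potassium values used later in the text.

```python
def systematic_difference(a0, b, x_c):
    """Systematic difference at level x_c for intercept a0 and slope b."""
    return a0 + (b - 1.0) * x_c

def critical_deviations(delta_c, x_c):
    """Critical intercept (slope = 1) and critical slope deviation (intercept = 0)."""
    return {"intercept": delta_c, "slope_deviation": delta_c / x_c}
```

For Δc = 0.35 mmol/L, `critical_deviations(0.35, 3.0)` gives a slope deviation of about 0.12, and `critical_deviations(0.35, 6.0)` gives about 0.06, so the higher decision level is the more demanding one for the slope.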

statistical power considerations in method comparison studies
Having decided from above the values of α0 and (β - 1) that should be detected, the next step is to design the method comparison study appropriately. A relevant range of values should be included, i.e., covering the range of clinical interest. A uniform distribution of values over the range is preferable and is primarily supposed in the present evaluation. We next must consider the SDas of the methods. Two situations should be considered: the presence of constant and nonconstant SDas. The first possibility is of interest mainly for measurements involving a narrow range of values, e.g., in electrolyte methods. When ranges cover one or more decades, it is important to take into account that the SDa usually varies with the measurement level (11). Quite often a proportional relationship approximately applies in clinical chemistry, implying that the analytical coefficient of variation (CVa) is approximately constant over the measurement range. Finally, specific values for the SDas or CVas of the methods should be assigned from available quality-control data. According to the formula presented below and explained in more detail in the Appendix, we now have the necessary information to plan the comparison study, i.e., to decide on the necessary number of samples.

A general, simplified formula for the approximation of the necessary sample size for detection of a difference Δ with regard to slope deviation from unity or intercept deviation from zero (16) is:

$$N = \left[ \frac{(t_{p/2} + t_{1-q})\, c}{\Delta} \right]^2$$

where c is a constant determining the standard error (c/√N) of the estimated difference D, which corresponds to the true difference Δ. tp/2 depends on the significance level P (type I error) and is 1.96 (asymptotically) for P = 5%. t1 - q reflects the statistical power (1 - q), which is the probability of verifying a real difference Δ. The complement to the power is the type II error (q), which is the probability of overlooking a real difference Δ. For a traditional power level of 90%, t1 - q takes the value 1.28. Finally, the sample size is in principle inversely related to the squared difference Δ, i.e., if a given difference is halved, the sample size requirement is increased by a factor of four. At small-to-moderate sample sizes, however, adjustment of the t value from the asymptotic value and the general impact of approximations disturb this relationship somewhat (Appendix).
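
The asymptotic form of this relationship is easy to evaluate numerically. The sketch below uses normal quantiles (1.96 and 1.28 at the default settings) in place of t values, so it ignores the small-sample adjustment mentioned above; `required_n` and its parameter names are hypothetical.

```python
from statistics import NormalDist

def required_n(c, delta, alpha=0.05, power=0.90):
    """Asymptotic sample size for detecting `delta` when SE(D) = c / sqrt(N)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)              # about 1.28 for 90% power
    return ((z_alpha + z_power) * c / delta) ** 2
```

Halving `delta` quadruples the result, which is the inverse-square relationship described in the text. The value should be rounded up and, at small N, treated as a lower bound.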

The relationship between the null hypothesis of no difference and the alternative hypothesis of a real difference Δ is depicted schematically in Fig. 4. The figure outlines the hypothetical situation in which a set of repeated method comparison studies yields observed differences D distributed around the true difference, which is zero under the null hypothesis and equal to Δ under the alternative hypothesis. The larger the sample size, the narrower the dispersion of observed differences around the true value. Thus, for a given Δ and type I error, the power increases with the sample size.



Figure 4. Schematic illustration of distributions of differences D under the null hypothesis (H0) of no real difference and the alternative hypothesis (HA) of the presence of a real difference Δ.

The vertical dotted line indicates the limit of statistical significance. p is the type I error (5%), and 1 - q is the power (90%).
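
The situation in Fig. 4 can be turned into an approximate power calculation. This sketch assumes that D is normally distributed with standard error c/√N and neglects the small probability of rejecting in the wrong direction; `approx_power` is a hypothetical helper, not a formula quoted from the Appendix.

```python
import math
from statistics import NormalDist

def approx_power(n, c, delta, alpha=0.05):
    """Approximate probability of detecting a true difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # significance limit in SE units
    shift = abs(delta) * math.sqrt(n) / c              # true difference in SE units
    return NormalDist().cdf(shift - z_alpha)
```

For fixed `c`, `delta`, and `alpha`, the returned power rises toward 1 as `n` grows, which corresponds to the two distributions in the figure becoming narrower and overlapping less.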


   Results
The necessary sample sizes for a series of standard method comparison situations in clinical chemistry have been tabulated in Tables 1 and 2. A type I error (significance level) of 5% and a power of 90% have been supposed. Table 1 concerns the situation with constant SDas over the measurement range, and Table 2 covers cases with proportional SDas.


Table 1. Sample size table for comparison of methods with constant SDas using Deming regression analysis.1


Table 2. Sample size table for comparison of methods with proportional SDas using weighted Deming regression analysis.1

constant SDas
Table 1 covers intervals with ratios from 1.25 to 10 for the maximum value divided by the minimum value (range ratio = maximum value/minimum value). The other entry in Table 1 is the standardized Δ value for slope or intercept. With regard to the slope, this value refers to the slope deviation from unity measured in CVa units:

$$\text{Standardized slope } \Delta = \frac{\beta - 1}{\mathrm{CV}_a}$$

Notice that although the SDa is constant, the CVa (expressed as a fraction) enters the formula, implying that the CVa is not constant over the measurement range. The CVa in the formula refers to the specific CVa value at the middle of the interval of interest, i.e., CVa = SDa/xm, where xm is the mean of the interval for the analytes (Appendix).

With regard to the intercept deviation from zero, the standardized Δ value is:

$$\text{Standardized intercept } \Delta = \frac{\alpha_0}{\mathrm{SD}_a}$$

The standard situations presuppose that the analytical methods have identical SDas and that the analyte values are uniformly distributed over the intervals of interest. Notice that if duplicate measurements are carried out by each method, the SDa value for single measurements should be reduced by a factor of √2. Table 1 has only selected entry values. With regard to standardized deviations that are not covered, an approximate sample size may be obtained by inter- or extrapolation. At large sample sizes, squared inter- or extrapolation is reasonable, but for small sample sizes, this relationship is not exactly valid. Approximate interpolations can also be carried out for the tabulated range ratios. Given assumptions not covered in Table 1, estimation of N may be performed on the basis of formulas described in the Appendix. Moreover, one should consider adding some additional samples to take nonideal conditions into account, such as target value distributions that are not exactly uniform over the given interval (discussed later in the text). The tabulated values refer to application of Deming regression analysis. If OLR is applied, the required sample sizes are one-half the tabulated values. However, to apply OLR correctly, the x measurements should be without random measurement errors.

It is apparent from Table 1 that the range of values is very important with regard to the required sample size for detection of a given standardized slope or intercept deviation. For the very narrow ranges characteristic of electrolyte measurement methods, detection of a standardized slope or intercept deviation equal to one may require >1000 samples. On the other hand, the sample size requirements may be rather modest for analytes with values dispersed over wider ranges.

In the next section, Table 1 is used for evaluation of sample size requirements for a method comparison study of two electrolyte methods, i.e., a situation with a small range ratio and constant SDas.

planning a comparison of two potassium methods (constant SDas)
We first decide on the critical differences that should be detected. For convenience, we take as a basis the CLIA 88 rule of 0.5 mmol/L as the acceptable error throughout the measurement range. Notice that CLIA 88 rules relate to the total error in relation to a target value of a quality-control sample:

$$\text{Total error} = |\text{systematic difference}| + 1.65\,\mathrm{SD}_a$$

The factor 1.65 corresponds to a total error that assumes that 95% of the measurements are within the given limit.

We consider here decision levels of 3 and 6 mmol/L and suppose in the present example that the SDa is 0.09 mmol/L, which corresponds to a CVa of 2% at the mean (4.5 mmol/L) of the considered range. Thus, the systematic difference that should be detected is:

$$\Delta_c = 0.5 - 1.65 \times 0.09 = 0.35\ \text{mmol/L}$$

From the general formula:

$$\Delta_c = \alpha_0 + (\beta - 1) X_c$$

we obtain at Xc = 3 mmol/L:

$$\pm 0.35\ \text{mmol/L} = \alpha_0 + (\beta - 1) \times 3\ \text{mmol/L}$$

This corresponds to a need for detecting α0 equal to ± 0.35 mmol/L, if the systematic difference is ascribed to an intercept deviation. Relating the systematic difference to a slope deviation corresponds to a demand of detecting β = 1.12 (3.35/3), or 0.88. Similarly, at the upper decision level of Xc = 6 mmol/L, we have again the limits ± 0.35 mmol/L for detection of α0, but now the demand for detecting a slope deviation has been sharpened to β = 1.06 (6.35/6) or 0.94.

Let us now consider the various factors in the estimation of sample size. The first factor to consider is the measurement range of 3–6 mmol/L, i.e., a range ratio of 2. We suppose, as mentioned above, that both methods have constant SDas corresponding to a CVa of 2%, i.e., 0.09 mmol/L, at the middle of the range. If duplicate sets of measurements are taken, the SDa is reduced by a factor of √2, to 0.06 mmol/L. If we assume duplicate sets of measurements, the CVa at the middle of the interval is 0.014. We are now able to convert the slope Δ value to a standardized value, using the more demanding requirement at 6 mmol/L (β - 1 = 0.06):

$$\frac{\beta - 1}{\mathrm{CV}_a} = \frac{0.06}{0.014} = 4.3$$

The next factor to consider is the regression procedure, in this case, Deming regression analysis with a significance level (type I error) of 5% and a statistical power of 90%. To get the necessary sample size, we consult Table 1 and look under a range ratio of 2 and a standardized slope deviation of 4 and find the sample size, N = 41. When we use squared extrapolation, the sample size becomes 36 [41 × (4/4.3)²]. Notice that this value refers to 36 samples measured in duplicate with each method. If only single measurements are to be performed, the required number of samples would be doubled to 72.

Table 1 also covers cases studied by the use of OLR. Under the given assumptions, the approximate sample size requirement for OLR is obtained by dividing the numbers by two, i.e., N = 18 in this case (see Appendix). However, a correct statistical analysis based on OLR requires that the x method is without analytical errors (SDax = 0).

For the intercept, we want to detect a deviation of ± 0.35 mmol/L, which may be converted to:

$$\frac{\alpha_0}{\mathrm{SD}_a} = \frac{0.35}{0.06} = 5.8$$

According to Table 1, the sample size requirement is N < 40. By squared extrapolation, we obtain the approximate value N = 19 [40 × (4/5.8)²]. Thus, in this example, the sample size requirement with regard to testing for intercept deviation from zero is less demanding than that of testing for a critical slope deviation.
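
The arithmetic of the potassium example can be retraced in a short script. The inputs are the rounded values quoted in the text (0.35 mmol/L, 0.06 mmol/L, 0.014), the table entries 41 and 40 at a standardized deviation of 4 are taken from the passage above, and rounding the extrapolated value up to the next integer is an assumption made here; `squared_extrapolation` is a hypothetical helper name.

```python
import math

def squared_extrapolation(n_tab, delta_tab, delta_target):
    """Scale a tabulated N by the squared ratio of standardized deviations."""
    return math.ceil(n_tab * (delta_tab / delta_target) ** 2)

# Critical systematic difference from the CLIA 88 limit (total error 0.5 mmol/L).
delta_c = round(0.5 - 1.65 * 0.09, 2)           # 0.35 mmol/L
slope_dev = round(delta_c / 6.0, 2)             # 0.06 at Xc = 6 mmol/L

# Duplicate measurements: rounded values as quoted in the text.
sd_dup, cv_dup = 0.06, 0.014
std_slope = slope_dev / cv_dup                  # about 4.3
std_intercept = delta_c / sd_dup                # about 5.8

n_slope = squared_extrapolation(41, 4, std_slope)          # 36 duplicates
n_intercept = squared_extrapolation(40, 4, std_intercept)  # 19 duplicates
n_single = 2 * n_slope                                     # 72 single measurements
n_olr = n_slope // 2                                       # 18, valid only if SDax = 0
```

Carrying unrounded intermediate values through the calculation gives slightly different numbers (for example, a standardized slope deviation nearer 4.1), so the results should be read as planning estimates rather than exact requirements.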

proportional SDas
Table 2 covers application of weighted Deming regression analysis in situations with proportional SDas (constant CVas) and includes intervals with range ratios extending from 2 to 100. The standardized slope deviation here is:

$$\text{Standardized slope } \Delta = \frac{\beta - 1}{\mathrm{CV}_a}$$

and the standardized intercept deviation from zero is:

$$\text{Standardized intercept } \Delta = \frac{\alpha_0}{\mathrm{CV}_a \, x_m}$$

CVa should be expressed as a fraction, not a percentage, and xm is the midpoint of the interval of interest (Appendix). Application of Table 2 presupposes that the analytical methods have identical CVas and that the analyte values are uniformly distributed over the intervals of interest. The CVa refers to single or duplicate sets of measurements as appropriate. Approximate sample sizes for the application of WLR, assuming CVax = 0, are derived by division of the sample size values by two.

With regard to the testing of slope deviations, the strong impact on sample size of the range of values is apparent. The required number of samples is of the same order of magnitude as for analogous situations with constant SDas. For example, detection of one standardized slope deviation with the given type I error and power requires >500 samples when the range ratio is 2, but only 37 samples for a range ratio of 50 or higher.

For deviations in the intercept, high sample size requirements are also present at low range ratios, e.g., 521 observations are necessary to detect one standardized deviation at a range ratio of 2, decreasing to ~70 at a range ratio of 5 and finally a negligible sample size at high range ratios. The required number of samples is the same order of magnitude as in analogous cases with constant SDas. Notice that intercept deviations here are standardized with regard to the CVa multiplied by xm, which corresponds to the SDa value in the constant case with the CVa computed at the mean of the interval.

For standardized deviations that are not covered, the approximate sample size may be obtained by squared inter- or extrapolation. In the next section, the sample size requirement is considered for a comparison of two glucose methods with proportional SDas.

planning a comparison of two glucose methods (proportional SDas)
We first decide on the critical differences that should be detected. For convenience, we again apply CLIA 88 rules, which correspond to Δ = 10% or 60 mg/L (6 mg/dL) at low levels. We consider a decision level corresponding to a fasting plasma glucose concentration of 1260 mg/L (126 mg/dL), which is a diagnostic limit for diabetes (17). Again we notice that CLIA 88 rules relate to the total error. Assuming proportional SDas with a CVa of 3% for both methods, we obtain the critical systematic difference as:

$$\Delta_c = 10\% - 1.65 \times 3\% \approx 5\%$$

We have:

$$\Delta_c = \alpha_0 + (\beta - 1) X_c$$

which corresponds to:

$$\pm 0.05 \times 1260\ \text{mg/L} = \alpha_0 + (\beta - 1) \times 1260\ \text{mg/L}$$

or

$$\pm 63\ \text{mg/L} = \alpha_0 + (\beta - 1) \times 1260\ \text{mg/L}$$

This relationship translates to a requirement of detecting α0 equal to ± 63 mg/L (± 6.3 mg/dL), if the systematic difference is related to an intercept deviation. Ascribing the systematic difference to a slope deviation implies a demand of detecting β = 1.05 or 0.95.

Let us now consider the following conditions: a measurement range of 600–3000 mg/L (60–300 mg/dL), i.e., a range ratio of 5; the midpoint of the interval (xm) = 1800 mg/L (180 mg/dL); and for both methods, proportional SDas with CVas = 0.03, which means 0.02 for duplicate sets of measurements by each method (0.03/√2). For the regression procedure, we use the weighted Deming regression analysis with a significance level (type I error) of 5% and a statistical power of 90%.

We standardize the slope deviation:

$$\frac{\beta - 1}{\mathrm{CV}_a} = \frac{0.05}{0.02} = 2.5$$

From the column corresponding to a range ratio of 5 in Table 2, we find the required sample size to be between 17 and 33 duplicate measurements for each method. Interpolation gives N = 25 [17 × (3/2.5)²]. If only single sets of measurements are performed, 50 samples are required. For WLR assuming CVax = 0, approximately one-half the number of samples are required (N = 13).

For the intercept, we want to detect a deviation of ± 63 mg/L (± 6.3 mg/dL), which may be converted to:

$$\frac{\alpha_0}{\mathrm{CV}_a \, x_m} = \frac{63}{0.02 \times 1800} = 1.75$$

From Table 2, we obtain N = 24 by squared interpolation, i.e., close to the number necessary for testing the critical slope deviation.

Other decision levels might also be considered, e.g., the nonfasting plasma glucose concentration limit of 2000 mg/L (200 mg/dL) as being diagnostic for diabetes (17). In this proportional error model, the standardized slope deviation to be detected is the same at all decision levels, but the requirement for intercept detection varies with the level; therefore, the most demanding situations occur at low levels.
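
The glucose example can be retraced in the same way. The sketch uses the rounded values quoted in the text (critical difference 5%, CVa of 0.02 for duplicates) and the Table 2 entry of 17 samples at a standardized slope deviation of 3; rounding up is an assumption made here, and the variable names are hypothetical. The intercept result (N = 24) is not recomputed because the underlying table entries are not quoted in the text.

```python
import math

x_c = 1260.0        # decision level, mg/L
x_m = 1800.0        # midpoint of the 600-3000 mg/L range, mg/L
cv_dup = 0.02       # CVa for duplicate measurements, as rounded in the text

# Critical systematic difference: 10% total error minus 1.65 x CVa (3%).
delta_frac = round(0.10 - 1.65 * 0.03, 2)     # 0.05
alpha0_crit = round(delta_frac * x_c)         # 63 mg/L

std_slope = delta_frac / cv_dup               # 2.5
std_intercept = alpha0_crit / (cv_dup * x_m)  # 1.75

# Squared interpolation from N = 17 at a standardized slope deviation of 3.
n_slope = math.ceil(17 * (3.0 / std_slope) ** 2)   # 25 duplicates
n_single = 2 * n_slope                             # 50 single measurements
n_wlr = math.ceil(n_slope / 2)                     # 13, valid only if CVax = 0
```

Because the slope is standardized by a constant CVa, `std_slope` does not depend on `x_c`, whereas the critical intercept scales with the decision level, as noted above.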

influence of target value distributions on sample size requirements
In the guidelines and examples outlined above, a basic assumption has been the uniform distribution of target values on the intervals characterized by a given range ratio. A uniform distribution of values throughout the range of interest is generally recommended in method comparison studies, but this ideal may not always be attained. If the comparison study is based on samples from a healthy population, the distributions of target values may be gaussian, e.g., for serum concentrations of electrolytes, or skewed. Furthermore, if samples are included from both healthy subjects and patients, skewed distributions usually arise. A common situation may be represented by a mixture of 75% of samples from healthy or nearly healthy subjects and 25% from diseased subjects, giving rise to a distribution skewed to the right. It is possible to outline some rough guidelines concerning these situations, which are common in real world examples.

For a gaussian distribution of target values, the major portion (95%) of the observations are located within ± 2 SD from the mean. Thus, the gaussian case may be compared to a situation with a uniform distribution spanning this interval. For the cases with constant SDas, the necessary sample sizes read from Table 1 should be multiplied by a factor of 1.3 to cover equivalent cases with gaussian target value distributions. This factor applies to both the slope and the intercept. For skewed distributions with 75% of the target values located on the lower half of the intervals, the factor for the slope is 1.4. No general correction factor can be applied for the intercept, but the necessary sample sizes extend from 1 to 1.3 times those listed for uniform target value distributions, with the highest factor adjustments pertaining to the narrowest intervals.

For cases involving proportional SDas, the correction factors are as follows. For gaussian distributions of target values, the sample size requirement as regards testing of the slope equals 1.3–1.5 times those listed in Table 2. For the intercept, 1–1.8 times the listed sample sizes are required. For the skewed target value distribution mentioned above, the correction factors range from 1.4 to 1.9 for the slope. For the intercept, the factors extend from 1.4 to 3. The highest correction factors for skewed distributions apply to the intervals with the highest range ratios.
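
Applying the single-valued correction factors for the constant-SDa case is a simple multiplication, sketched below. Only the factors stated as single numbers in the text are included; the ranges quoted for the other cases would need a judgment call. The dictionary and function names are hypothetical, and rounding up is an assumption.

```python
import math

# Multipliers quoted in the text for constant SDas (Table 1 cases).
CONSTANT_SD_FACTORS = {
    ("gaussian", "slope"): 1.3,
    ("gaussian", "intercept"): 1.3,
    ("skewed", "slope"): 1.4,
}

def adjusted_n(n_uniform, distribution, parameter):
    """Scale a sample size derived for uniformly distributed target values."""
    factor = CONSTANT_SD_FACTORS[(distribution, parameter)]
    return math.ceil(n_uniform * factor)
```

For instance, the 36 duplicate samples found for the potassium slope would become `adjusted_n(36, "gaussian", "slope")`, i.e., 47, if the target values were gaussian rather than uniform.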

application of unweighted forms of regression analysis to cases involving proportional SDas
According to current practices in method comparison studies, it is usual to apply unweighted forms of regression analysis, i.e., OLR and Deming analysis, although the SDas vary with the measurement level, for example corresponding to a proportional relationship (constant CVas). Thus, it is of interest to consider what happens in these situations. This subject has been addressed previously, but some supplementary aspects are considered here (9).

Basically, OLR provides unbiased estimates of slope and intercept if the SDa for x is zero, irrespective of whether the SDa for y is constant or varies with the measurement level. In the same way, the Deming procedure provides unbiased estimates of slope and intercept when the SDas vary, provided that their ratio is constant throughout the measurement range. This aspect is important and means that the estimates of slope and intercept generally are reliable in this frequently occurring situation. However, additional aspects are to be considered: the reliability of the associated statistical analysis, and the efficiency of the unweighted estimation procedures.

Two problems can occur when OLR is applied to real-life examples: the lack of consideration of measurement errors in x, and the variation of the SDa. The first problem is well known and most significant at low range ratios, in which cases a biased slope estimate arises (7)(9)(10). Some authors have recommended that OLR may be applied when the correlation coefficient exceeds 0.975 or 0.99 (18)(19). In these cases, however, the bias of the slope estimate has a consequence in that the type I error for the statistical analysis increases, i.e., the null hypothesis is rejected too frequently. This increase may amount to several fold and may cause the null hypothesis of a slope equal to one to be rejected far more frequently than anticipated from the nominal level of 5%.
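
The slope bias that drives this inflation can be illustrated with a small simulation. The sketch below is not the simulation design used in the study: it assumes a true slope of one, uniformly distributed target values over 3–6, equal constant SDas of 0.09 for both methods, and the textbook closed form for the Deming slope; all names and default settings are hypothetical.

```python
import math
import random

def slopes(x, y):
    """Return (OLR slope, Deming slope assuming equal SDas)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    d = syy - sxx
    b_deming = (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)
    return sxy / sxx, b_deming

def mean_slopes(n_samples=100, n_studies=500, lo=3.0, hi=6.0, sd_a=0.09, seed=1):
    """Average OLR and Deming slopes over simulated studies (true slope = 1)."""
    rng = random.Random(seed)
    total_olr = total_deming = 0.0
    for _ in range(n_studies):
        targets = [rng.uniform(lo, hi) for _ in range(n_samples)]
        x = [t + rng.gauss(0.0, sd_a) for t in targets]  # random error in x
        y = [t + rng.gauss(0.0, sd_a) for t in targets]  # random error in y
        b_olr, b_deming = slopes(x, y)
        total_olr += b_olr
        total_deming += b_deming
    return total_olr / n_studies, total_deming / n_studies
```

Under these assumptions, the expected OLR slope is attenuated by roughly the ratio of the target value variance to the variance of the measured x values (about 0.99 here), whereas the average Deming slope stays close to one. A test of slope = 1 based on OLR therefore rejects more often than the nominal level as the sample size grows.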

The presence of a proportional SDa for the y measurements also independently tends to increase the type I error for the test of slope deviation when the OLR procedure is used because the standard error of the slope is underestimated. The phenomenon is most pronounced for skewed target value distributions, in which cases the type I error may increase three- to fourfold, implying that the type I error becomes 15–20% compared with the nominal level of 5%. For uniform and gaussian target value distributions, the increase is up to 7.5% and 10%, respectively. Finally, the precision of slope and intercept estimations is lower than that provided by WLR in cases involving a proportional SDa for y measurements. For a range ratio of 10, 2.3 times as many observations are required for estimation of the slope with a given precision compared with the WLR procedure. For the intercept, the factor is 3.9.

In unweighted Deming analysis, the associated statistical analysis is only slightly perturbed in cases involving proportional SDas. For uniform and gaussian target value distributions, the type I error generally is unaffected. For the skewed target value distribution, the type I error may be slightly increased at sample sizes <100, but at higher sample sizes, the correct value of 5% is attained. These relationships refer to estimation of the standard error by the computerized jackknife principle as performed here (Appendix).

The major problem associated with application of the unweighted Deming analysis in cases involving proportional SDas is the suboptimality of the unweighted approach (Table 3). For uniform distributions with range ratios from 2 to 100, 1.2 to 3.7 times as many samples are needed to obtain the same precision of the slope estimate by the unweighted compared with the weighted approach. Thus, the larger the range ratio is, the more inefficient the unweighted method is. If one intends to apply the unweighted Deming procedure in a situation with proportional SDas, the sample size values of Table 2 should be multiplied by the adjustment factors that may be derived from Table 3.


Table 3. Comparison of sample sizes providing the same precision of slope and intercept estimates by unweighted and weighted Deming regression analysis in situations with proportional SDas.

For the intercept, 1.2 to 3.9 times as many samples are required for the unweighted Deming procedure compared with the weighted Deming procedure for range ratios from 2 to 10. For higher range ratios, the efficiency of the unweighted procedure drops dramatically; therefore, as many as ~46-fold more samples are required to detect the same intercept deviation. Finally, it should be underlined that the relationships in Table 3 presuppose a true proportional SDa relationship throughout the considered interval. At very large range ratios, such as may be observed for various hormones measured by immunoassay procedures, the SDa value often tends to approach a constant plateau in the lowest part of the measurement range, which means that a true proportional relationship is not present over the entire range. In this case, the difference in efficiencies between unweighted and weighted procedures will be smaller than the difference shown in Table 3.


   Discussion
The selected sample size in a method comparison study usually is based on conventions or protocols expressing general guidelines. From a practical point of view, guidelines are useful and necessary. For example, the NCCLS guideline EP9-A suggests measurement of 40 duplicate samples by each method when a new method is introduced in the laboratory as a substitute for an established one (18). It has also been proposed that a vendor of an analytical test system should have made a comparison study based on at least 100 samples measured in duplicate with each method. The principle of increased requirements for vendors appears reasonable. This initial validation should be comprehensive to disclose the performance of the assay system in detail. Then the requirement for the ordinary user may be more modest.

Although these general guidelines are useful, they may not be sufficient in the context of a detailed method evaluation, i.e., the vendor's or developer's evaluation. Additionally, assessment of reference methods within the hierarchy of definitive/reference methods may also pose special demands (20). When measurements within a hierarchy that extends from a definitive method, through reference methods, to routine methods are compared, the amount of possible bias in each step should be defined as closely as possible. Here statistical power/sample size considerations become of relevance in the context of a rational method comparison design.

To assure that the necessary sample size is being estimated, basic information for the analytical methods, such as measurement range and SDa, preferably throughout the measurement range, should be available, so that it can be decided whether constant or proportional SDas are most likely in the given situation. This information usually is present because reproducibility frequently is evaluated very early for a new method, and the desirable measurement range is rather characteristic for a given analyte. Medically relevant method differences at selected levels should also be considered. As mentioned earlier, various sources may be consulted, and here the focus has been on differences decided by a regulative authority in the form of the CLIA 88 guidelines (6). The next step is to convert the differences of interest to standardized intercept and/or slope deviations as described. Now Tables 1 or 2 may be consulted. Tables 1 and 2 do not cover all situations, but the use of inter- or extrapolation may extend the possibilities. Additionally, one may perform approximate factor adjustments taking into account nonuniform distributions of target values as mentioned earlier. Otherwise, specific calculations using the formulas in the Appendix may be performed, perhaps supplemented with simulations.

The focus has here been on how to detect slope or intercept deviations of given magnitudes and how to perform t-tests separately for slope and intercept deviations. As an alternative, a bivariate approach is possible with an outline of an elliptical joint confidence region for slope and intercept. In this context, it is possible to operate with sets of intercept and slope values that fulfill given critical deviations (21).

In a method comparison study, the choice of an appropriate regression analysis procedure is important. It is advisable to consider whether constant or proportional SDas or other relationships are present. Most frequently, two routine methods are compared. In this situation, analytical errors for both sets of measurements should be recognized. In cases involving equal SDas for x and y, the sample size requirement is doubled compared with the situation without analytical errors for x, which is not unexpected: the amount of random error or uncertainty is simply doubled. Application of Deming regression analysis presupposes that the ratio between the SDas for the two methods is constant. This assumption is fulfilled if the SDas are constant throughout the measurement range, as is assumed in the case involving constant analytical error. If the SDas vary with the measurement level, the assumption may still be fulfilled provided that the same relationship with the measurement level applies for both methods. A common situation is the proportional SDa case discussed here, corresponding to constant CVas for the methods, which often holds to a good approximation. Unless the measurement range is very narrow, the Deming regression procedure is generally robust toward non-constant or misspecified SDa ratios (10)(22)(23). If duplicate sets of measurements for both methods are available, the SDa ratio is estimated conveniently from the actual data set. In cases involving proportional SDas, the simple, unweighted form of the Deming procedure still provides an unbiased slope estimate, but as shown here and previously, the weighted Deming procedure is the most efficient approach (9). For a range ratio of 10, the weighted approach requires less than one-half the number of samples to provide the same precision of the slope estimate as the unweighted Deming procedure. Furthermore, if a computerized resampling procedure such as the jackknife method is used for estimation of SDas, the assumption of normality of analytical error distributions is not necessary (12)(24).

In the planning phase, factors other than statistical power should be taken into account. The representativeness of samples is important. Samples from relevant patient categories should be included to help disclose possible interference phenomena. The calculated sample size requirement may then be regarded as a minimum demand from a statistical point of view, which may be modified to take other aspects into account. Furthermore, it is important that an internal quality-control system is in effect to assure that the methods to be compared are running in the in-control state. Comparisons of measurements preferably should be undertaken over several days, e.g., at least 5 days, to ensure that the method comparison does not become dependent on the performance of the methods in one particular analytical run (18).

Sample size estimations seldom have been considered in the context of method comparison studies based on regression analysis. In a recent study, Hartmann et al. (25) used simulations to evaluate the power of OLR and the Deming procedure for selected data examples. Range ratios of 1.5, 5, and 15 were evaluated in relation to CVas of 2% or 5%. The data examples are not exactly comparable with those of the present study, and the number of simulation runs for each parameter selection is somewhat small. However, for situations with constant SDas (homoscedasticity), the authors found sample size estimates of the same order of magnitude as observed here. The authors considered only the unweighted forms of regression analysis.

Passing and Bablok (26) studied the sample size required to obtain a given power by nonparametric rank regression analysis. For a series of simulation examples, these authors considered the necessary sample sizes for detection of given slope deviations with a power of 80% given a type I error of 5%, assuming proportional SDas. The same pattern as observed here for testing of slope deviations was apparent: the larger the range of values, the smaller the required sample size (other factors being equal). Although the examples used by Passing and Bablok (26) are not exactly comparable to the situations dealt with here, it is apparent that the sample size requirements of the rank procedure exceed those of the weighted Deming procedure, which agrees with the fact that the latter procedure is more efficient than the rank procedure (9).

In recent years, the difference plot described by Bland and Altman (27) has gained increasing popularity as a tool for evaluation of method comparison data (28). Although the difference plot in itself is very instructive for displaying differences, the associated summary statistic in the form of a paired t-test is not appropriate for the analysis of method comparison data because it might be misleading in the presence of a systematic proportional difference (29)(30)(31). The t-test only evaluates whether the mean measurement levels of two methods agree, but not whether the measurements are comparable throughout the measurement range. For example, if measurements in the low range by a new method tend to be higher than those of the established method, and vice versa in the high range, the averages of the measurements by each method agree, and the paired t-test shows no difference. A regression analysis, on the other hand, would clearly disclose the different performances of the methods.
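The situation described in this example can be made concrete with a few lines of code. This is a sketch with hypothetical, noise-free data (nine levels, a new method reading y = 1 + 0.8x); the function names are illustrative and not taken from any cited program.

```python
import math

def paired_t(x, y):
    """Paired t statistic for the mean of the differences y - x."""
    d = [yi - xi for xi, yi in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

def ols_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    u = sum((xi - xm) ** 2 for xi in x)
    p = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    b = p / u
    return b, ym - b * xm

# Hypothetical data: the new method reads high at low levels and low at high levels.
x = [float(v) for v in range(1, 10)]   # established method, levels 1..9
y = [1.0 + 0.8 * xi for xi in x]       # new method

t_stat = paired_t(x, y)
slope, intercept = ols_slope_intercept(x, y)
print(f"paired t = {t_stat:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With these data the mean difference is zero, so the paired t statistic is essentially zero, whereas the regression line recovers the slope of 0.8 and the intercept of 1.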

In conclusion, the planning of a method comparison study to achieve a given power for detection of medically significant differences should be considered carefully. In this way, a method comparison study is likely to be conclusive: Either the null hypothesis of no difference is accepted, or the presence of a relevant difference is established. Otherwise, a statistically nonsignificant slope deviation from unity and/or intercept deviation from zero may either imply that the null hypothesis is true or be an example of a type II error, i.e., an overlooked real difference of medical importance.


   Appendix 1
sample size formula
A general, simplified formula for computation of the approximately necessary sample size for detection of a difference Δ with a given type I error and power (16) is:

$$N = \frac{(t_{p/2} + t_{1-q})^2\, c^2}{\Delta^2} \qquad (1)$$

where c is the proportionality factor determining the standard error of the estimated difference D, which corresponds to the true difference Δ, i.e., the standard error of D is c/√N. t_{p/2} is the t value for the given type I error (significance level). For the usual level of P = 5%, t_{p/2} = 1.96 (asymptotically). t_{1-q} is the t value for the desired power 1 - q, where q is the type II error. For a power of 90%, t_{1-q} is 1.28 (asymptotically).
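The general relationship between c, Δ, and N can be sketched in code. The function below assumes the standard form N = [(t_{p/2} + t_{1-q}) c/Δ]^2 that follows from a standard error of c/√N; the function name and default arguments are illustrative only.

```python
import math

def required_sample_size(c, delta, t_alpha=1.96, t_power=1.28):
    """Approximate N needed to detect a true difference `delta`.

    c       : proportionality factor, i.e. SE(D) = c / sqrt(N)
    delta   : true difference to be detected (same units as c)
    t_alpha : asymptotic two-sided value for the type I error (1.96 for 5%)
    t_power : asymptotic value for the desired power (1.28 for 90%)
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return ((t_alpha + t_power) * c / delta) ** 2

# Example: c = 1 and a difference of 0.5 units.
n = required_sample_size(c=1.0, delta=0.5)
print(math.ceil(n))  # rounds 41.99 up to 42
```

Because N depends on the squared ratio c/Δ, halving the difference to be detected quadruples the approximate sample size.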

regression analysis given constant SDas
This method applies to OLR and Deming regression analysis. For the OLR procedure, the slope, intercept, and their standard errors are estimated (14) as:






Yest_i refers to the estimated Y value for a given x_i according to the regression equation. In cases involving duplicate sets of measurements, each x_i and y_i represent the mean of individual measurements [x_i = (x_1i + x_2i)/2 and y_i = (y_1i + y_2i)/2].
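For orientation, the textbook OLR estimators can be sketched as follows. The code uses the usual least-squares expressions with u denoting the sum of squared deviations of x from its mean; it is an illustration under that assumption, and the function names are hypothetical.

```python
import math

def duplicate_means(first, second):
    """Mean of duplicate measurements, element by element."""
    return [(v1 + v2) / 2.0 for v1, v2 in zip(first, second)]

def olr(x, y):
    """Ordinary least-squares fit of y on x with standard errors.

    Returns (slope, intercept, sd_yx, se_slope, se_intercept).
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least 3 paired observations")
    xm = sum(x) / n
    ym = sum(y) / n
    u = sum((xi - xm) ** 2 for xi in x)
    p = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    b = p / u
    a0 = ym - b * xm
    # Residual dispersion around the line, with N - 2 degrees of freedom.
    ss_res = sum((yi - (a0 + b * xi)) ** 2 for xi, yi in zip(x, y))
    sd_yx = math.sqrt(ss_res / (n - 2))
    se_b = sd_yx / math.sqrt(u)
    se_a0 = sd_yx * math.sqrt(1.0 / n + xm ** 2 / u)
    return b, a0, sd_yx, se_b, se_a0
```

When duplicates are available, `duplicate_means` would be applied to each method first and the resulting means passed to `olr`.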

Given a uniform distribution of x, for any given interval u_st = u/N is nearly constant over a wide range of N values. Furthermore, the expression x_m^2/u_st is scale invariant; it depends only on the ratio between the maximum and minimum values of the interval, i.e., the range ratio. On the basis of this background, we rewrite the standard error expressions as:





where c_a0 and c_b are constants depending only on the range ratio. Under the given model, SDy.x equals the analytical SD of method y (SDay). CVay here is the CV at the middle of the interval in question, i.e., SDay/y_m. We here suppose that there is only a negligible difference between x_m and y_m; therefore, we have that CVay is approximately equal to SDay/x_m. Thus, if the slope deviation from unity and the intercept deviation from zero are expressed in standardized forms, i.e., in CVa and SDa units of method y, respectively, we obtain the following sample size formulas analogous to Eq. 1:

(2)


(3)
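The dependence of the constants on the range ratio can be explored numerically. The sketch below rests on several assumptions that go beyond the text: target values are taken as continuously uniform so that u_st approaches (max - min)^2/12, the constants are taken as c_b = x_m/√u_st and c_a0 = (1 + x_m^2/u_st)^0.5 as implied by the OLR standard errors, and all function names are hypothetical. Tabulated values adjusted by simulation are therefore not expected to be reproduced exactly.

```python
import math

def olr_constants(range_ratio):
    """Large-sample c_b and c_a0 for targets uniform on [1, range_ratio]."""
    if range_ratio <= 1:
        raise ValueError("range_ratio must exceed 1")
    lo, hi = 1.0, float(range_ratio)
    xm = (lo + hi) / 2.0
    u_st = (hi - lo) ** 2 / 12.0   # variance of a continuous uniform distribution
    ratio = xm ** 2 / u_st         # scale invariant: depends only on hi / lo
    c_b = math.sqrt(ratio)         # slope: SE(b) = c_b * CVay / sqrt(N)
    c_a0 = math.sqrt(1.0 + ratio)  # intercept: SE(a0) = c_a0 * SDay / sqrt(N)
    return c_b, c_a0

def sample_size(c, delta_st, t_alpha=1.96, t_power=1.28, deming=False):
    """Approximate N for a standardized deviation delta_st.

    With deming=True the OLR value is doubled, which applies to the
    standard situation of equal SDas for x and y and a slope near unity.
    """
    n = ((t_alpha + t_power) * c / delta_st) ** 2
    return 2.0 * n if deming else n

for r in (2, 5, 10):
    c_b, c_a0 = olr_constants(r)
    n_slope = sample_size(c_b, delta_st=1.0, deming=True)
    print(f"range ratio {r}: c_b = {c_b:.2f}, c_a0 = {c_a0:.2f}, "
          f"N (slope, Deming) = {math.ceil(n_slope)}")
```

The loop shows the pattern discussed in the text: the wider the range ratio, the smaller c_b and hence the smaller the sample size needed to detect a given standardized slope deviation.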

The Deming regression line is estimated as:




SDas are estimated from duplicate sets of measurements as:


For the Deming procedure, general formulas for standard errors of slope and intercept are complicated (32), and in practice, they are most easily estimated using a computerized resampling principle such as the jackknife method (12)(15). However, for a standard situation with a slope close to unity and equal SDas for methods x and y, i.e., SDax = SDay = SDa and CVa = SDa/xm, we have the following relationships:


This relationship has been verified by simulations. Thus, under the present standardized conditions, Eqs. 2 and 3, developed for the OLR method, also apply to the Deming procedure with the modification that the sample size values are multiplied by two. On this basis, Table 1 has been constructed such that the stated sample size values apply to Deming regression analysis. For the OLR procedure, the sample sizes are divided by two. At parameter combinations corresponding to small to moderate sample sizes (<100–150), the computations were supplemented with simulations (1000 runs for each parameter combination), and sample sizes were adjusted. Accordingly, there is not an exact inverse relationship between the squared standardized Δ value and the sample size.
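The unweighted Deming estimate, the estimation of SDa from duplicates, and jackknife standard errors can be sketched as follows. The code uses the commonly stated closed-form Deming slope with λ = SDax^2/SDay^2 and a leave-one-out jackknife based on pseudo-values; it is not the CBstat implementation, and all function names are hypothetical. Inputs are assumed to be Python lists.

```python
import math

def sd_from_duplicates(first, second):
    """Analytical SD from duplicates: sqrt(sum of squared differences / (2N))."""
    n = len(first)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)) / (2.0 * n))

def deming(x, y, lam=1.0):
    """Unweighted Deming fit; lam = SDax^2 / SDay^2. Returns (slope, intercept)."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    u = sum((xi - xm) ** 2 for xi in x)
    q = sum((yi - ym) ** 2 for yi in y)
    p = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    if p == 0:
        raise ValueError("x and y are uncorrelated; the slope is undefined")
    root = math.sqrt((u - lam * q) ** 2 + 4.0 * lam * p ** 2)
    b = ((lam * q - u) + root) / (2.0 * lam * p)
    return b, ym - b * xm

def jackknife_se(x, y, lam=1.0):
    """Leave-one-out jackknife standard errors of slope and intercept."""
    n = len(x)
    b_all, a_all = deming(x, y, lam)
    pseudo_b, pseudo_a = [], []
    for i in range(n):
        b_i, a_i = deming(x[:i] + x[i + 1:], y[:i] + y[i + 1:], lam)
        pseudo_b.append(n * b_all - (n - 1) * b_i)
        pseudo_a.append(n * a_all - (n - 1) * a_i)

    def se(values):
        m = sum(values) / n
        return math.sqrt(sum((v - m) ** 2 for v in values) / (n * (n - 1)))

    return se(pseudo_b), se(pseudo_a)
```

In use, λ would be taken as the squared ratio of the two values returned by `sd_from_duplicates` for methods x and y, and the fit would be applied to the duplicate means.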

regression analysis given proportional SDas
This situation applies to WLR and weighted Deming regression analysis.

For WLR (9)(13), we have:





The weights (w_i) are inversely proportional to the squared SDa of y measurements at a given level. The SDa is assumed to be a function [h(·)] of x:

where k is a proportionality factor. For the proportional SDa case, which is assumed to hold true in the present context, w_i = 1/x_i^2. The proportionality factor is estimated from the dispersion around the line:

The standard errors of slope and intercept are:

For situations with an intercept close to zero and a slope close to one, the factor k is approximately equal to the CVa for y measurements, CVay. Given a uniform x distribution, it turns out that u_wst = u_w/N is scale invariant and is nearly constant for a wide range of N. Thus, we have approximately:



where c_b = 1/√u_wst and c_a0 = x_mw [1/(x_mw^2 Σ_st w_i) + 1/u_wst]^0.5, with Σ_st w_i = (1/N) Σ w_i. c_b is scale invariant. For c_a0, the expression in brackets is scale invariant, whereas x_mw is proportional to the scale for a given range ratio in the same way as the ordinary mean x_m. This proportionality is taken into account in the expression for Δα_0st as displayed below, where for convenience x_m enters the formula instead of x_mw and c_a0 is adjusted accordingly. The sample size formulas are:

(4)


(5)
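The WLR computations for the proportional SDa case can be sketched as follows. The code uses the usual weighted least-squares expressions with w_i = 1/x_i^2 and standard errors of the form k/√u_w for the slope and k[1/Σw_i + x_mw^2/u_w]^0.5 for the intercept, which agree with the constants c_b and c_a0 defined above; the function name is hypothetical.

```python
import math

def wlr(x, y):
    """Weighted least squares with w_i = 1 / x_i^2 (proportional SDa for y).

    Returns (slope, intercept, k, se_slope, se_intercept).
    """
    if any(xi <= 0 for xi in x):
        raise ValueError("weights 1/x^2 require positive x values")
    n = len(x)
    w = [1.0 / xi ** 2 for xi in x]
    sw = sum(w)
    xmw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ymw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    uw = sum(wi * (xi - xmw) ** 2 for wi, xi in zip(w, x))
    pw = sum(wi * (xi - xmw) * (yi - ymw) for wi, xi, yi in zip(w, x, y))
    b = pw / uw
    a0 = ymw - b * xmw
    # Proportionality factor k from the weighted dispersion around the line.
    ss = sum(wi * (yi - (a0 + b * xi)) ** 2 for wi, xi, yi in zip(w, x, y))
    k = math.sqrt(ss / (n - 2))
    se_b = k / math.sqrt(uw)
    se_a0 = k * math.sqrt(1.0 / sw + xmw ** 2 / uw)
    return b, a0, k, se_b, se_a0
```

With an intercept near zero and a slope near one, the returned k can be read as an estimate of CVay.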

For the weighted Deming procedure, the slope and intercept are estimated as:



It is supposed here that the ratio between the squared SDas is constant (λ = SDax^2/SDay^2) throughout the measurement range. The SDas are functions of the target values:

Under the assumption of proportional SDas and a slope close to one and an intercept close to zero, λ is approximately equal to CVax^2/CVay^2. In weighted Deming regression analysis, the weights w_i, which equal 1/[(Xest_i + λYest_i)/(1 + λ)]^2, are obtained by an iterative principle as described using the CBstat program (12)(15). The weights are expressed here as a weighted mean of Xest and Yest, which is slightly more optimal than the simple mean described earlier (9)(12).
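The iterative principle can be sketched as follows. This is an illustration only and not the CBstat algorithm: the starting weights (computed from the observed x and y), the convergence criterion, and the expressions used for Xest and Yest (the usual Deming projections of each point onto the current line) are assumptions, and the function names are hypothetical.

```python
import math

def _fit(x, y, w, lam):
    """One weighted Deming fit for fixed weights; returns (slope, intercept)."""
    sw = sum(w)
    xmw = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ymw = sum(wi * yi for wi, yi in zip(w, y)) / sw
    u = sum(wi * (xi - xmw) ** 2 for wi, xi in zip(w, x))
    q = sum(wi * (yi - ymw) ** 2 for wi, yi in zip(w, y))
    p = sum(wi * (xi - xmw) * (yi - ymw) for wi, xi, yi in zip(w, x, y))
    root = math.sqrt((u - lam * q) ** 2 + 4.0 * lam * p ** 2)
    b = ((lam * q - u) + root) / (2.0 * lam * p)
    return b, ymw - b * xmw

def _weights(x_level, y_level, lam):
    """w_i = 1 / [(X_i + lam * Y_i) / (1 + lam)]^2 for positive levels."""
    levels = [(xl + lam * yl) / (1.0 + lam) for xl, yl in zip(x_level, y_level)]
    if any(v <= 0 for v in levels):
        raise ValueError("the weighted approach presupposes positive values")
    return [1.0 / v ** 2 for v in levels]

def weighted_deming(x, y, lam=1.0, tol=1e-10, max_iter=100):
    """Iteratively reweighted Deming fit; lam = SDax^2 / SDay^2."""
    w = _weights(x, y, lam)                 # assumed starting weights
    b, a0 = _fit(x, y, w, lam)
    for _ in range(max_iter):
        denom = 1.0 + lam * b ** 2
        d = [yi - (a0 + b * xi) for xi, yi in zip(x, y)]
        x_est = [xi + lam * b * di / denom for xi, di in zip(x, d)]
        y_est = [yi - di / denom for yi, di in zip(y, d)]
        w = _weights(x_est, y_est, lam)     # weights from estimated target values
        b_new, a0_new = _fit(x, y, w, lam)
        converged = abs(b_new - b) < tol and abs(a0_new - a0) < tol
        b, a0 = b_new, a0_new
        if converged:
            break
    return b, a0
```

Standard errors for this fit could be obtained by wrapping `weighted_deming` in a leave-one-out jackknife loop.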

For the weighted Deming procedure, general formulas for the standard errors of the slope and intercept are complicated, and again the jackknife method may be applied (12)(15). As above, however, for the simplified situation considered here with a slope close to unity and equal CVas for methods x and y, i.e., CVax = CVay = CVa, we have the following relationships:


Table 2 has been constructed primarily from these formulas, with adjustments based on simulation studies on the weighted Deming procedure. The sample sizes for WLR are approximately one-half of the sample sizes listed. The weighted approach presupposes positive sample values.


   Footnotes
 
1 Nonstandard abbreviations: SDa, analytical standard deviation; OLR, ordinary least-squares regression analysis; WLR, weighted least-squares regression analysis; D, estimated difference; Δ, true difference; and CVa, analytical coefficient of variation.


   References

  1. Skendzel LP, Barnett RN, Platt R. Medically useful criteria for analytic performance of laboratory tests. Am J Clin Pathol 1985;83:200-205.
  2. Fraser CG, Hyltoft Petersen P. Desirable standards for laboratory tests if they are to fulfill medical needs. Clin Chem 1993;39:1447-1455.
  3. Linnet K. Choosing quality control systems to detect maximum medically allowable analytical errors. Clin Chem 1989;35:284-288.
  4. Laboratory Standardization Panel of the National Cholesterol Education Program. Current status of blood cholesterol measurements in clinical laboratories in the United States: a report from the Laboratory Standardization Panel of the National Cholesterol Education Program. Clin Chem 1988;34:193-201.
  5. Stein EA, Myers GL. National Cholesterol Education Program recommendations for triglyceride measurement: executive summary. Clin Chem 1995;41:1421-1426.
  6. US Department of Health and Human Services. Medicare, Medicaid, and CLIA programs: regulations implementing the Clinical Laboratory Improvement Amendments of 1988 (CLIA). Final rule. Fed Regist 1992;57:7002-7186.
  7. Cornbleet PJ, Gochman N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin Chem 1979;25:432-438.
  8. Parvin CA. A direct comparison of two slope-estimation techniques used in method-comparison studies. Clin Chem 1984;30:751-754.
  9. Linnet K. Evaluation of regression procedures for methods comparison studies. Clin Chem 1993;39:424-432.
  10. Linnet K. Performance of Deming regression analysis in case of a misspecified analytical error ratio. Clin Chem 1998;44:1024-1031.
  11. Ross JW, Lawson NS. Analytical goals, concentration relationships, and the state of the art for clinical laboratory precision. Arch Pathol Lab Med 1995;119:495-513.
  12. Linnet K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat Med 1990;9:1463-1473.
  13. Hald A. Statistical theory with engineering applications. New York: Wiley, 1952:551-557.
  14. Snedecor GW, Cochran WG. Statistical methods, 6th ed. Ames, IA: Iowa State University Press, 1967:139-167.
  15. Linnet K. CBstat: a program for statistical analysis in clinical biochemistry. Reference manual. Risskov, Denmark: K Linnet, 1998:1-53.
  16. Snedecor GW, Cochran WG. Statistical methods, 6th ed. Ames, IA: Iowa State University Press, 1967:113.
  17. American Diabetes Association. Screening for type 2 diabetes. Diabetes Care 1998;21(Suppl 1):S20-S22.
  18. National Committee for Clinical Laboratory Standards. Method comparison and bias estimation using patient samples. Approved guideline. NCCLS document EP9-A. Villanova, PA: NCCLS, 1995:1-36.
  19. Wakkers PJM, Hellendoorn HBA, Op De Weegh GJ, Heerspink W. Applications of statistics in clinical chemistry. A critical evaluation of regression lines. Clin Chim Acta 1975;64:173-184.
  20. Tietz NW. A model for a comprehensive measurement system in clinical chemistry. Clin Chem 1979;25:833-839.
  21. Gerbet D, Auget J-L, Maccario J, Cazalet C, Raichvarg D, Ekindjian OG, Yonger J. New statistical approach in biochemical method-comparison studies by using Westlake's procedure, and its application to continuous-flow, centrifugal analysis, and multilayer film analysis techniques. Clin Chem 1983;29:1131-1136.
  22. Beech DG. Some notes on the precision of the gradient of an estimated straight line. Appl Stat 1961;10:14-31.
  23. Lakshminarayanan MY, Gunst RF. Estimation of parameters in linear structural relationships: sensitivity to the choice of the ratio of error variances. Biometrika 1984;71:569-573.
  24. Wu CFJ. Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Ann Stat 1986;14:1261-1295.
  25. Hartmann C, Smeyers-Verbeke J, Penninckx W, Massart DL. Detection of bias in method comparison by regression analysis. Anal Chim Acta 1997;338:19-40.
  26. Passing H, Bablok W. Comparison of several regression procedures for method comparison studies and determination of sample sizes. J Clin Chem Clin Biochem 1984;22:431-445.
  27. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-310.
  28. Petersen PH, Stöckl D, Blaabjerg O, Pedersen B, Birkemose E, Thienpont L, et al. Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots. Clin Chem 1997;43:2039-2046.
  29. Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method comparison studies. Clin Chem 1973;19:49-57.
  30. Westgard JO, deVos D, Hunt MR, Quam EF, Carey RN, Garber CC. Concepts and practices in the evaluation of clinical chemistry methods. Part III. Statistics. Am J Med Technol 1978;44:552-570.
  31. Lin LIK. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989;45:255-268.
  32. Kendall MG, Stuart A. The advanced theory of statistics, Vol. 2. London: Charles Griffin, 1973:393-397.


