Reviews |
1 Departments of Biostatistics and Epidemiology and the Division of Radiology-Wb4, The Cleveland Clinic Foundation, Cleveland, OH. 2 Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX.
aAddress correspondence to this author at: Department of Biostatistics, The Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195-5196. Fax 216-444-3466; e-mail nobuchow{at}bio.ri.ccf.org.
| Abstract |
|---|
Methods: We reviewed every original work using ROC curves and published in Clinical Chemistry in 2001 or 2002. For each article we recorded phase of the research, prospective or retrospective design, sample size, presence/absence of confidence intervals (CIs), nature of the statistical analysis, and major analysis problems.
Results: Of 58 articles, 31% were phase I (exploratory), 50% were phase II (challenge), and 19% were phase III (advanced) studies. The studies increased in sample size from phase I to III and showed a progression in the use of prospective designs. Most phase I studies were powered to assess diagnostic tests with ROC areas ≥0.70. Thirty-eight percent of studies failed to include CIs for diagnostic test accuracy or the CIs were constructed inappropriately. Thirty-three percent of studies provided insufficient analysis for comparing diagnostic tests. Other problems included dichotomization of the gold standard scale and inappropriate analysis of the equivalence of two diagnostic tests.
Conclusion: We identify available software and make some suggestions for sample size determination, testing for equivalence in diagnostic accuracy, and alternatives to a dichotomous classification of a continuous-scale gold standard. More methodologic research is needed in areas specific to clinical chemistry.
| Introduction |
|---|
Not surprisingly, ROC curves are used quite often by clinical chemists. We reviewed every original work that used ROC curves as part of the statistical analysis and was published in Clinical Chemistry in 2001 or 2002. We found 58 such articles. A review of these papers allowed us to observe how the accuracy of clinical laboratory diagnostic tests is assessed, compared, and reported in the literature, and to identify common problems with the use of ROC curves and offer some possible solutions.
| Methods |
|---|
We defined the phase of the research as did Zhou et al. (2). Phase I is the "exploratory phase", in which a new test is first evaluated in a clinical setting to determine whether the test has any ability to discriminate diseased from nondiseased patients (or, sometimes, to discriminate between two groups of diseased patients; e.g., iron deficiency anemia vs anemia of chronic disease or liver cirrhosis vs chronic hepatitis). ROC curves are often used in these studies to test whether the AUCROC exceeds 0.5. If not, then no further assessment of the diagnostic test is warranted. Phase II in the clinical assessment of the accuracy of a diagnostic test is called the "challenge phase", in which the accuracy of one or more tests is estimated for difficult cases to determine for whom the test may fail and to perhaps identify ways to improve it before further assessment. Finally, phase III is the "advanced phase", in which the accuracy of one or more tests is estimated and compared for a well-defined and generalizeable clinical population. From these studies we can estimate how the test will perform in clinical practice; in contrast, phase I studies tend to overestimate accuracy (because easy-to-diagnose patients are often selected for the study sample), and phase II studies tend to underestimate accuracy (because difficult cases are often selected for the study sample).
We classified a study as retrospective if the patients in the study sample were selected for the study based on their known disease status. A prospective design, in contrast, is one in which the patients are recruited based on their signs and symptoms, before determining their true disease status.
| Results |
|---|
Phase II studies in which ROC curves were used were the most common type of accuracy study in Clinical Chemistry, with 29 such studies (50%), including 18 (62%) retrospective in design and 11 (38%) prospective in design. These studies were larger than the phase I studies, with a median of 88 diseased patients (range, 18–442) and 99 nondiseased patients (range, 8–730).
We found 11 phase III studies in which ROC curves were used, with 64% being prospective. The median size of these phase III studies was 140 diseased patients (range, 15–721) and 174 nondiseased patients (range, 38–1366). Thus, the diagnostic accuracy studies in Clinical Chemistry showed an expected increase in sample size from phase I to III and a progression to more prospective designs.
common problems
The problems we identified with the use of ROC curves for comparing or reporting diagnostic accuracy in the above studies are summarized in Table 2. Most of the articles we reviewed (79%) included a comparison of two or more tests to determine which test(s) had superior diagnostic accuracy. In 40% of such articles, we found no statistical analysis of the comparison between tests. The authors of some articles reported the AUCROC of the tests and the CIs for the areas but gave no statistical test for determining whether the ROC areas differed. In other studies, a statistical comparison of the accuracies of two tests was carried out, but the diagnostic tests had been performed on the same patients (paired sample design) and this pairing was not taken into account in the statistical analysis.
We found five reports in which the true disease status of the patient (i.e., the findings of the gold standard test) was not a simple binary outcome (e.g., celiac disease present or absent), but rather the true disease status was represented by a quantitative measurement. Some examples of gold standard tests that yield a continuous-scale outcome are inulin clearance to measure glomerular filtration rate and SPECT to measure left ventricular ejection fraction (LVEF). To perform a traditional ROC analysis, the true disease status of patients must be dichotomous. One approach is to choose a single cutpoint, for example, LVEF <40%, such that patients with values less than the cutpoint are considered diseased, whereas others (who may have an LVEF only one percentage point higher) are considered nondiseased. Another approach is to choose two cutpoints. Patients with values less than the lower cutpoint or greater than the higher cutpoint are compared in the study, whereas patients with values in the gray zone (i.e., between the two cutpoints) are excluded from the study. An example is a study of indicators of iron status used to discriminate patients with iron deficiency anemia from those with anemia of chronic disease (3). The gold standard is ferritin concentrations measured by the ferrozine method. Concentrations <20 µg/L are considered iron deficiency anemia, and concentrations >240 µg/L (for women) and >375 µg/L (for men) are considered anemia of chronic disease; patients with concentrations in the middle are excluded. The choice of cutpoints for dichotomizing the gold standard is critical, but the choice is often quite arbitrary. We show in a later section how the choice of cutpoints for the gold standard affects the AUCROC for the diagnostic test and how the two-cutpoint approach can overestimate accuracy.
Most checklists for studies reporting the diagnostic accuracy of medical tests (4)(5)(6)(7) have noted the importance of reporting CIs for measures of test accuracy. Among the 58 articles we reviewed, 39 (67%) included CIs. Sometimes, however, the CIs were constructed inappropriately.
In one report, diagnostic accuracy was estimated from more than one observation from the same patient, so-called "clustered" data. An example of clustered data is measurements from the two ears of the same patient. Data from the same patient are inherently correlated to some degree. Often the correlation is small, but even a small amount of intracluster correlation will lead to incorrect P values (usually inappropriately small P values).
Two articles attempted to show that a new test was at least as accurate as (and perhaps more accurate than) an existing test, so-called "noninferiority" studies. In only one article, however, was a statistical test used that was appropriate for testing a hypothesis of equivalence. In the other, the authors concluded that the accuracies of the tests were equivalent because the difference between the ROC areas was not statistically significant at the 0.05 level. This approach is not valid for assessing equivalence or noninferiority because the risk is high for incorrectly concluding equivalence, particularly when the sample size is small.
Finally, from our review it appeared that some studies were underpowered (i.e., the sample size was too small). In a later section we offer some direction in determining the appropriate sample size for different types of studies.
example of typical study with appropriate roc analysis
Before addressing some possible solutions to these problems, we want to cite an example of a typical study seen in Clinical Chemistry with an appropriate ROC analysis. Martinez et al. (8) performed a phase II prospective study comparing three tests, namely total prostate-specific antigen (PSA), PSA complexed to α1-antichymotrypsin (PSA-α1-ACT), and the ratio of PSA-α1-ACT to total PSA (PSA-α1-ACT:PSA), for the differential diagnosis of prostate cancer and benign prostatic hyperplasia. They recruited consecutive patients who had been referred for prostatic evaluation. A total of 146 patients met the eligibility criterion of a total PSA between 10 and 30 µg/L. All patients underwent biopsy. The authors estimated the AUCROC for each diagnostic test and compared the areas using nonparametric methods for paired data (because the three diagnostic tests were performed on all study patients) through MedCalc (9). The authors reported 95% CIs for the ROC areas and cited the observed sensitivity and specificity at various cutoff points. The authors found that the PSA-α1-ACT:PSA ratio had superior accuracy (i.e., statistically significant at the 0.05 level) for patients with total PSA values in both the ranges 10–20 µg/L and 20–30 µg/L. They concluded that additional prospective studies with large numbers of patients were needed to confirm their findings.
| Possible Solutions to Common Problems with ROC Analysis |
|---|
Zhou et al. (2) discuss several approaches to CI construction. We present here a parametric approach to constructing CIs for sensitivity at a particular FPR [or analogously, for constructing CIs for specificity at a particular false-negative rate (FNR)]. This method assumes that the test results, or a transformation of them, follow a binormal distribution (that is, one normal distribution for test results of patients with disease and another normal distribution for test results of patients without disease). We use the following notation to describe the binormal distribution:
$$a = \frac{\mu_1 - \mu_0}{\sigma_1}, \qquad b = \frac{\sigma_0}{\sigma_1}$$
where $\mu_1$ and $\mu_0$ are the means and $\sigma_1$ and $\sigma_0$ are the standard deviations of the normal distributions for patients with and without disease, respectively.
The sensitivity at a particular FPR = e can be estimated from
$$\widehat{Se}(\mathrm{FPR} = e) = 1 - \Phi(\hat{b} Z_e - \hat{a})$$
where $\Phi$ is the cumulative normal distribution function, and $Z_e$ is such that $\Phi(Z_e) = 1 - e$. For example, if we are interested in the sensitivity at a FPR of 10% (i.e., e = 0.10), then $Z_e$ = 1.28. If we obtain estimates of $\hat{a}$ = 0.8 and $\hat{b}$ = 1.2, then
$$\widehat{Se}(\mathrm{FPR} = 0.10) = 1 - \Phi(1.2 \times 1.28 - 0.8) = 0.23.$$
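To get the variance of $\widehat{Se}(\mathrm{FPR} = e)$ and its CI, we transform $\widehat{Se}(\mathrm{FPR} = e)$ to $Z_{Se}$ as follows:
$$Z_{Se} = \hat{b} Z_e - \hat{a}$$
$$\widehat{\mathrm{Var}}(Z_{Se}) = \widehat{\mathrm{Var}}(\hat{a}) + Z_e^2\,\widehat{\mathrm{Var}}(\hat{b}) - 2 Z_e\,\widehat{\mathrm{Cov}}(\hat{a}, \hat{b})$$
Estimates of $\mathrm{Var}(\hat{a})$, $\mathrm{Var}(\hat{b})$, and their covariance are available from programs such as ROCKIT (10).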
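Assuming asymptotic normality, the 100(1 − α)% CI for the transformed sensitivity corresponding to a particular FPR of e is:
$$Z_{Se} \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(Z_{Se})}$$
where $z_{\alpha/2}$ is the upper α/2 percentile of the standard normal distribution. The lower and upper "transformed" confidence limits, LL and UL, for sensitivity can then be calculated by:
$$LL = 1 - \Phi\left(Z_{Se} + z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(Z_{Se})}\right)$$
$$UL = 1 - \Phi\left(Z_{Se} - z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(Z_{Se})}\right)$$
For the example above, $Z_{Se}$ = 0.736. If we obtain estimates of $\mathrm{Var}(\hat{a})$ = 0.0214 and $\mathrm{Var}(\hat{b})$ = 0.0091, with $\mathrm{Cov}(\hat{a}, \hat{b})$ = 0.0068, then:
$$\widehat{\mathrm{Var}}(Z_{Se}) = 0.0214 + 1.28^2 \times 0.0091 - 2 \times 1.28 \times 0.0068 = 0.0189$$
$$0.736 \pm 1.96 \sqrt{0.0189} = (0.467,\ 1.005)$$
$$LL = 1 - \Phi(1.005) = 0.157$$
$$UL = 1 - \Phi(0.467) = 0.320$$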
Note that there is software for constructing a CI for sensitivity at a fixed FPR (or specificity at a fixed FNR), as well as for testing for a difference between two diagnostic tests at a fixed FPR (or FNR) (10).
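The calculation above can also be sketched in a few lines of Python. This is only an illustrative sketch under the binormal assumption, not a replacement for dedicated ROC software: the function name `sensitivity_ci` is hypothetical, the variance of the transformed sensitivity is assumed to be the usual large-sample variance of a linear combination of the two binormal parameter estimates, and the inputs are the example estimates quoted in the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf() plays the role of Phi, inv_cdf() of its inverse


def sensitivity_ci(a_hat, b_hat, var_a, var_b, cov_ab, fpr, conf_level=0.95):
    """Sensitivity at a fixed FPR and its CI under the binormal model.

    Hypothetical helper: a_hat and b_hat are the binormal parameter estimates;
    var_a, var_b and cov_ab are their estimated variances and covariance.
    Returns (estimate, lower limit, upper limit).
    """
    z_e = STD_NORMAL.inv_cdf(1.0 - fpr)        # Phi(z_e) = 1 - e
    z_se = b_hat * z_e - a_hat                 # transformed sensitivity
    # Assumed variance of the linear combination b_hat * z_e - a_hat
    var_z = var_a + z_e ** 2 * var_b - 2.0 * z_e * cov_ab
    z_crit = STD_NORMAL.inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    half_width = z_crit * sqrt(var_z)
    estimate = 1.0 - STD_NORMAL.cdf(z_se)
    # Back-transforming reverses the order of the limits
    lower = 1.0 - STD_NORMAL.cdf(z_se + half_width)
    upper = 1.0 - STD_NORMAL.cdf(z_se - half_width)
    return estimate, lower, upper


estimate, lower, upper = sensitivity_ci(
    a_hat=0.8, b_hat=1.2, var_a=0.0214, var_b=0.0091, cov_ab=0.0068, fpr=0.10
)
print(f"sensitivity = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the code uses unrounded normal quantiles rather than 1.28 and 1.96, its results can differ from a hand calculation in the third decimal place.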
software for comparing the accuracy of diagnostic tests and estimating sample size
Specialized software is available (9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22) for a variety of ROC analyses, including estimating and comparing ROC curves and their areas (AUCs) for paired (same patients) or unpaired (different patients) designs, from clustered data or one observation/patient data, using parametric or nonparametric methods. There is also software for estimating and comparing the partial areas under ROC curves (i.e., ROC area in the specified FPR range of e1 to e2) and for estimating sample size or power. See the recent review by Stephan et al. (23) for a comparison of some of the available ROC software.
assessing noninferiority of a new test with that of a standard test
When developing a new diagnostic test, there are situations in which the new test needs to have accuracy as good as, but not necessarily better than, the accuracy of an existing, or standard, test for the new test to replace the standard test. For example, the new test might be safer, easier, or quicker to perform, or it may be less expensive than the standard test.
For these types of studies, we want to make sure that the accuracy of the new test is not inferior to the accuracy of the standard test before replacing the standard test. We need to test the hypotheses:
$$H_0: \theta_S - \theta_N \ge \Delta_M$$
$$H_A: \theta_S - \theta_N < \Delta_M$$
where $\theta_S$ is the accuracy of the standard test, $\theta_N$ is the accuracy of the new test, and $\Delta_M$ is the smallest difference in accuracy that is unacceptable. Note that $\Delta_M$ should be specified in the planning phase of the study (i.e., not after examining the data; this can lead to bias).
Consider an example. Suppose a standard test has an AUC of 0.90. A new, quicker, and less expensive test has been developed; we determine that the new test must have an AUC of 0.85 or greater to replace the standard test. Then, $\Delta_M$ = 0.06.
An appropriate statistical test for assessing noninferiority is:
$$z = \frac{\hat{\theta}_S - \hat{\theta}_N - \Delta_M}{\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_S - \hat{\theta}_N)}}$$
where $\hat{\theta}_S$ and $\hat{\theta}_N$ are the estimates of accuracy for the standard and new test, respectively, and $\widehat{\mathrm{var}}(\hat{\theta}_S - \hat{\theta}_N)$ is an estimate of the variance of the difference between accuracies of the two tests. Any measure of accuracy can be used (e.g., AUC or partial area under the ROC curve), and software (9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22) is available for estimating the accuracies of the tests and the variance of their difference. For a type I error rate of 5%, we would reject the null hypothesis if z is less than −1.645; otherwise, we do not reject the null hypothesis and thus have insufficient evidence to replace the standard test.
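As a rough illustration of this decision rule, the following Python sketch computes the noninferiority statistic and applies the one-sided 5% criterion. The function names and all numeric inputs (AUC estimates of 0.90 and 0.89 and the two variances) are hypothetical values chosen for illustration; only the margin of 0.06 comes from the example above. In practice the variance of the difference would come from ROC software that accounts for the study design.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_z(theta_s_hat, theta_n_hat, var_diff, delta_m):
    """z statistic for H0: theta_S - theta_N >= delta_M (hypothetical helper)."""
    return (theta_s_hat - theta_n_hat - delta_m) / sqrt(var_diff)


def is_noninferior(z, alpha=0.05):
    """Reject H0 (conclude noninferiority) when z falls below -z_alpha."""
    return z < -NormalDist().inv_cdf(1.0 - alpha)


# Hypothetical larger study: SE of the difference = 0.02
z_large = noninferiority_z(0.90, 0.89, var_diff=0.0004, delta_m=0.06)

# Hypothetical smaller study with the same estimates: SE of the difference = 0.05
z_small = noninferiority_z(0.90, 0.89, var_diff=0.0025, delta_m=0.06)

print(f"large study: z = {z_large:.2f}, noninferior: {is_noninferior(z_large)}")
print(f"small study: z = {z_small:.2f}, noninferior: {is_noninferior(z_small)}")
```

In both hypothetical studies the observed difference of 0.01 would be far from statistically significant in a conventional test of difference, yet only the larger study provides enough evidence to conclude noninferiority. This is the reason a nonsignificant difference cannot be read as equivalence.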
roc analysis when the gold standard does not yield a dichotomous outcome
In this section we show through a simulation study the effects on the AUC of dichotomizing the results of a gold standard when that gold standard does not produce binary results (i.e., disease is present or absent) but rather the results are on a continuous scale. We then offer a new method for estimating the AUC that does not require dichotomizing the gold standard.
In a simulation study, we first investigated how the AUC is affected when a single cutpoint for separating diseased and nondiseased patients is varied. We generated 1000 samples of 1000 observations each from a bivariate normal distribution (i.e., we used a large sample size so that differences could be attributed to systematic bias and not to random variability). One of the random variables from this bivariate distribution was designated as the gold standard, and the other random variable was considered the diagnostic test. We set the correlation between these two variables equal to 0.75. We then varied the cutpoint of the first random variable, such that observations with values above the cutpoint were considered "diseased" and observations below the cutpoint were considered "nondiseased". As shown in Table 3, the nonparametric estimates (24) of the ROC area varied as the cutpoint varied; the differences were small but were statistically significant. In other words, although the underlying relationship between the gold standard and diagnostic test did not change, the estimates of the AUC did change because of the unnatural dichotomization of the outcomes.
Now consider the effect of using two cutpoints to identify two populations (one with values below the first cutpoint and the other with values above the second cutpoint), omitting from the analysis the patients with values between the two cutpoints. We used simulated data as described previously. As shown at the bottom of Table 3, the estimated AUC increased greatly when the cases in the gray zone were omitted, and the wider the gray zone, the greater the increase in the ROC area. Clearly, dichotomizing a continuous-scale gold standard introduces bias into the estimates of the accuracy of a diagnostic test.
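The flavor of this simulation can be reproduced with a short Python sketch. It is a simplified illustration, not the simulation reported in Table 3: it draws a single sample of 1000 observations rather than 1000 samples, and the seed, the cutpoints (0, 1, and a gray zone from −0.5 to 0.5), and the function names are all arbitrary choices made for illustration.

```python
import random
from math import sqrt


def auc_mann_whitney(diseased, nondiseased):
    """Nonparametric AUC: share of (diseased, nondiseased) pairs ordered correctly."""
    wins = 0.0
    for d in diseased:
        for c in nondiseased:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))


def simulate(n, rho, rng):
    """Draw n (gold standard, test) pairs from a standard bivariate normal."""
    pairs = []
    for _ in range(n):
        gold = rng.gauss(0.0, 1.0)
        test = rho * gold + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((gold, test))
    return pairs


def auc_for_cutpoints(pairs, lower, upper):
    """Diseased: gold > upper. Nondiseased: gold <= lower. Gray zone is dropped."""
    diseased = [t for g, t in pairs if g > upper]
    nondiseased = [t for g, t in pairs if g <= lower]
    return auc_mann_whitney(diseased, nondiseased)


rng = random.Random(2004)                  # arbitrary seed for reproducibility
pairs = simulate(1000, rho=0.75, rng=rng)  # correlation as in the text

auc_single = auc_for_cutpoints(pairs, 0.0, 0.0)    # one cutpoint at the mean
auc_shifted = auc_for_cutpoints(pairs, 1.0, 1.0)   # one cutpoint, moved upward
auc_gray = auc_for_cutpoints(pairs, -0.5, 0.5)     # two cutpoints, gray zone omitted

print(f"single cutpoint at 0:  AUC = {auc_single:.3f}")
print(f"single cutpoint at 1:  AUC = {auc_shifted:.3f}")
print(f"gray zone (-0.5, 0.5): AUC = {auc_gray:.3f}")
```

The relationship between the two simulated variables is identical in all three analyses; only the dichotomization differs, so any difference among the printed areas is attributable to the choice of cutpoints plus sampling variability in this single sample.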
A nonparametric estimate of accuracy, analogous to the nonparametric ROC area estimate, can be constructed without any cutpoints. The formula is given in Eq. 1. In words, we compare the test scores of each patient with the test scores of all other patients, assigning a weight of 1 if the test result of the patient with a higher (more serious) outcome value exceeds the test result of the patient with the lower (less serious) outcome value. If two patients have the same test result or the same gold standard outcome, a weight of 0.5 is assigned. Otherwise, a weight of 0 is given. The sum of these weights divided by the number of pairs is the nonparametric estimate of accuracy. The interpretation is similar to that for the ROC area; it is the probability that of two randomly chosen patients, the patient with the higher (more serious) outcome also has the higher (more suspicious) test result:
$$\hat{\theta} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} \Psi(x_{it}, x_{js}) \qquad (1)$$
where $i \ne j$, n is the total number of patients in the study sample, $x_{it}$ is the test result of the ith patient with gold standard outcome t, $x_{js}$ is the test result of the jth patient with gold standard outcome s, and:
$\Psi$ = 1 if t > s and $x_{it} > x_{js}$, or s > t and $x_{js} > x_{it}$
$\Psi$ = 0.5 if t = s or $x_{js} = x_{it}$
$\Psi$ = 0 otherwise.
Note that the definition of $\Psi$ can be modified if the scales of the gold standard and diagnostic test are inversely related.
This type of accuracy estimate has been used previously to assess the prognostic ability of models for predicting survival time (25). A CI for this accuracy measure or a CI for the difference between two estimates of accuracy can be obtained by bootstrapping (26).
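The weighting rule described in words above translates directly into code. The sketch below is a minimal illustration; the function name `continuous_gold_accuracy` and the small data sets are hypothetical. It loops over each unordered pair once, which gives the same value as summing over all ordered pairs because the weight does not depend on the order of the two patients.

```python
def continuous_gold_accuracy(gold, test):
    """Cutpoint-free accuracy estimate for a continuous-scale gold standard.

    gold[i] is the gold standard outcome and test[i] the diagnostic test
    result of patient i; higher values of both are assumed to be more serious.
    """
    n = len(gold)
    if n != len(test) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 patients")
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if gold[i] == gold[j] or test[i] == test[j]:
                total += 0.5   # tie in the gold standard or in the test result
            elif (gold[i] > gold[j]) == (test[i] > test[j]):
                total += 1.0   # patient with higher outcome has higher test result
            # discordant pairs contribute a weight of 0
    return total / (n * (n - 1) / 2)


# Hypothetical data
perfect = continuous_gold_accuracy([1, 2, 3], [10, 20, 30])
reversed_order = continuous_gold_accuracy([1, 2, 3], [30, 20, 10])
with_tie = continuous_gold_accuracy([1, 1, 2], [5, 6, 7])
print(perfect, reversed_order, round(with_tie, 3))
```

A percentile bootstrap CI could be built around this function by resampling patients with replacement and recomputing the estimate for each resample.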
computing the required sample size for phase i, ii, and iii studies
In this section we discuss some possible approaches to sample size calculation for studies of diagnostic test accuracy. We discuss sample size according to the phase of the study because the goals of the phases differ, and thus the sample size requirements differ. The methods for sample size calculation that we present here are based on large-sample theory for normally distributed data (27); the reader should be mindful of this when using these formulae.
In phase I studies, the usual goal is to determine whether the diagnostic test has any ability to discriminate diseased patients from controls. A useful hypothesis on which to base sample size estimation is whether the AUC exceeds 0.5. The null hypothesis is that the AUC equals 0.5, vs the alternative hypothesis that the AUC is >0.5 (one-sided test):
$$H_0: \theta = 0.5$$
$$H_A: \theta > 0.5$$
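A formula for computing sample size to test these hypotheses is:
$$n_D = \frac{\left[z_\alpha \sqrt{0.0792 \times (1 + 1/\kappa)} + z_\beta \sqrt{V(\hat{\theta})}\right]^2}{(\theta - 0.5)^2} \qquad (2)$$
where $V(\hat{\theta})$ is the variance function of $\hat{\theta}$, given by:
$$V(\hat{\theta}) = 0.0099 \times e^{-A^2/2} \times \left[(5A^2 + 8) + \frac{A^2 + 8}{\kappa}\right] \qquad (3)$$
where $\mathrm{Var}(\hat{\theta})$ is equal to $V(\hat{\theta})/n_D$; $A = \Phi^{-1}(\theta) \times 1.414$; $\Phi^{-1}$ is the inverse of the cumulative normal distribution function; $\kappa$ is the ratio of the number of control patients ($n_C$) to the number of diseased patients ($n_D$) in the study sample (i.e., $\kappa = n_C/n_D$); $\theta$ is the conjectured area under the ROC curve (under the alternative hypothesis); $z_\alpha$ is the upper αth percentile of the standard normal distribution, where α is the type I error rate (usually α = 0.05); and $z_\beta$ is the upper βth percentile of the standard normal distribution, where β is the type II error rate (often β = 0.10 or 0.20). The term $0.0792 \times (1 + 1/\kappa)$ in Eq. 2 is the variance function of Eq. 3 evaluated under the null hypothesis ($\theta$ = 0.5, so that A = 0).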
In Table 4 we have computed the sample size for a range of values for $\kappa$ (e.g., $\kappa$ = 1 means equal numbers of patients with and without the disease in the study sample; $\kappa$ = 0.5 means twice as many diseased patients as control patients in the study sample; and $\kappa$ = 2.0 means twice as many control patients as diseased patients in the study sample) and for a range of values for the conjectured AUC. The type I error rate has been set at 0.05; the type II error rate is ≤0.10 (power ≥0.90).
For example, for a balanced design (i.e., equal numbers of patients with and without disease), if the accuracy of the diagnostic test is expected to be fair (e.g., 0.70), then 33 control patients and 33 diseased patients (total of 66 patients) are needed. From Table 1 it appears that the phase I studies in Clinical Chemistry are, in general, reasonably powered for assessing diagnostic tests with AUC ≥0.70.
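A phase I sample size of this kind can be sketched in Python as follows. The variance function coded here is an assumed form (a widely used binormal approximation written in terms of A and the control-to-diseased ratio), and the function names are hypothetical; the sketch is included because, under these assumptions, it reproduces the 33 patients per group quoted above for a conjectured AUC of 0.70.

```python
from math import ceil, exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def variance_function(theta, kappa=1.0):
    """Assumed binormal-based variance function for an estimated AUC of theta.

    kappa is the ratio of control patients to diseased patients.
    """
    a = STD_NORMAL.inv_cdf(theta) * 1.414
    return 0.0099 * exp(-a * a / 2.0) * ((5.0 * a * a + 8.0) + (a * a + 8.0) / kappa)


def phase1_sample_size(theta, kappa=1.0, alpha=0.05, beta=0.10):
    """Patients needed to test AUC = 0.5 vs AUC > 0.5 (one-sided).

    Returns (number of diseased patients, number of control patients).
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)
    v_null = variance_function(0.5, kappa)    # variance function when AUC = 0.5
    v_alt = variance_function(theta, kappa)   # variance function at the conjectured AUC
    n_d = (z_alpha * sqrt(v_null) + z_beta * sqrt(v_alt)) ** 2 / (theta - 0.5) ** 2
    n_diseased = ceil(n_d)
    n_control = ceil(kappa * n_diseased)
    return n_diseased, n_control


n_diseased, n_control = phase1_sample_size(theta=0.70, kappa=1.0)
print(n_diseased, n_control)
```

Both counts are rounded up, so the achieved power is at least the nominal value under the stated assumptions.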
In phase II studies, we often compare the accuracies of two or more diagnostic tests. The study sample usually represents patients difficult to diagnose; the study patients might have early or atypical disease and/or other conditions that might interfere with the test, and the controls will likely have other conditions that might mimic the disease of interest. A common measure of accuracy for comparing tests at this phase is again the AUC, but other measures of accuracy, such as the partial area under the ROC curve, are also applicable, and sample size determination is similar.
We consider the null hypothesis, that the accuracies of two tests are equal, vs the alternative hypothesis, that the accuracies differ (two-sided test). (Later, we consider sample size determination for noninferiority studies.) A formula for computing sample size to test these hypotheses is given in Eq. 4:
$$n_D = \frac{\left[z_{\alpha/2} \sqrt{V_0(\hat{\theta}_1 - \hat{\theta}_2)} + z_\beta \sqrt{V_A(\hat{\theta}_1 - \hat{\theta}_2)}\right]^2}{(\theta_1 - \theta_2)^2} \qquad (4)$$
where $z_\alpha$ and $z_\beta$ are the same as in Eq. 2 (for the two-sided test, $z_\alpha$ is replaced by $z_{\alpha/2}$); $V_0$ and $V_A$ denote the variance functions under the null and alternative hypotheses, respectively; $\theta_1$ and $\theta_2$ denote the conjectured accuracies of diagnostic test 1 and test 2, respectively; and $V(\hat{\theta}_1 - \hat{\theta}_2) = V(\hat{\theta}_1) + V(\hat{\theta}_2) - 2C(\hat{\theta}_1, \hat{\theta}_2)$. For paired designs (i.e., the same study participants undergo both diagnostic tests), the results from the two tests will be correlated, i.e., $C(\hat{\theta}_1, \hat{\theta}_2)$, the covariance function, will be nonzero (usually taking on a positive value); $n_D$ in Eq. 4 is then the number of patients with disease needed for the study. For unpaired designs, $C(\hat{\theta}_1, \hat{\theta}_2)$ is zero, and $n_D$ is the number of patients with disease needed for each diagnostic test. There is useful software for sample size determination for comparing two AUC (10)(11)(21) or two partial areas (11).
We now consider testing noninferiority of a new test to a standard test. The sample size calculation is similar to that in Eq. 4, but we need to take the value of $\Delta_M$ into account. The formula for sample size determination for testing noninferiority is given in Eq. 5 (28).
$$n_D = \frac{(z_\alpha + z_\beta)^2 \, V_A(\hat{\theta}_S - \hat{\theta}_N)}{(\theta_S - \theta_N - \Delta_M)^2} \qquad (5)$$
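To give a feel for the magnitude of noninferiority sample sizes, the following Python sketch works through a hypothetical unpaired design. It rests on several assumptions that should be checked against the cited reference before use: the sample size is taken to be $(z_\alpha + z_\beta)^2 V_A / (\theta_S - \theta_N - \Delta_M)^2$, which follows from requiring the one-sided noninferiority z test to have power 1 − β; the variance function is an assumed binormal approximation; and the conjectured AUCs (both 0.90), the power (0.80), and the function names are illustrative choices. Only the margin of 0.06 is taken from the earlier example.

```python
from math import ceil, exp
from statistics import NormalDist

STD_NORMAL = NormalDist()


def variance_function(theta, kappa=1.0):
    """Assumed binormal-based variance function for an estimated AUC of theta."""
    a = STD_NORMAL.inv_cdf(theta) * 1.414
    return 0.0099 * exp(-a * a / 2.0) * ((5.0 * a * a + 8.0) + (a * a + 8.0) / kappa)


def noninferiority_sample_size(theta_s, theta_n, delta_m, kappa=1.0,
                               alpha=0.05, beta=0.20, covariance=0.0):
    """Diseased patients needed to show noninferiority (hypothetical helper).

    covariance is the covariance function of the two accuracy estimates;
    it is 0 for an unpaired design, in which case the result is per test.
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)
    v_alt = (variance_function(theta_s, kappa)
             + variance_function(theta_n, kappa)
             - 2.0 * covariance)
    margin = theta_s - theta_n - delta_m
    if margin >= 0:
        raise ValueError("conjectured difference must be smaller than delta_m")
    return ceil((z_alpha + z_beta) ** 2 * v_alt / margin ** 2)


# Both tests conjectured to have AUC 0.90, margin 0.06, unpaired, balanced design
n_per_test = noninferiority_sample_size(theta_s=0.90, theta_n=0.90, delta_m=0.06)
print(n_per_test)
```

A positive covariance, as expected in a paired design, reduces the variance of the difference and therefore the required number of patients.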
In phase III studies we usually assess and compare the accuracy of diagnostic tests for a well-defined and generalizeable population. We want to report CIs for test accuracies, and we want these CIs to be narrow so that clinicians using the tests in practice have a good sense of the abilities of the tests and can interpret the results appropriately for their patients. The AUCROC is a global measure of the accuracy of a test, i.e., it is the average sensitivity over all possible values of specificity, or the average specificity over all possible values of sensitivity (1)(29). It is not very useful for a phase III study (2)(30). One alternative to the AUC is to find the optimal point on the ROC curve (based on the prevalence of disease among patients undergoing the test and the costs, e.g., patient morbidity or monetary, of false positives and false negatives) and report the corresponding sensitivity and FPR and their CIs (31)(32)(33)(34)(35). Another approach is to estimate the average sensitivity for the range of FPRs that is useful clinically (e.g., average sensitivity when the FPR is <0.10) or the average specificity for the range of sensitivities useful clinically (e.g., average specificity when sensitivity is >0.90), i.e., the partial area index (36)(37). Determination of the required sample size for phase III studies using these indices of accuracy is often more complex because we need to know the shape of the ROC curve, which can be estimated from previous phase II studies. For further discussion of sample size issues, see Zhou et al. (2) or Obuchowski (38).
| Discussion |
|---|
40 items for inclusion in published papers on diagnostic accuracy. We focused here on how ROC curves are currently being used by clinical chemists and what shortcomings exist in the ROC analyses being performed. Although there were many reports published on the accuracy of new and old diagnostic tests, we reviewed only the 58 articles using ROC curves. These 58 articles represented a spectrum from early, exploratory studies to large studies of mature tests. They included a mixture of retrospective and prospective studies, with sample sizes ranging from 30 to >1000 patients.
Some of the problems we saw with the use of ROC curves are common in other disciplines; for example, a lack of reporting of CIs for measures of accuracy is a common problem in diagnostic radiology as well. We were surprised by the large number of articles that failed to properly compare the accuracies of the two diagnostic tests. Identification of available software should resolve this problem. Other problems, in particular the dichotomization of a continuous-scale gold standard outcome, are specific to the kinds of diagnostic tests often evaluated by clinical chemists; further methodologic research is needed in these areas.
| References |
|---|
α1-antichymotrypsin improves the discrimination between prostate cancer and benign prostatic hyperplasia in men with a total PSA of 10 to 30 µg/L. Clin Chem 2002;48:1251-1256.