Letters to the Editor
1 Department of Biostatistics, University of Copenhagen, Blegdamsvej 3, DK-2200 Copenhagen N, Denmark, Fax 45-35-327907
E-mail: j.hilden{at}biostat.ku.dk
To the Editor:
Obuchowski et al. (1) recently reviewed the uses and misuses of ROC curves and explained, in simple terms, some sophisticated solutions to frequent problems, thereby popularizing tools described elsewhere, notably in the book coauthored by Nancy Obuchowski (2). What prompts me to comment on this careful and timely review are two fundamental questions:
What properties should an overall measure of test performance possess? How do we make better tests score better?
ROC curves are indispensable and conceptually straightforward, but selecting a single overall performance measure requires some care. In referring to the "ROC curve and the measures of accuracy derived from it", the introduction to the review acknowledges that there are several choices. However, like many similar texts, it jumps quickly to examination of the area under the ROC curve (AUROC) without pausing to ask whether the AUROC is the right, or the best, measure. This omission leaves parts of the mathematics without the necessary "philosophical" foundation. It may also promote the misconception that an AUROC calculation is the purpose of drawing a ROC curve.
Now, by what criteria should performance measures be selected? The first criterion is that two tests that provide the same information should also score the same. However, if one reorders the test outcomes by, e.g., interchanging "atypical" and "equivocal" on the list from "normal" to "cancer", the geometry, and therefore the area, will change; nevertheless, the clinical import of each test outcome and, hence, the clinical merits of the test remain unchanged. In fact, it is easy to provide an example [see page 31 of Ref. (2)] in which a test classifies all patients correctly and yet has an AUROC of 0.5, suggesting complete lack of discrimination. A logical remedy is to assume the most favorable ordering when calculating the AUROC (3). Briefly, that means ordering the test outcomes by decreasing likelihood ratio, from ominous to reassuring. It means reassembling the segments of the ROC curve to make it concave. However, the review by Obuchowski et al. (1) does not address concavity issues.
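The effect of outcome ordering on the area can be illustrated with a minimal Python sketch. The counts below are hypothetical and are not the example from Ref. (2); the function names `auroc` and `concave_auroc` are likewise illustrative. The sketch assumes a test with ordered outcome categories, computes the trapezoidal area for the ordering as given, and then recomputes it after sorting the categories by decreasing likelihood ratio.

```python
def auroc(diseased, healthy):
    """Trapezoidal area under the ROC curve for categorical outcomes.

    diseased[i] and healthy[i] are counts in category i, with the
    categories listed in the order used to draw the curve (the
    category treated as most suspicious comes first).
    """
    n_d, n_h = sum(diseased), sum(healthy)
    area = 0.0
    tpr = fpr = 0.0
    for d, h in zip(diseased, healthy):
        new_tpr = tpr + d / n_d
        new_fpr = fpr + h / n_h
        # Trapezoid between consecutive ROC points.
        area += (new_fpr - fpr) * (tpr + new_tpr) / 2
        tpr, fpr = new_tpr, new_fpr
    return area


def concave_auroc(diseased, healthy):
    """Area after reordering categories by decreasing likelihood ratio."""
    n_d, n_h = sum(diseased), sum(healthy)

    def likelihood_ratio(pair):
        d, h = pair
        # A category seen only in diseased subjects is maximally ominous.
        return float("inf") if h == 0 else (d / n_d) / (h / n_h)

    pairs = sorted(zip(diseased, healthy), key=likelihood_ratio, reverse=True)
    d_sorted, h_sorted = zip(*pairs)
    return auroc(d_sorted, h_sorted)


# Hypothetical test: diseased subjects read either "high" or "low",
# healthy subjects always read "middle". Categories: high, middle, low.
diseased_counts = [50, 0, 50]
healthy_counts = [0, 100, 0]

naive_area = auroc(diseased_counts, healthy_counts)
best_area = concave_auroc(diseased_counts, healthy_counts)
```

With these counts every subject is classified correctly by the rule "middle means healthy", yet `naive_area` is 0.5, whereas `best_area`, computed on the reassembled concave curve, is 1.0.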
Insistence on concavity is not enough: Even when tests have concave ROC curves, their AUROCs may fail to rank them correctly. My oldest example (3) involves a clinical context where, by virtue of suitable symmetries, two tests are demonstrably equally useful yet have AUROCs as different as 0.85 and 0.90. Although the report giving this example is cited by Obuchowski et al. (1) in other contexts, their review does not address this potential shortcoming of the AUROC.
Another compelling criterion is this: If a test can be emulated by superimposing pure noise, such as independent measurement error, on another test, it cannot be more informative than the latter and should never score higher; the same applies when test readings are binarized or otherwise coarsened.
Once one acknowledges that the AUROC may mislead, which is my main point, it would be gratifying to be able to propose fail-safe ROC summary statistics. Unfortunately, there is a theoretical reason that any alternative statistics will have similar shortcomings: no performance measure based solely on the ROC curve and ignoring the pretest disease probability can rank competing tests without sometimes violating the expected utility principle of decision theory and, hence, clinical rationality (4). Performance measures must take pretest probabilities into account and be based on assessments of utility, i.e., clinical benefit and loss, or at least on a provisional utility model (pseudo-regret function) (3)(4).
Tests are then ranked by the expected pretest-posttest difference in utility. For example, the simple quadratic (Brier) pseudo-regret function is J(r) = 4r(1 − r), where r is a disease (vs no disease) probability; J(r) = 1 when r = 0.5, i.e., when the diagnosis is maximally uncertain, and falls off toward 0 as r approaches 0 or 1, i.e., diagnostic certainty. The expected pseudo-regret drops from J(p), p being the pretest disease probability, to:
[Equation image missing from the source]
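The quadratic pseudo-regret function J(r) = 4r(1 − r) described above can be sketched numerically. The code below is not the closed-form expression from the letter; it simply evaluates the expected posttest pseudo-regret from first principles, applying Bayes' rule to a hypothetical binary test. The function names and the sensitivity and specificity values are assumptions made for illustration.

```python
def pseudo_regret(r):
    """Quadratic (Brier) pseudo-regret J(r) = 4r(1 - r)."""
    return 4.0 * r * (1.0 - r)


def expected_posttest_regret(p, sensitivity, specificity):
    """Expected J(posttest probability) for a binary test.

    p is the pretest disease probability. The expectation runs over
    the two test outcomes (positive, negative), each weighted by its
    marginal probability.
    """
    outcomes = (
        (sensitivity, 1.0 - specificity),  # P(positive | D), P(positive | no D)
        (1.0 - sensitivity, specificity),  # P(negative | D), P(negative | no D)
    )
    total = 0.0
    for p_given_d, p_given_h in outcomes:
        p_outcome = p * p_given_d + (1.0 - p) * p_given_h
        if p_outcome > 0.0:
            posttest = p * p_given_d / p_outcome  # Bayes' rule
            total += p_outcome * pseudo_regret(posttest)
    return total


# Hypothetical test with sensitivity 0.9 and specificity 0.9,
# applied at a pretest probability of 0.5.
pretest = pseudo_regret(0.5)
posttest = expected_posttest_regret(0.5, 0.9, 0.9)
gain = pretest - posttest
```

Because the posttest probabilities depend on p, the same test yields a different `gain` at a different pretest probability, which is the dependence the argument above insists on.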
References
R1 Department of Biostatistics, The Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195
To the Editor:
I agree with Dr. Hilden that the area under the ROC curve has several limitations as a measure of a test's inherent diagnostic accuracy. Dr. Hilden lists some of these limitations in his letter; another important one is its global interpretation, which does not translate well into clinical practice (1). The partial area under the ROC curve in the clinically relevant range of false-positive rates, or false-negative rates, is a useful alternative measure of inherent diagnostic accuracy.
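As a rough illustration of the partial area, the following Python sketch integrates a piecewise-linear empirical ROC curve over a chosen range of false-positive rates. The ROC points and the range 0 to 0.2 are hypothetical, the function name `partial_auc` is illustrative, and the trapezoidal rule is only one of several possible estimators.

```python
def partial_auc(fpr, tpr, lo, hi):
    """Area under a piecewise-linear ROC curve for lo <= FPR <= hi.

    fpr and tpr list the ROC points in order of non-decreasing
    false-positive rate, starting at (0, 0) and ending at (1, 1).
    """
    points = list(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Clip the segment to the requested false-positive range.
        a, b = max(x0, lo), min(x1, hi)
        if b <= a:
            continue  # segment lies outside the range or is vertical
        slope = (y1 - y0) / (x1 - x0)
        y_a = y0 + slope * (a - x0)
        y_b = y0 + slope * (b - x0)
        area += (b - a) * (y_a + y_b) / 2
    return area


# Hypothetical empirical ROC points.
fpr_points = [0.0, 0.1, 0.4, 1.0]
tpr_points = [0.0, 0.6, 0.9, 1.0]

full_area = partial_auc(fpr_points, tpr_points, 0.0, 1.0)
low_fpr_area = partial_auc(fpr_points, tpr_points, 0.0, 0.2)
```

Here `full_area` is the ordinary trapezoidal AUROC, and `low_fpr_area` restricts attention to false-positive rates of at most 0.2; its maximum possible value is the width of the range, 0.2, rather than 1.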
Dr. Hilden advocates performance measures that incorporate the pretest probability of disease and the clinical benefit and loss of correct and incorrect diagnoses. These measures are extremely useful, but they are not measures of inherent test accuracy.
Our recent review (2) focused on problems with the analysis of ROC curves, without critiquing issues of study design. I would encourage investigators to carefully consider the objectives of their study and use the most relevant measure of diagnostic test accuracy.
References