Clinical Chemistry
Clinical Chemistry 51: 471-472, 2005; 10.1373/clinchem.2004.041376
(Clinical Chemistry. 2005;51:471-472.)
© 2005 American Association for Clinical Chemistry, Inc.


Letters to the Editor

What Properties Should an Overall Measure of Test Performance Possess?

Jørgen Hilden

Department of Biostatistics, University of Copenhagen, Blegdamsvej 3, DK-2200 Copenhagen N, Denmark, Fax 45-35-327907

E-mail j.hilden{at}biostat.ku.dk


To the Editor:

Obuchowski et al. (1) recently reviewed the uses and misuses of ROC curves and explained, in simple terms, some sophisticated solutions to frequent problems, thereby popularizing tools described elsewhere, notably in the book coauthored by Nancy Obuchowski (2). What prompts me to comment on this careful and timely review are two fundamental questions:

What properties should an overall measure of test performance possess? How do we make better tests score better?

ROC curves are indispensable and conceptually straightforward, but selecting a single overall performance measure requires some care. In referring to the "ROC curve and the measures of accuracy derived from it", the introduction to the review acknowledges that there are several choices. However, like many similar texts, it jumps quickly to examination of the area under the ROC curve (AUROC) without pausing to ask whether the AUROC is the right, or the best, measure. This omission leaves parts of the mathematics without the necessary "philosophical" foundation. It may also promote the misconception that an AUROC calculation is the purpose of drawing a ROC curve.

Now, by what criteria should performance measures be selected? The first criterion is that two tests that provide the same information should also score the same. However, if one reorders the test outcomes by, e.g., interchanging "atypical" and "equivocal" on the list from "normal" to "cancer", the geometry, and therefore the area, will change; nevertheless, the clinical import of each test outcome and, hence, the clinical merits of the test remain unchanged. In fact, it is easy to provide an example [see page 31 of Ref. (2)] in which a test classifies all patients correctly and yet has an AUROC of 0.5, suggesting complete lack of discrimination. A logical remedy is to assume the most favorable ordering when calculating the AUROC (3). Briefly, that means ordering the test outcomes by decreasing likelihood ratio, from ominous to reassuring. It means reassembling the segments of the ROC curve to make it concave. However, the review by Obuchowski et al. (1) does not address concavity issues.
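The ordering effect can be sketched with a small, made-up three-outcome test (not necessarily the example given in Ref. (2)): diseased patients always read "low" or "high", nondiseased patients always read "mid", so the test classifies everyone correctly. The outcome labels, probabilities, and function names below are illustrative assumptions.

```python
def auroc(dis, non, order):
    """Trapezoidal area under the ROC curve of a categorical test.

    dis[x], non[x]: P(outcome x | disease), P(outcome x | no disease).
    order: outcomes listed from most to least suspicious.
    """
    area = 0.0
    tpr = fpr = 0.0
    for x in order:
        new_tpr, new_fpr = tpr + dis[x], fpr + non[x]
        # Area of the trapezoid under this ROC segment.
        area += (new_fpr - fpr) * (tpr + new_tpr) / 2
        tpr, fpr = new_tpr, new_fpr
    return area


def lr_order(dis, non):
    """Outcomes sorted by decreasing likelihood ratio dis[x] / non[x]."""
    def lr(x):
        # An outcome never seen without disease gets an infinite ratio.
        return float("inf") if non[x] == 0 else dis[x] / non[x]
    return sorted(dis, key=lr, reverse=True)


# Hypothetical test: perfectly informative, but not monotone in its readings.
dis = {"low": 0.5, "mid": 0.0, "high": 0.5}
non = {"low": 0.0, "mid": 1.0, "high": 0.0}

naive = auroc(dis, non, ["high", "mid", "low"])  # readings taken at face value
best = auroc(dis, non, lr_order(dis, non))       # most favorable ordering

print(naive, best)
```

Taking the readings at face value gives an area of 0.5, whereas ordering by likelihood ratio gives an area of 1, although the information carried by each outcome is identical in both cases.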

Insistence on concavity is not enough: Even when tests have concave ROC curves, their AUROCs may fail to rank them correctly. My oldest example (3) involves a clinical context where, by virtue of suitable symmetries, two tests are demonstrably equally useful yet have AUROCs as different as 0.85 and 0.90. Although the report giving this example is cited by Obuchowski et al. (1) in other contexts, their review does not address this potential shortcoming of the AUROC.

Another compelling criterion is this: If a test can be emulated by superimposing pure noise, such as independent measurement error, on another test, it cannot be more informative than the latter and should never score higher; the same applies when test readings are binarized or otherwise coarsened.

Once one acknowledges that the AUROC may mislead—which is my main point—it would be gratifying to be able to propose fail-safe ROC summary statistics. Unfortunately, there is a theoretical reason that any alternative statistics will have similar shortcomings: no performance measure based solely on the ROC curve and ignoring the pretest disease probability can rank competing tests without sometimes violating the expected utility principle of decision theory and, hence, clinical rationality (4). Performance measures must take pretest probabilities into account and be based on assessments of utility, i.e., clinical benefit and loss, or at least on a provisional utility model (pseudo-regret function) (3)(4).

Tests are then ranked by the expected pretest–posttest difference in utility. For example, the simple quadratic (Brier) pseudo-regret function is J(r) = 4r(1 – r), where r is a disease (vs no disease) probability; J(r) = 1 when r = 0.5, i.e., when the diagnosis is maximally uncertain, and falls off toward 0 as r approaches 0 or 1, i.e., diagnostic certainty. The expected pseudo-regret drops from J(p), p being the pretest disease probability, to:

$$\sum_x m(x)\, J(r(x))$$

where m(x) is the marginal probability of test result x, and r(x) is the conditional disease probability given x; m(x) and r(x) clearly depend on p as well as on data contained in the ROC curve. These pretest–posttest differences, also known as measures of value of information (VOI) (5)(6), do satisfy the basic criteria above, and their increasing popularity seems well deserved.
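A minimal sketch of this calculation follows, assuming the expected posttest pseudo-regret is the m(x)-weighted average of J(r(x)), with m(x) and r(x) obtained from Bayes' rule. The two binary tests, their sensitivities and specificities, and all function names are hypothetical, chosen only to illustrate the dependence on p.

```python
def brier_regret(r):
    """Quadratic (Brier) pseudo-regret J(r) = 4r(1 - r)."""
    return 4 * r * (1 - r)


def expected_gain(p, dis, non):
    """Pretest minus expected posttest pseudo-regret.

    p      : pretest disease probability
    dis[x] : P(result x | disease)
    non[x] : P(result x | no disease)
    """
    posttest = 0.0
    for x in dis:
        m = p * dis[x] + (1 - p) * non[x]  # marginal probability m(x)
        if m > 0:                           # skip results that never occur
            r = p * dis[x] / m              # posttest probability r(x)
            posttest += m * brier_regret(r)
    return brier_regret(p) - posttest


# Hypothetical binary tests that are mirror images of each other:
# one highly specific ("rule-in"), one highly sensitive ("rule-out").
rule_in = ({"pos": 0.60, "neg": 0.40}, {"pos": 0.01, "neg": 0.99})
rule_out = ({"pos": 0.99, "neg": 0.01}, {"pos": 0.40, "neg": 0.60})

for p in (0.1, 0.5, 0.9):
    print(p,
          round(expected_gain(p, *rule_in), 3),
          round(expected_gain(p, *rule_out), 3))
```

With these made-up numbers the two single-point ROC curves enclose the same trapezoidal area, yet the rule-in test shows the larger gain at p = 0.1 and the rule-out test the larger gain at p = 0.9; at p = 0.5 the gains coincide.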


References

  1. Obuchowski NA, Lieber ML, Wians FH, Jr. ROC curves in Clinical Chemistry: uses, misuses, and possible solutions [Review]. Clin Chem 2004;50:1118-1125.
  2. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley & Sons Interscience, 2002:437pp.
  3. Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991;11:95-101.
  4. Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med 2000;19:431-440.
  5. Yokota F, Thompson KM. Value of information literature analysis: a review of applications in health risk management. Med Decis Making 2004;24:287-298.
  6. Ades AE, Lu G, Claxton K. Expected value of sample information calculations in medical decision modelling. Med Decis Making 2004;24:207-227.

Dr. Obuchowski responds:

Nancy Obuchowski

Department of Biostatistics, The Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195


To the Editor:

I agree with Dr. Hilden that the area under the ROC curve has several limitations as a measure of a test’s inherent diagnostic accuracy. Dr. Hilden lists some of these limitations in his letter; another important one is its global interpretation, which does not translate well into clinical practice (1). The partial area under the ROC curve in the clinically relevant range of false-positive rates, or false-negative rates, is a useful alternative measure of inherent diagnostic accuracy.
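As a rough illustration of the partial-area idea, the sketch below computes the trapezoidal area under an empirical ROC curve restricted to false-positive rates between 0 and a chosen upper limit. The function name, the linear interpolation at the cutoff, and the ROC points are assumptions made for the example, not a method taken from the review.

```python
def partial_auc(points, fpr_max):
    """Trapezoidal area under the ROC curve for FPR in [0, fpr_max].

    points: (fpr, tpr) pairs sorted by increasing fpr,
            starting at (0, 0) and ending at (1, 1).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Clip the segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical empirical ROC curve.
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]

full_area = partial_auc(roc, 1.0)     # ordinary AUROC
low_fpr_area = partial_auc(roc, 0.2)  # area for FPR up to 0.2

print(full_area, low_fpr_area)
```

The partial area is bounded above by fpr_max rather than by 1, so values for different cutoffs are not directly comparable without some rescaling.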

Dr. Hilden advocates performance measures that incorporate the pretest probability of disease and the clinical benefit and loss of correct and incorrect diagnoses. These measures are extremely useful, but they are not measures of inherent test accuracy.

Our recent review (2) focused on problems with the analysis of ROC curves, without critiquing issues of study design. I would encourage investigators to carefully consider the objectives of their study and use the most relevant measure of diagnostic test accuracy.


References

  1. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley & Sons Interscience, 2002:437pp.
  2. Obuchowski NA, Lieber ML, Wians FH, Jr. ROC curves in Clinical Chemistry: uses, misuses, and possible solutions. Clin Chem 2004;50:1118-1125.



