Informatics and Statistics
1 Université Pierre et Marie Curie-Paris; INSERM, UMR-S 707; unité de santé publique, Assistance Publique Hôpitaux de Paris, Hôpital Saint-Antoine, Paris, France; 2 Laboratoire Alphabio, Hôpital Ambroise Paré, Marseille, France; 3 Département de Biostatistiques, CDL Pharma, Marseille, France; 4 Service d'Anatomie Pathologique, Assistance Publique Hôpitaux de Paris, Hôpital Beaujon, Clichy, France; 5 Service de Médecine Interne, Hôpital Pitié-Salpêtrière, Assistance Publique Hôpitaux de Paris, Paris, France.
Address correspondence to this author at: UMR-S 707, Faculté de Médecine Saint Antoine, 27, rue Chaligny, 75571 Paris cedex 12. Fax +33 1 44 73 84 53; e-mail carrat{at}u707.jussieu.fr.
| Abstract |
|---|
Methods: We performed a simulation study to assess the bias in estimating the accuracy measures when the distribution of fibrosis stages in the study sample does not fit the reference distribution in the population to which the indices are applied. We also estimated the type I error of the tests comparing these measures in 2 samples with different distributions of fibrosis stages. We illustrated the practical use of these measures by reanalyzing real data.
Results: Compared with the AUC or the C-statistic, the Obuchowski measure showed limited bias when the distribution of fibrosis stages in the study sample differed from the reference distribution. The type I error was strongly inflated with the AUC or the C-statistic but was preserved in the Obuchowski measure. When we compared noninvasive indices on real data, AUC analysis led to discordant results depending on how the fibrosis stages were grouped together. One single conclusion was drawn from the analysis based on the Obuchowski measure.
Conclusions: We recommend using the Obuchowski measure for assessing the diagnostic accuracy of noninvasive indices of fibrosis.
| Introduction |
|---|
During the last 10 years, because of the difficulties associated with liver biopsy, there has been a growing interest in noninvasive methods for assessing liver fibrosis (7). The method used to estimate the diagnostic accuracy of noninvasive methods is remarkably constant across the different studies and is based on the area under the ROC curve (AUC) (7)(9).
In this context, the AUC represents the probability that a noninvasive index will correctly rank 2 randomly chosen patients, 1 with a liver biopsy considered "diseased" and the other with a liver biopsy considered "normal" (10). The diagnostic accuracies of 2 noninvasive indices are compared using the 2 AUCs and an appropriate statistical test (11)(12).
The use of the AUC raises 2 methodological issues. First, its use is based on the assumption that the gold standard is binary, whereas fibrosis staging uses an ordinal scale. This difference implies that fibrosis stages in the study sample have to be aggregated into 2 groups, a process that can lead to discordant conclusions, depending on how the groups are aggregated. The C-statistic, which was introduced to estimate diagnostic accuracy for outcomes with more than 2 categories (13), can overcome this limitation, but has never been used to estimate the diagnostic accuracy of noninvasive indices.
Analysis based on the AUC can also be biased by the way in which the proportion of each stage of fibrosis in the sample fits the distribution in the reference population to which the indices are applied. As a result, the comparison of different AUCs based on samples with different stage distributions may be flawed (14). A recent report advocated standardizing the AUC for the distribution of fibrosis stages to deal with this source of variability, but the method is not straightforward and has not yet been validated from a statistical standpoint (15)(16).
To overcome these 2 methodological issues, Obuchowski recently proposed a measure that can be interpreted similarly to the AUC and can be used in situations in which the gold standard is not binary (17)(18)(19).
The aim of this study was to compare the AUC, C-statistic, and Obuchowski measures to assess the diagnostic accuracy of noninvasive fibrosis indices.
| Materials and Methods |
|---|
AUC
The AUC of noninvasive indices is generally used to differentiate between patients with advanced fibrosis and patients with nonadvanced fibrosis. Several methods for estimating and comparing the AUCs have been described (20)(21). In this study we used a nonparametric estimate of the AUC, equivalent to the Mann–Whitney statistic (10)(11). (More details can be found in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol54/issue8.)
Briefly, the calculation relies on selecting every possible pair of patients, one with advanced fibrosis (AF, i.e., stages F2, F3 and F4 of the METAVIR score) and one with nonadvanced fibrosis (NAF, i.e., stages F0 and F1), and then evaluating if the noninvasive index correctly ranks the two patients. The estimated AUC is the proportion of all pairs in which the patient with advanced fibrosis has the higher value of the noninvasive index. It can be interpreted as the probability that the noninvasive index will correctly rank 2 randomly chosen patients, one with AF and one with NAF.
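The pairwise calculation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code (the study used R): the function name `pairwise_auc` and the index values are hypothetical, and counting a tied pair as half a correct ranking is an assumption borrowed from the usual Mann–Whitney convention.

```python
from itertools import product


def pairwise_auc(lower_values, higher_values):
    """Proportion of (lower, higher) patient pairs ranked correctly by the index.

    A pair is ranked correctly when the patient from the higher category has
    the larger index value. Ties count as 0.5 (assumed Mann-Whitney convention).
    """
    pairs = list(product(lower_values, higher_values))
    if not pairs:
        raise ValueError("both groups must contain at least one patient")
    score = sum(
        1.0 if high > low else 0.5 if high == low else 0.0
        for low, high in pairs
    )
    return score / len(pairs)


# Hypothetical index values, for illustration only.
naf = [0.10, 0.25, 0.40]  # nonadvanced fibrosis (F0, F1)
af = [0.30, 0.55, 0.70]   # advanced fibrosis (F2, F3, F4)
auc = pairwise_auc(naf, af)
print(f"AUC = {auc:.3f}")
```

Enumerating every pair is quadratic in the number of patients, which is acceptable for samples of a few hundred patients such as those considered here.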
C-statistic (13)
The concordance (C)-statistic is an accuracy measure that can be used for ordinal or nominal outcomes. If we assume that there are N categories of the gold standard outcome (in this case, 5 fibrosis stages), calculation of the C-statistic requires selection of every possible pair of patients having different categories of the outcome and evaluation of the proportion of all pairs in which the noninvasive index correctly ranks the 2 patients. The C-statistic has the same interpretation as the AUC, i.e., the probability of correctly ranking 2 randomly chosen patients in 2 different categories. Like the AUC, the C-statistic depends on the distribution of fibrosis stages in the study sample.
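As a hedged sketch of the definition above for an ordinal outcome, the following Python function counts, among all pairs of patients in different categories, the proportion ranked correctly by the index. The name `c_statistic`, the toy data, and the handling of ties (half credit) are assumptions made for illustration.

```python
from itertools import combinations


def c_statistic(stages, values):
    """Proportion of patient pairs from different stages ranked correctly.

    `stages` holds the ordinal gold standard (e.g., 0 to 4 for F0 to F4) and
    `values` the noninvasive index. Tied index values count as 0.5 (assumption).
    """
    concordant = 0.0
    n_pairs = 0
    for (s1, v1), (s2, v2) in combinations(zip(stages, values), 2):
        if s1 == s2:
            continue  # pairs within the same stage are not informative
        n_pairs += 1
        if v1 == v2:
            concordant += 0.5
        elif (v1 < v2) == (s1 < s2):
            concordant += 1.0
    if n_pairs == 0:
        raise ValueError("at least two different stages are required")
    return concordant / n_pairs


# Hypothetical data: 6 patients, stage and index value.
stages = [0, 1, 1, 2, 3, 4]
values = [0.10, 0.30, 0.20, 0.25, 0.60, 0.50]
c = c_statistic(stages, values)
print(f"C-statistic = {c:.3f}")
```

Because every patient pair contributes equally, stages that are frequent in the sample dominate the result, which is why the C-statistic depends on the stage distribution.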
Obuchowski measure (19)
This measure is a multinomial version of the AUC. With N (= 5) categories of the gold standard outcome, and with AUCst denoting the estimate of the AUC of the diagnostic test for differentiating between categories s and t, the Obuchowski measure is a weighted average of the N(N – 1)/2 (= 10) different AUCst corresponding to all the pairwise comparisons between 2 of the N categories. Weighting can be based on the relative proportion of the 5 fibrosis stages in the study sample, or, as in this case, on a reference distribution of fibrosis stages similar to that in the population.
Each pairwise comparison can also be weighted to take into account the distance between fibrosis stages (i.e., the number of units on the ordinal scale). We thus defined a penalty function proportional to the difference in METAVIR units between stages (the penalty function was 0.25 when the difference between stages was 1, 0.5 when the difference was 2, 0.75 when the difference was 3, and 1 when the difference was 4).
With a weighting scheme based on the relative proportion of fibrosis stages in the study sample and no penalty function, the Obuchowski measure is equivalent to the C-statistic. Note also that the AUC can be seen as a particular value of the Obuchowski measure, for which AUCst corresponding to pairwise comparisons of stages s and t belonging to the same aggregated category (i.e., AF or NAF) are not calculated, with a weighting scheme based on the relative proportion of stages in the study sample and no penalty function. In this latter case, if the weighting scheme is based on a reference distribution of stages in the population, an adjusted-to-the-stages distribution AUC (adjAUC) is estimated.
The Obuchowski measure can be interpreted as the probability that the noninvasive index will correctly rank 2 randomly chosen patient samples from different fibrosis stages according to the weighting scheme, with a penalty for misclassifying patients (see above).
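The construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pair weights are taken proportional to the product of the two stage proportions, and the penalty is applied to the misranking probability (1 − AUCst), so that the measure reduces to a plain weighted average of the AUCst when no penalty is used. The function names and the toy index values are hypothetical; the reference proportions are those quoted in the simulation section.

```python
from itertools import combinations, product


def pairwise_auc(lower_values, higher_values):
    """Proportion of correctly ranked pairs; ties count 0.5 (assumption)."""
    pairs = list(product(lower_values, higher_values))
    score = sum(1.0 if h > l else 0.5 if h == l else 0.0 for l, h in pairs)
    return score / len(pairs)


def linear_penalty(s, t, n_stages=5):
    """0.25 per METAVIR unit of difference between stages s and t."""
    return abs(s - t) / (n_stages - 1)


def obuchowski(values_by_stage, proportions, penalty=None):
    """Weighted, optionally penalized summary of all pairwise AUCst.

    Assumed form: 1 - sum over pairs of w_st * penalty_st * (1 - AUCst),
    with w_st proportional to proportions[s] * proportions[t].
    """
    pairs = list(combinations(sorted(values_by_stage), 2))
    raw = {(s, t): proportions[s] * proportions[t] for s, t in pairs}
    total = sum(raw.values())
    loss = 0.0
    for s, t in pairs:
        pen = 1.0 if penalty is None else penalty(s, t)
        auc_st = pairwise_auc(values_by_stage[s], values_by_stage[t])
        loss += raw[(s, t)] / total * pen * (1.0 - auc_st)
    return 1.0 - loss


# Hypothetical index values by METAVIR stage (0 = F0, ..., 4 = F4).
values_by_stage = {0: [0.10], 1: [0.30, 0.20], 2: [0.25], 3: [0.60], 4: [0.50]}
n_total = sum(len(v) for v in values_by_stage.values())
sample_props = {s: len(v) / n_total for s, v in values_by_stage.items()}
reference_props = {0: 0.06, 1: 0.39, 2: 0.28, 3: 0.14, 4: 0.13}

unweighted = obuchowski(values_by_stage, sample_props)  # C-statistic case
penalized = obuchowski(values_by_stage, reference_props, linear_penalty)
print(f"sample weights, no penalty: {unweighted:.3f}")
print(f"reference weights, linear penalty: {penalized:.3f}")
```

With sample proportions and no penalty, the function returns the same value as a direct count of concordant pairs across different stages, in line with the equivalence to the C-statistic noted above.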
Comparison of noninvasive indices
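A general method for comparing 2 or more AUCs derived from the same patient population has been published elsewhere (11). This method has also been extended to the Obuchowski measure (19). Assuming that $\hat{\theta}_1$ and $\hat{\theta}_2$ are the respective measures of diagnostic accuracy (the AUC, the C-statistic, or the Obuchowski measure) of 2 noninvasive indices, the value of the test statistic for assessing the null hypothesis (no difference in accuracy between the 2 indices) is:

$$z = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_1) + \widehat{\mathrm{var}}(\hat{\theta}_2) - 2\,\widehat{\mathrm{cov}}(\hat{\theta}_1, \hat{\theta}_2)}}$$

The estimators of the variances ($\widehat{\mathrm{var}}(\hat{\theta}_1)$ and $\widehat{\mathrm{var}}(\hat{\theta}_2)$) and covariance ($\widehat{\mathrm{cov}}(\hat{\theta}_1, \hat{\theta}_2)$) are described elsewhere (11)(19). All statistical tests were 2-tailed, with a type I error of 5%.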
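A minimal Python sketch of this comparison follows, assuming the usual form of the statistic for 2 correlated estimates: the difference between the 2 accuracy measures divided by the square root of the sum of their variances minus twice their covariance, referred to the standard normal distribution. The function name and all numeric inputs are hypothetical; the variance and covariance estimators themselves (11)(19) are not reproduced here.

```python
import math


def compare_accuracy(theta1, theta2, var1, var2, cov12):
    """Two-sided z test for the difference between 2 correlated accuracy measures.

    Inputs are the 2 estimates, their estimated variances, and their estimated
    covariance, all assumed to have been computed beforehand.
    """
    var_diff = var1 + var2 - 2.0 * cov12
    if var_diff <= 0:
        raise ValueError("the variance of the difference must be positive")
    z = (theta1 - theta2) / math.sqrt(var_diff)
    # Two-sided P value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical estimates, for illustration only.
z, p_value = compare_accuracy(0.80, 0.75, 0.0005, 0.0009, 0.0003)
print(f"z = {z:.2f}, P = {p_value:.3f}")
```

A positive covariance, as expected when both indices are measured in the same patients, shrinks the variance of the difference and makes the paired test more sensitive than a comparison that treats the 2 estimates as independent.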
Data set
The data on noninvasive indices used here come from a previously published report (22), derived from the Fibropaca study. Fibropaca was a French multicenter prospective cross-sectional study involving 519 patients that was performed in hepatogastroenterology units or internal medicine units of 5 centers in the southeast region, known for their expertise in hepatitis C (23). All the patients had chronic hepatitis C virus infection without liver complications. Liver biopsies were analyzed for the fibrosis stage in each center by the local pathologist, using the METAVIR scoring system. On the same day as the biopsy specimens were obtained, biochemical parameters were collected to assess several noninvasive markers in a subgroup of 235 patients.
Our analysis focused on APRI (aspartate aminotransferase–to–platelet ratio index) and Fibrotest (FT). FT is calculated from the patient's age, sex, and 5 biochemical parameters: α2-macroglobulin, haptoglobin, γ-glutamyl transpeptidase (GGT), total bilirubin, and apolipoprotein A1 (24). APRI is the ratio of the aspartate aminotransferase concentration to the platelet count (25).
Simulation
We illustrated the variability and bias that arise when the distribution of fibrosis stages in the study sample differs from the reference distribution in the population. We estimated the nonadjusted (AUC, C-statistic) and adjusted (adjAUC, Obuchowski measure) accuracy measures of FT in 1000 samples of size 235, sampled from the Fibropaca study with different distributions of fibrosis stages, namely a predominance of extreme stages [proportion of stage F0 (PF0) = 30%, PF1 = 10%, PF2 = 10%, PF3 = 20%, PF4 = 30%], and a predominance of intermediate stages (PF0 = 10%, PF1 = 30%, PF2 = 30%, PF3 = 20%, PF4 = 10%). To describe the population, a reference distribution of fibrosis stages in this setting was chosen (PF0 = 6%, PF1 = 39%, PF2 = 28%, PF3 = 14%, PF4 = 13%) (26). The true values of the AUC, the adjAUC, the C-statistic, and the Obuchowski measure were empirically calculated from 1000 samples of size 235 under the reference distribution, and the bias was calculated by averaging the differences between each estimated measure and its corresponding true value. We also calculated the nominal coverage of 95% CIs, i.e., how often the true value was included in the 95% CI.
Finally, we evaluated the type I error of tests comparing the FT in 2 samples with different distributions of stages (extreme vs intermediate), based either on an adjusted accuracy measure (adjAUC or Obuchowski measure) or a nonadjusted accuracy measure (AUC or C-statistic). All calculations were performed using R.
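The resampling idea behind the bias calculation can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' R code: the per-stage pools are synthetic Gaussian values rather than Fibropaca data, stages are drawn multinomially from the target distribution, only the nonadjusted AUC (F0–F1 vs F2–F4) is computed, and 50 replicates are used instead of 1000 to keep the run short. All function names are hypothetical.

```python
import random
from statistics import mean


def pairwise_auc(lower_values, higher_values):
    """Proportion of correctly ranked pairs; ties count 0.5 (assumption)."""
    wins = sum(
        1.0 if h > l else 0.5 if h == l else 0.0
        for l in lower_values
        for h in higher_values
    )
    return wins / (len(lower_values) * len(higher_values))


def draw_sample(pool_by_stage, proportions, n, rng):
    """Draw n (stage, value) pairs; stages follow the target distribution."""
    stages = rng.choices(list(proportions), weights=list(proportions.values()), k=n)
    return [(s, rng.choice(pool_by_stage[s])) for s in stages]


def unadjusted_auc(sample):
    """Nonadjusted AUC for advanced (F2-F4) vs nonadvanced (F0-F1) fibrosis."""
    naf = [v for s, v in sample if s <= 1]
    af = [v for s, v in sample if s >= 2]
    return pairwise_auc(naf, af)


def mean_auc(pool_by_stage, proportions, n=235, replicates=50, seed=0):
    """Average nonadjusted AUC over repeated samples of size n."""
    rng = random.Random(seed)
    return mean(
        unadjusted_auc(draw_sample(pool_by_stage, proportions, n, rng))
        for _ in range(replicates)
    )


# Synthetic index values whose mean rises with the stage (hypothetical data).
pool_rng = random.Random(42)
pool = {s: [pool_rng.gauss(s, 1.0) for _ in range(200)] for s in range(5)}

# Stage distributions quoted in the text.
extreme = {0: 0.30, 1: 0.10, 2: 0.10, 3: 0.20, 4: 0.30}
intermediate = {0: 0.10, 1: 0.30, 2: 0.30, 3: 0.20, 4: 0.10}
reference = {0: 0.06, 1: 0.39, 2: 0.28, 3: 0.14, 4: 0.13}

true_auc = mean_auc(pool, reference)
bias_extreme = mean_auc(pool, extreme) - true_auc
bias_intermediate = mean_auc(pool, intermediate) - true_auc
print(f"bias (extreme) = {bias_extreme:+.3f}")
print(f"bias (intermediate) = {bias_intermediate:+.3f}")
```

In this synthetic setting the nonadjusted AUC is pushed upward when extreme stages predominate, because most pairs then compare patients who are far apart on the ordinal scale.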
Fibropaca analysis
We used the Obuchowski measure to assess the accuracy of FT and APRI for diagnosing the stage of liver fibrosis. We compared these results to those obtained by using AUC analysis, and the AUCst values corresponding to each pairwise comparison were plotted to explain discrepancies and to illustrate the consequences of grouping fibrosis stages.
| Results |
|---|
With the extreme distribution, the nominal coverage of 95% CIs was 28% for the AUC, meaning that only 28% of the 95% CIs contained the true AUC value. The nominal coverage of 95% CIs was 60% for the C-statistic, whereas the adjusted measures had 95% coverage as expected.
Type I error
When comparing the same index (FT) between samples with intermediate vs extreme stage distributions, the type I error was 42% with the AUC and 33% with the C-statistic, while the corresponding values were 6% with adjAUC and 5% with the Obuchowski measure.
Analysis of Fibropaca data
With the METAVIR system, the fibrosis stages were 14% F0, 43% F1, 18% F2, 18% F3, and 7% F4. Fig. 2 shows the distribution of FT and APRI according to the fibrosis stage. The AUCs of FT and APRI for the diagnosis of advanced fibrosis (≥F2) were, respectively, 0.81 (95% CI, 0.76–0.87) and 0.74 (95% CI, 0.67–0.80). The difference was statistically significant (P = 0.02), leading to the conclusion that FT has greater accuracy than APRI. For the diagnosis of cirrhosis (F4), there was no significant difference (P = 0.82) between FT [AUC = 0.82 (95% CI, 0.73–0.92)] and APRI [AUC = 0.83 (95% CI, 0.72–0.95)]. To understand why we found discordant results, we plotted the AUCs of all pairwise comparisons of fibrosis stages (Fig. 3). When calculating the AUC for the diagnosis of significant fibrosis, we averaged 6 pairwise comparisons, namely F0 or F1 with F2, F3, or F4. Among these, FT was more accurate in 5 comparisons. For the diagnosis of cirrhosis, 4 pairwise comparisons were averaged, namely F0, F1, F2, and F3 with F4. APRI was more accurate in 3 comparisons, but the averaged difference was not significant because the F1 vs F4 comparison favored FT and included a larger number of patients. When the analysis used adjAUC, similar conclusions were drawn in all comparisons.
We then reanalyzed the diagnostic accuracy of these 2 noninvasive markers with the measures designed for ordinal gold standards. The C-statistic values of FT and APRI were, respectively, 0.75 (95% CI, 0.71–0.80) and 0.71 (95% CI, 0.66–0.75) (P = 0.053). The Obuchowski measures of FT and APRI were, respectively, 0.80 (95% CI, 0.75–0.84) and 0.75 (95% CI, 0.69–0.81) (P = 0.09). In the Obuchowski measures, 10 pairwise comparisons were averaged, among which FT was more accurate in 6 comparisons. A single conclusion was drawn: in the population, and considering the penalty function, FT would not be more accurate than APRI for predicting the fibrosis stage.
| Discussion |
|---|
Here we present a new measure, initially developed by Obuchowski for nonbinary gold standards, and show how it can be used to evaluate the accuracy of noninvasive indices. This measure summarizes all pairwise comparisons of fibrosis stages defined by liver biopsy, with a weighting scheme and a penalty function.
The Obuchowski measure has several advantages over the AUC. By using a weighting scheme based on a reference distribution, we eliminated the bias related to the distribution of fibrosis stages and corrected the inflated type I error. This bias is the consequence of a spectrum effect, which has been widely discussed in the literature since the introductory paper by Ransohoff and Feinstein (27). By using the Obuchowski measure with the same weighting scheme, results from different studies could easily be compared or combined in a meta-analysis, with the spectrum effect controlled. Moreover, the Obuchowski measure can be used to estimate an adjusted-on-fibrosis-stages distribution AUC, by omitting pairwise comparisons of stages belonging to the same aggregated category and using no penalty function.
Another advantage of the Obuchowski measure is that it does not require aggregation, whereas AUC analyses require the results of liver biopsy to be aggregated into 2 outcomes. Numerous studies have shown different AUCs for the same noninvasive index, owing to different ways of grouping fibrosis stages; this procedure can be interpreted as subgroup analysis. When comparing 2 noninvasive indices, this approach would imply multiple testing of several AUCs, which would require appropriate correction for the type I error. It can also lead to discrepancies in the results, which complicate their interpretation, as seen in our reanalysis of the Fibropaca data. In contrast, the Obuchowski measure allows 2 noninvasive indices to be compared with a single test. However, the study of statistical power is not straightforward and will depend on the weighting scheme and on how the penalty function is parameterized. Most notably, the power will also depend on how homogeneous the difference in AUCst between the 2 indices is across the pairs of categories of the ordinal outcome.
Third, the use of a weighting scheme and a penalty function increases the clinical relevance of the Obuchowski measure. Measures of diagnostic accuracy should ideally reflect real-life conditions. Clearly, the medical consequences of misclassifying an F0 patient as F1 are less serious than if the same patient is misclassified as F4.
The choice of a linear penalty function to quantify the difference between observed and predicted fibrosis stages is open to discussion (28)(29). Other penalty functions might be used, more closely related to the true difference in fibrosis between different stages, or based on the clinical consequences of misclassifying a patient. This question deserves further study.
Here we analyzed the results obtained with the FT and APRI indices in the Fibropaca study, but the Obuchowski measure has far wider potential applications in the more general field of diagnostic tests with nominal or ordinal outcomes. The method has been successfully applied to assess the accuracy of magnetic resonance imaging for diagnosing damage to heart tissue after myocardial infarction (17) and to assess physician accuracy in diagnosing the cause of abdominal pain in children (19), yet the Obuchowski measure is still not widely used. It could also be used to assess the usefulness of ordinal or polytomous regression models for differentiating between more than 2 outcomes. To our knowledge, ordinal or polytomous regressions have never been considered for noninvasive indices and are still rarely used in diagnostic research (30). In a recent comparison of dichotomous and polytomous regression analyses for diagnosing serious bacterial infections (31), 3 outcomes were studied and 3 AUCs (presence of 1 outcome vs absence) were calculated for each estimated model. The Obuchowski measure could have been used instead and would have permitted the comparison of the discriminating performance of these models with a single metric.
To conclude, we recommend that future studies of noninvasive methods for assessing fibrosis use the Obuchowski measure instead of the AUC to assess diagnostic accuracy. For greater clinical relevance, we recommend a weighting scheme based on a fibrosis stage distribution as close as possible to that in the reference population, and a penalty function proportional to the difference between fibrosis stages.
| Acknowledgments |
|---|
Financial Disclosures: None declared.
Acknowledgments: We thank David Young and Anders Boyd for their help in editing the manuscript.