Evidence-based Laboratory Medicine and Test Utilization
1 ACOMED Statistik, Leipzig, Germany.
2 Bayer Vital GmbH, Leverkusen, Germany.
3 Department of Urology, University Hospital Charité, Humboldt University Berlin, Berlin, Germany.
4 Department of Urology, Kantonsspital Aarau, Aarau, Switzerland.
5 Department of Urology, University Hospital Münster, Münster, Germany.
6 Department of Urology, University of Duisburg-Essen, Essen, Germany.
7 Department of Urology, Städtisches Klinikum Braunschweig, Braunschweig, Germany.
a Address correspondence to this author at: Department of Urology, University Hospital Charité, Humboldt University Berlin, Schumannstrasse 20/21, D-10098 Berlin, Germany. Fax 49-30-450-515904; e-mail klaus.jung{at}charite.de.
| Abstract |
|---|
Methods: The DAC method is based on a generalization of the McNemar test so that for a given pair of cutoff values only those patients are analyzed who are categorized differently by the two tests compared. The analyses are performed for all cutoff pairs that deliver identical sensitivities for both tests. We used data for total (tPSA) and complexed prostate-specific antigen (cPSA) from a recently published multicenter study to demonstrate the DAC method.
Results: The example shows that ROC analyses of subgroups can give contradictory results about the diagnostic accuracy of two markers, depending on the marker used for the selection of subgroups. The DAC method avoids artifacts attributable to questionable selection of subgroups and facilitates overall and local comparisons of the diagnostic accuracy of tests. The DAC results of the analyzed data set suggest that cPSA has higher diagnostic accuracy than does tPSA.
Conclusions: The DAC method is a suitable tool for comparing the clinical usefulness of laboratory markers. The DAC method could be considered as an additional tool to ROC analysis and could replace comparative ROC analyses of diagnostic tests, especially within subgroups defined by only one of the markers.
| Introduction |
|---|
Depending on the clinical use of a marker, only one portion of the ROC curve, the local performance of a test within a restricted range of false-positive rates, is often of higher importance than the overall performance represented by the AUC. Furthermore, ROC calculations always analyze the whole data set. Therefore, samples in nonrelevant ranges such as very low or high concentrations influence the analysis of diagnostic performance in the relevant concentration range and may complicate appropriate interpretation of the resulting graph and data, particularly when comparing the diagnostic accuracy of two tests.
To overcome these limitations of ROC curves, alternative indices have been proposed, e.g., partial areas over particular specificity intervals (5)(6). However, these approaches have not been widely used because of their methodologic complexity. Current practice, therefore, is to work around this disadvantage of ROC analysis by analyzing only a subgroup of the data with a limited concentration range. In this respect, the comparison between the clinical utility of total prostate-specific antigen (tPSA) and complexed PSA (cPSA) is a good example (7)(8)(9). When a subgroup is defined by a range of one test (A) and the diagnostic performance of both tests A and B is then compared within it, artifacts can result at the lower and upper ends of the concentration range of the subgroup. For example, if the lower end of the sample population is defined only by test A, then patients who are false negative for test A but true positive for test B are excluded from analysis (to the disadvantage of test B). Similar selection artifacts exist at the upper end of the concentration range of the subgroup's sample population. Misleading interpretations may result from these selection artifacts. This may be, at least partially, the reason for the contradictory conclusions of different working groups analyzing the diagnostic performance of cPSA vs tPSA (7)(8)(9)(10)(11)(12).
In this report, we present a new, easy-to-use approach for routine analysis, which we have called discordance analysis characteristics (DAC), that helps to avoid the above-mentioned disadvantages. The analysis is based on a generalization of the McNemar test so that, for a given pair of cutoff values, only those patients are analyzed who are categorized differently by the two tests. We demonstrate the potential usefulness of the method, using the example of cPSA vs tPSA with the data from a previously published multicenter study (7).
| Materials and Methods |
|---|
PSA concentrations were measured by the Bayer Immuno 1 PSA and cPSA assays (Bayer Diagnostics) as described previously (7)(13)(14).
Principle of DAC analysis
The basic approach of the DAC method can be illustrated with a scatterplot of the tPSA and cPSA values (Fig. 1) of the multicenter study (7). In general, continuous or near-continuous data are assumed, which is the case in the analysis of laboratory analytes. When a pair of cutoff values (COA and COB) for tests A and B, respectively, is used, four quadrants (Q1–Q4) result. The cutoff pairs that define the quadrants are chosen so that quadrants 2 and 4 contain the same number of true positives (prostate cancer [PCa] cases), i.e., COA and COB deliver identical sensitivity. This criterion follows recommendations for the diagnostic evaluation of tumor markers (15). Quadrants 1 and 3 contain cases categorized equivalently by test A and test B: "negative" in quadrant 1 (i.e., below the cutoff) and "positive" in quadrant 3 (i.e., above the cutoff).
The first step in the DAC approach is the selection of the samples used for analysis. Quadrants 2 and 4 contain those cases that are relevant for the comparison of the two tests because they were categorized discordantly. Quadrant 2 contains cases with negative results for test A and positive results for test B, whereas quadrant 4 contains cases with positive results for test A and negative results for test B. Thus, for each pair of cutoffs, the (Q2 + Q4) subpopulation of Q2 samples plus Q4 samples includes those samples that cause possible differences in diagnostic accuracy between the tests.
In the second step, we analyze the properties of the three local subpopulations selected in step 1: Q2 samples, Q4 samples, and/or (Q2 + Q4) samples. For that purpose, the true-positive (TPA and TPB), true-negative (TNA and TNB), false-positive (FPA and FPB), and false-negative (FNA and FNB) test results are counted for both tests A and B. There are several possibilities for analyzing these counts. For the analysis of (Q2 + Q4) samples, we use a specificity-resembling parameter: the DAC specificity in (Q2 + Q4) for test A is defined as DAC-SPECA = TNA/(TNA + FPA), and DAC-SPECB is defined accordingly.
The comparative analysis of Q2 samples vs Q4 samples is performed with a parameter resembling positive predictive value (PPV): the DAC-PPV for test A is defined as DAC-PPVA = TPA/(TPA + FPA) using the cases in Q4. The definition for test B is set accordingly using only Q2 cases.
Higher values of DAC specificity or DAC-PPV for one test indicate its superior diagnostic accuracy.
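The counting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the names `dac_point` and `DacPoint` are hypothetical, and treating a value exactly equal to the cutoff as "negative" is an assumption, because the text only speaks of values above and below the cutoff.

```python
from typing import NamedTuple, Optional, Sequence


class DacPoint(NamedTuple):
    """DAC parameters for one cutoff pair (hypothetical container)."""
    dac_spec_a: Optional[float]
    dac_spec_b: Optional[float]
    dac_ppv_a: Optional[float]
    dac_ppv_b: Optional[float]


def _ratio(num: int, den: int) -> Optional[float]:
    """Return num/den, or None when the denominator is zero."""
    return num / den if den else None


def dac_point(a: Sequence[float], b: Sequence[float],
              diseased: Sequence[bool], co_a: float, co_b: float) -> DacPoint:
    """Count discordant cases (Q2, Q4) and derive DAC-SPEC and DAC-PPV."""
    q2_d = q2_nd = q4_d = q4_nd = 0
    for xa, xb, d in zip(a, b, diseased):
        # Assumption: a result is "positive" only if strictly above the cutoff.
        pos_a, pos_b = xa > co_a, xb > co_b
        if pos_b and not pos_a:        # Q2: A negative, B positive
            if d:
                q2_d += 1
            else:
                q2_nd += 1
        elif pos_a and not pos_b:      # Q4: A positive, B negative
            if d:
                q4_d += 1
            else:
                q4_nd += 1
        # Q1 and Q3 (concordant results) are ignored by design.
    # Within Q2 + Q4, test A is positive only in Q4 and test B only in Q2.
    tp_a, fp_a, tn_a = q4_d, q4_nd, q2_nd
    tp_b, fp_b, tn_b = q2_d, q2_nd, q4_nd
    return DacPoint(
        dac_spec_a=_ratio(tn_a, tn_a + fp_a),
        dac_spec_b=_ratio(tn_b, tn_b + fp_b),
        dac_ppv_a=_ratio(tp_a, tp_a + fp_a),   # uses Q4 cases only
        dac_ppv_b=_ratio(tp_b, tp_b + fp_b),   # uses Q2 cases only
    )
```

Returning `None` for an empty quadrant is a design choice of this sketch; it keeps sparse cutoff pairs visible to the caller instead of silently producing a zero.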
It should be noted that the sum of DAC-SPECA and DAC-SPECB always equals 1. Likewise, DAC-PPVA + DAC-NPVB = 1 and DAC-PPVB + DAC-NPVA = 1. These identities follow from the equivalencies TPA = FNB, TPB = FNA, FPA = TNB, and FPB = TNA, which hold within the discordant quadrants. Therefore, only one test needs to be graphed for DAC specificities. Similarly, only one of the two parameters, PPV or negative predictive value (NPV), needs to be analyzed.
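These identities can be verified by writing out the counts. In the sketch below, the symbols d_2, d_4 (PCa cases in Q2 and Q4) and n_2, n_4 (non-PCa cases in Q2 and Q4) are introduced only for illustration, and DAC-NPV is assumed to be defined by analogy with DAC-PPV as TN/(TN + FN) for the respective test.

```latex
\begin{align*}
\mathrm{TP}_A = \mathrm{FN}_B &= d_4, \qquad \mathrm{TP}_B = \mathrm{FN}_A = d_2,\\
\mathrm{FP}_A = \mathrm{TN}_B &= n_4, \qquad \mathrm{FP}_B = \mathrm{TN}_A = n_2,\\[4pt]
\text{DAC-SPEC}_A + \text{DAC-SPEC}_B
  &= \frac{\mathrm{TN}_A}{\mathrm{TN}_A + \mathrm{FP}_A}
   + \frac{\mathrm{TN}_B}{\mathrm{TN}_B + \mathrm{FP}_B}
   = \frac{n_2}{n_2 + n_4} + \frac{n_4}{n_4 + n_2} = 1,\\[4pt]
\text{DAC-PPV}_A + \text{DAC-NPV}_B
  &= \frac{\mathrm{TP}_A}{\mathrm{TP}_A + \mathrm{FP}_A}
   + \frac{\mathrm{TN}_B}{\mathrm{TN}_B + \mathrm{FN}_B}
   = \frac{d_4}{d_4 + n_4} + \frac{n_4}{n_4 + d_4} = 1.
\end{align*}
```

The identity for DAC-PPVB + DAC-NPVA follows in the same way with the Q2 counts d_2 and n_2.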
In a third step, these calculations are done for all cutoff pairs (i = 1 ... n), where n is the number of all possible cutoff pairs that satisfy the criterion mentioned above. The parameters DAC-SPEC and DAC-PPV are graphed over COA,i and COB,i by use of two x axes. Alternatively, sensitivity could be used as the x axis.
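One way to enumerate cutoff pairs with identical sensitivity is sketched below. The text does not specify where exactly each cutoff is placed, so putting it midway between two neighboring values of the PCa cases is an assumption of this sketch, as is the hypothetical function name `equal_sensitivity_cutoffs`.

```python
from typing import List, Sequence, Tuple


def equal_sensitivity_cutoffs(
    a: Sequence[float], b: Sequence[float], diseased: Sequence[bool]
) -> List[Tuple[float, float, float]]:
    """Return (cutoff_a, cutoff_b, sensitivity) triples.

    For each k, both cutoffs leave exactly k diseased cases above them,
    so both tests have the same sensitivity k / (number of diseased).
    """
    da = sorted((x for x, d in zip(a, diseased) if d), reverse=True)
    db = sorted((x for x, d in zip(b, diseased) if d), reverse=True)
    pairs = []
    for k in range(1, len(da)):
        # With tied values no cutoff separates exactly k cases; skip this k.
        if da[k - 1] == da[k] or db[k - 1] == db[k]:
            continue
        # Assumption: place each cutoff midway between neighboring values.
        co_a = (da[k - 1] + da[k]) / 2
        co_b = (db[k - 1] + db[k]) / 2
        pairs.append((co_a, co_b, k / len(da)))
    return pairs
```

Each returned pair can then be passed to whatever routine computes DAC-SPEC and DAC-PPV, and the results plotted against either cutoff or against the shared sensitivity.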
To estimate the significance of different diagnostic accuracies, we suggest calculating the differences between DAC-SPECB and DAC-SPECA and between DAC-PPVB and DAC-PPVA for each pair of cutoffs. The pointwise confidence intervals (CIs) can then be calculated using the methods related to the difference of two proportions with formulas given by Altman (16). In the case of DAC-SPEC, paired observations must be considered, whereas nonpaired proportions for DAC-PPV can be assumed. For both DAC-SPEC and DAC-PPV, differences >0 would be indicative of superior diagnostic accuracy for test B, and lower limits of the CIs >0 would indicate significance of the result.
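The pointwise intervals can be sketched with the standard large-sample (Wald) formulas for paired and unpaired differences of proportions. Whether these are exactly the formulas from reference (16) is an assumption; the function names are hypothetical. For DAC-SPEC, every non-PCa case in (Q2 + Q4) is negative with exactly one of the two tests, so all pairs are discordant and the paired formula simplifies accordingly. For DAC-PPV, the Q4 and Q2 cases are disjoint groups of patients, which is why an unpaired formula applies.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def diff_ci_unpaired(x_a: int, n_a: int, x_b: int, n_b: int,
                     z: float = Z_95) -> Tuple[float, float, float]:
    """Difference p_b - p_a of two independent proportions with Wald CI.

    For DAC-PPV: x_a = TP_A and n_a = all Q4 cases; x_b = TP_B and
    n_b = all Q2 cases.
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


def diff_ci_dac_spec(tn_a: int, tn_b: int,
                     z: float = Z_95) -> Tuple[float, float, float]:
    """Difference DAC-SPEC_B - DAC-SPEC_A with a paired Wald CI.

    tn_a = non-diseased cases in Q2, tn_b = non-diseased cases in Q4.
    Paired standard error: sqrt(b + c - (b - c)^2 / n) / n, where the
    discordant counts b and c sum to n because every case is discordant.
    """
    n = tn_a + tn_b
    diff = (tn_b - tn_a) / n
    se = math.sqrt(n - (tn_b - tn_a) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se
```

A lower limit above 0 from either function would correspond to the significance criterion stated above. Wald intervals are known to behave poorly for small counts, so sparse quadrants deserve caution.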
This approach, leading to one or several graphs characterizing the discordant test results, is called the DAC method. A suitable computer program has been developed. (Copies of the program can be obtained from Dr. Keller at thomas.keller{at}acomed.de or www.acomed-statistics.com/dac-method.html.)
Sometimes it may be appropriate to consider a summary measure. We propose the use of medians of DAC-SPEC and DAC-PPV values and the medians of ratios of these parameters: RDAC-SPEC = median (DAC-SPECB,i/DAC-SPECA,i) and RDAC-PPV = median (DAC-PPVB,i/DAC-PPVA,i), respectively. For example, a value >1 for the latter medians would indicate the better diagnostic accuracy of marker B. However, like the AUC of ROC curves, these overall measures do not provide information about the local diagnostic performance.
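The summary ratios can be computed directly from the pointwise values. The sketch below assumes the per-cutoff values are already available as two equally long lists; the function name `median_ratio` is hypothetical, and skipping cutoff pairs with a missing or zero denominator is an assumption of this sketch rather than something the text prescribes.

```python
from statistics import median
from typing import Optional, Sequence


def median_ratio(values_b: Sequence[Optional[float]],
                 values_a: Sequence[Optional[float]]) -> float:
    """Median of the pointwise ratios values_b[i] / values_a[i].

    Pairs with a missing value or a zero denominator are skipped
    (assumption of this sketch).
    """
    ratios = [vb / va for vb, va in zip(values_b, values_a)
              if vb is not None and va]
    return median(ratios)
```

For example, `median_ratio([0.8, 0.6, 0.9], [0.2, 0.4, 0.1])` takes the median of the ratios 4, 1.5, and 9, which would point to better accuracy of marker B because it exceeds 1.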
Statistical analysis
DAC-SPEC, DAC-PPV, and the pointwise CIs of their differences were calculated as described above (16). An assay was considered superior if its DAC-SPEC and DAC-PPV values were higher than those of the comparative assay. For graphical presentations, raw data (counts) were smoothed by use of a triangular smoothing function (17).
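A triangular smoother is a weighted moving average whose weights fall off linearly from the center. The sketch below shows the idea; the window width and the handling of the series ends (renormalizing over the available neighbors) are assumptions, since the text gives neither, and the name `triangular_smooth` is hypothetical.

```python
from typing import List, Sequence


def triangular_smooth(values: Sequence[float], half_width: int = 2) -> List[float]:
    """Weighted moving average with triangular weights.

    The center point gets weight half_width + 1, and the weight drops by 1
    per step away from the center. At the ends of the series, the weights
    of the available neighbors are renormalized (assumption of this sketch).
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for offset in range(-half_width, half_width + 1):
            j = i + offset
            if 0 <= j < n:
                weight = half_width + 1 - abs(offset)
                num += weight * values[j]
                den += weight
        smoothed.append(num / den)
    return smoothed
```

With `half_width=1` the weights are 1, 2, 1, so an isolated spike of 3 in a series of zeros is spread out to 0.75, 1.5, 0.75.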
All calculations and graphs for ROC analysis were made with an Excel (version XP for Windows; Microsoft Corporation) software program (www.acomed-statistics.com/roc-tools.html). Differences in ROC curves were estimated according to DeLong et al. (3). CIs for the AUC were calculated according to Hanley and McNeil (2). The significances of the overall parameters [medians of DAC-SPEC and DAC-PPV and medians of their ratios (RDAC-SPEC and RDAC-PPV)] were considered on the basis of the 95% CIs of their medians calculated by bootstrapping (18), using 10 000 bootstrap replicates. The method was programmed using the statistical computer program R (19)(20).
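The bootstrap step can be sketched as a percentile interval. The original analysis was programmed in R and its resampling scheme is not described here, so the Python sketch below makes two assumptions: patients (records) are resampled with replacement, and the CI is read from the percentiles of the recomputed statistic. The name `bootstrap_ci` and its parameters are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bootstrap_ci(records: Sequence[T],
                 statistic: Callable[[Sequence[T]], float],
                 n_boot: int = 10_000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for an arbitrary statistic of the records."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(records)
    stats = []
    for _ in range(n_boot):
        # Resample n records with replacement and recompute the statistic.
        sample = [records[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

In a DAC setting, `statistic` would be a function that takes the resampled patients, recomputes the pointwise DAC values over all cutoff pairs, and returns their median or the median of their ratios.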
| Results |
|---|
To estimate the diagnostic performance only in the range of interest, 2–4 µg/L, a subgroup analysis seems appropriate. Scatterplots for the subgroup of patients with tPSA concentrations in the range 2–4 µg/L and corresponding cPSA concentrations in the range 1.51–3.19 µg/L are shown in panels A and B, respectively, of Fig. 3. The graphs demonstrate that, particularly at the edges of the selected concentration range, different patients are included in such a subgroup analysis. As shown in the ROC curves (Fig. 3, C and D), cPSA-based selection of patients leads to a significant difference between AUCs (P <0.02), whereas tPSA-based selection fails to show this difference (P = 0.15). Furthermore, the absolute values of the AUC obtained by the different selection procedures differ (tPSA, 0.55 vs 0.48; cPSA, 0.58 vs 0.53), as do the positions and shapes of the ROC curves.
The results obtained with the DAC method (Fig. 4) show the DAC-SPEC and DAC-PPV values as well as the calculated differences of DAC-SPEC and DAC-PPV graphed over the cutoffs of both analytes. DAC-SPEC and DAC-PPV values were significantly higher for cPSA in a wide range of tPSA between 2.5 and 5.8 µg/L, corresponding to cPSA values of 1.9 to 4.8 µg/L, as indicated by the positive values for the lower CI limit.
The median DAC-SPECcPSA value of 0.78 (95% CI, 0.61–0.87) differed significantly from the DAC-SPECtPSA (0.22; 95% CI, 0.13–0.39). Results were similar for the DAC-PPV of 0.63 (95% CI, 0.53–0.82) for cPSA vs 0.31 (95% CI, 0.27–0.50) for tPSA. The CIs of the medians of both pairs of DAC-SPEC for cPSA and tPSA and of DAC-PPV, respectively, did not overlap and indicated a significant difference between the medians. The medians of the ratios RDAC-SPEC and RDAC-PPV, explained in the Materials and Methods, were calculated to be RDAC-SPEC = 3.57 (95% CI, 1.53–6.67) and RDAC-PPV = 2.01 (95% CI, 1.30–2.69) and differed significantly from 1.
| Discussion |
|---|
Conclusions about the diagnostic validity of cPSA vs tPSA have generally been based on ROC analysis, taking into account the AUC values and, in part, the comparison of sensitivity and specificity at certain cutoffs. Only a few studies exist for the low tPSA range <4 µg/L (7)(8)(9)(11). In a multicenter study including more than 500 men with tPSA <4 µg/L, the differences between cPSA and tPSA in differentiating men with PCa and men without PCa were not clearly demonstrated (7). Although a significantly larger AUC for cPSA in the tPSA range 2.5–4 µg/L was found, differences in the specificities of cPSA vs tPSA at the selected sensitivities of 80%, 85%, 90%, and 95% were not statistically significant for all sensitivity values. Two similar multicenter studies of men with tPSA concentrations <4 µg/L described improved detection of PCa by cPSA based on differences between the AUCs (8)(11). In addition to the different clinical settings used in these studies, these discrepancies may be attributable, at least partially, to selection artifacts at the edges of narrow tPSA ranges, as demonstrated in Fig. 3. Therefore, in regard to the conventional strategy of comparative ROC analysis, the results of various studies for the evaluation of the diagnostic impact of cPSA should be considered with caution.
These uncertainties in analyzing the data and interpreting the study results were the starting point for us to develop the DAC method. As described in the Results and demonstrated in Fig. 4, DAC analysis allows description of the overall and local differences in the clinical utility of both tests and suggests a significant advantage of cPSA over tPSA, as indicated by higher values for DAC-SPEC and DAC-PPV, respectively. The results are caused by the lower number of FP samples for cPSA compared with a higher number of FP samples for tPSA among the patients with discordant test results. The clinical impact of these results will be discussed in a separate report.
Comparing ROC and DAC
The comparison of the results of ROC and DAC analyses of our reevaluated data and the corresponding conclusions make it necessary to discuss the utility of both methods.
The major disadvantage of ROC analysis was described in the introductory paragraph and is caused by the property of the ROC approach that gives equal weight to all FP rates (5). Therefore, when comparing two tests, the performances of both assays near the cutoffs are difficult to describe by use of ROC curves of the whole data set. To overcome this problem, it is current practice to perform ROC analysis on subgroups of the data set defined by a limited concentration range of one of the markers in question. However, this approach is subject to severe biases resulting from selection effects when the subgroups are defined by a concentration range of only one of the assays in question, as can be seen in Fig. 3. In conclusion, the diagnostic performance of one assay alone around a cutoff cannot be described in a representative way, nor is it possible to get an error-free comparative analysis of the relative performance of two diagnostic tests.
In contrast, the initial step of the DAC method is an error-free, clearly defined selection of local subpopulations. The method focuses on the discordant test results. It considers exactly those cases that solely are responsible for differences in diagnostic accuracy. Selection artifacts are thus avoided. The DAC approach may make the comparative ROC subgroup analyses unnecessary.
Whereas it is current practice to combine the results of several subgroup analyses of different ranges to describe the performance in selected ranges, the DAC approach leads to meaningful and easy-to-read data and graphs for the comparison of tests within only one analysis. The concentration ranges with different diagnostic accuracies can immediately be identified. In terms of hypothesis testing, the null hypothesis of no difference can be tested at prospectively chosen cutoffs or ranges of cutoffs. Furthermore, comparison of the results of different studies can be simplified because the result of DAC analysis (e.g., DAC-SPEC and DAC-PPV) is not influenced by any subgroup selection. This is in contrast to a ROC analysis, in which subgroup selections with different ranges around a given cutoff would lead to different values for sensitivity and specificity.
In our example, the differences between cPSA and tPSA are smaller at the upper end of the concentration range (close to 6 µg/L) of the sample population. This is attributable to the inclusion criterion of the study samples (tPSA <6 µg/L), which leads to underrepresentation of FP values in Q4 compared with Q2 in this concentration range.
In addition to this pointwise analysis, the DAC method gives a valid overall picture. Medians of DAC-SPEC and DAC-PPV or the median of the ratios RDAC-SPEC and RDAC-PPV characterize the overall performance, whereas their CIs estimate the corresponding significance level.
Unlike the ROC analysis, which is based on and limited to the calculation of sensitivities and specificities, the DAC method is primarily a selection tool for subpopulations responsible for differences in the diagnostic accuracy of tests. The DAC method opens up a new way of separating study populations: properties of Q2 samples can be evaluated vs the properties of Q4 samples, which may provide data of clinical relevance. Here we focus on the test results, such as TP and TN values, but it would also be possible to perform DAC analysis on variables such as age or tumor stage. These approaches allow deeper insights into causes or consequences of different test results and will be presented in a separate report.
Regarding the parameters DAC-SPEC and DAC-PPV analyzed here, one has to take note of their interesting properties, which are attributable to the equivalencies described in the Materials and Methods and lead to a simplification of analysis. The DAC-SPEC values of both tests add up to 1, and DAC-PPV and DAC-NPV depend on each other.
The DAC-PPV used here is strongly related to the physician's decision-making because it refers to the proportion of people with a positive test who have the target disorder (5)(23). The prevalence dependency of this parameter must be taken into account, however. For example, low prevalences in screening settings would lead to lower DAC-PPV values. This should affect the DAC-PPV values of both tests in a quite similar manner. However, the aim of the DAC method is not to calculate DAC-PPVs as absolute values but to compare them to assess differences in diagnostic accuracy. The hypothesis test regarding differences in DAC-PPV does not depend on prevalence. Furthermore, the ratio of the two values strongly reduces the prevalence dependency.
As can be seen in Fig. 1, the DAC method is quite easy to use. We programmed a calculating tool that can be used as an add-in within Excel (Microsoft Corp.). (Copies of the program can be obtained from Dr. Keller at thomas.keller{at}acomed.de or www.acomed-statistics.com/dac-method.html.)
In practice, there are two difficulties to be solved before applying DAC analysis: First, it is necessary to define the corresponding pairs of cutoffs. In this report, we described the determination using the criterion of equal numbers of TP test results. However, particularly in the case of low numbers of these cases, difficulties occur because of a strong influence of the local distribution of these cases in the scatterplot. Therefore, similar criteria, such as equal numbers of TN values or equal numbers of cases, can be applied. The regression approach according to Passing and Bablok (24) applied to all patients or only to positive cases is also helpful. It should be mentioned that with the data set analyzed here all four possibilities lead to similar results.
The second difficulty results from the low absolute numbers in quadrants Q2 and Q4 in ranges of low sample density. This would give rise to excessive scatter and fluctuation of the resulting curves. Although the overall result (medians compared by bootstrapping) remains unaffected, the graph is hard to read. For this reason, appropriate smoothing procedures should be used before graphing of the results. For the DAC method, simple smoothing procedures such as use of weighted means are sufficient.
In summary, the DAC method represents an adequate analytical tool for comparing the diagnostic performance of two assays. Its positive features are the ability to assess in detail the local performance of two tests, e.g., close to clinically relevant cutoffs, without compromising the overall picture, and the avoidance of subgroup selection artifacts. The DAC method could be used in parallel with ROC analysis of the complete sample population and could replace comparative ROC analyses of subgroups.
| Acknowledgments |
|---|
| Footnotes |
|---|
| References |
|---|