Review |
1 Department of Clinical Epidemiology, Biostatistics and Bioinformatics.
2 The Dutch Cochrane Centre, Academic Medical Center, University of Amsterdam, The Netherlands.
3 Department of Medicine and Aging, School of Medicine and Aging Research Center, Ce.S.I., "Gabriele D'Annunzio" University Foundation, Chieti-Pescara, Italy.
4 Department of Public Health and Epidemiology, University of Birmingham, Edgbaston, Birmingham, United Kingdom.
a Address correspondence to this author at: Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, P.O. Box 22700, 1100 DE Amsterdam, The Netherlands. Fax 0031-20-6912683; e-mail m.m.leeflang{at}amc.uva.nl.
| Abstract |
|---|
Methods: We evaluated the methodological quality of 487 diagnostic-accuracy studies in 30 systematic reviews with the QUADAS (Quality Assessment of Diagnostic-Accuracy Studies) checklist. We applied 3 strategies that varied both in the definition of quality and in the statistical approach to incorporate the quality-assessment results into metaanalyses. We compared magnitudes of diagnostic odds ratios, widths of their confidence intervals, and changes in a hypothetical clinical decision between strategies.
Results: Following 2 definitions of quality, we concluded that only 70 or 72 of 487 studies were of "high quality". This small number was partly due to poor reporting of quality items. None of the strategies for accounting for differences in quality led systematically to accuracy estimates that were less optimistic than ignoring quality in metaanalyses. Limiting the review to high-quality studies considerably reduced the number of studies in all reviews, with wider confidence intervals as a result. In 18 reviews, the quality adjustment would have resulted in a different decision about the usefulness of the test.
Conclusions: Although reporting the results of quality assessment of individual studies is necessary in systematic reviews, reader wariness is warranted regarding claims that differences in methodological quality have been accounted for. Obstacles for adjusting for quality in metaanalyses are poor reporting of design features and patient characteristics and the relatively low number of studies in most diagnostic reviews.
| Introduction |
|---|
The methodological quality of studies can be defined in terms of their susceptibility to bias. Studies with methodological shortcomings, such as inclusion of healthy control individuals or selective use of multiple reference standards to verify index test results, have produced different measures of test accuracy (1)(2)(3)(4)(5). In most cases, such deficiencies have been associated with inflated estimates of diagnostic accuracy. The inclusion of lower-quality studies in a metaanalysis may therefore produce unrealistically high-accuracy estimates. Accounting for quality differences can be expected to produce less optimistic summary estimates of diagnostic accuracy.
Design feature variability and the presence of studies with suboptimal designs in a systematic review may also increase heterogeneity in results among studies (6)(7)(8). Given these considerations, one can expect strategies that account for quality in metaanalyses of diagnostic accuracy to lead to more homogeneous results and therefore to more precise estimates, with narrower confidence intervals around the accuracy measures of interest than estimates without quality adjustment.
Quality assessment of individual studies in a review may identify both design deficiencies that can lead to bias and sources of variation that can lead to heterogeneity. Several quality-assessment tools, most of which use a "checklist" approach, have been developed for diagnostic-accuracy studies (5). A recently developed generic quality-assessment tool based on a modified Delphi procedure (5)(9) has been recommended by the Cochrane Collaboration as a starting point for quality assessment in diagnostic reviews (10).
Although quality appraisal has been recognized as an essential step of systematic reviews, how study quality should be addressed in metaanalyses of diagnostic-accuracy studies is less clear (5)(11). Strategies to incorporate study quality into metaanalyses can be broadly divided into 3 categories: including all studies, irrespective of quality; analyzing subgroups that differ in quality; and multivariable regression analysis. The slightly different recommendations given in the guiding reports are all based on sparse evidence (12)(13)(14).
To test the hypothesis that adjustment for quality produces less optimistic estimates of diagnostic accuracy and narrower confidence intervals, we compared 3 different strategies for incorporating quality in analyzing a number of previously published systematic reviews of diagnostic-accuracy studies.
| Materials and Methods |
|---|
study set
To include a broad sample of diagnostic studies that examined a variety of tests over time, we conducted a systematic electronic search for systematic reviews of diagnostic-accuracy studies published between January 1999 and April 2002 (5). This search produced a set of 28 reports of systematic reviews (see appendix in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol53/issue2). Details of the search strategy are available from the authors. Inclusion criteria were (a) a systematic review of diagnostic test-accuracy studies, (b) inclusion of at least 10 original studies, (c) no exclusion of primary studies based on design features, and (d) the ability to reproduce the 2 x 2 tables from the original studies. The 28 reports yielded 30 systematic reviews. Details of the inclusion process are reported elsewhere (5).
A variety of conditions and index tests were studied in these 30 reviews (Table 1). The median number of studies in a review was 14 (interquartile range, 10–20). The median sample size of the individual studies was 100 (interquartile range, 43–288).
assessment of methodological quality
We assessed the methodological quality of all 487 studies included in the 30 reviews with items from the QUADAS instrument (9) (Table 2). We limited ourselves to the 7 QUADAS items most closely related to methodological quality and did not use the items that referred to quality of reporting. We dichotomized each item by scoring as deficient any study feature that was not reported.
QUADAS item 1 (Table 2) refers to both the generalizability of results and the possibility that the study may produce biased results. We assessed 3 patient-spectrum components that refer to the distorted selection of participants, because previous studies have linked these components to biased accuracy estimates. These components were consecutive enrollment of patients, case-control or 2-gate design vs cohort design, and avoidance of limited challenge (2)(4). Limited challenge was defined as the exclusion of patients with disease characteristics that may produce false-positive or false-negative results (e.g., exclusion of patients with existing lung disorders in an accuracy study of spiral computed tomography for the diagnosis of pulmonary embolism). A 2-gate study was defined as a case-control study in which cases and controls are sampled from 2 distinct source populations by means of different selection criteria (15).
Two independent assessors conducted quality assessments, and consensus meetings resolved disagreements. If necessary, a third person made the final decision.
metaanalysis
We used the summary ROC model of Moses and Littenberg for our metaanalysis (16)(17)(18). Their model uses linear regression analysis to examine how D, the natural logarithm of the DOR, changes as a function of S, which is the sum of logit(sensitivity) and logit(1 − specificity). S is related to the threshold for classifying a test as positive.
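In symbols, the model just described can be sketched as follows. The letters a and b for the intercept and slope are labels chosen here for illustration; the text does not name them.

$$
\begin{aligned}
\operatorname{logit}(p) &= \ln\frac{p}{1-p},\\
D &= \ln(\mathrm{DOR}) = \operatorname{logit}(\text{sensitivity}) - \operatorname{logit}(1 - \text{specificity}),\\
S &= \operatorname{logit}(\text{sensitivity}) + \operatorname{logit}(1 - \text{specificity}),\\
D &= a + bS.
\end{aligned}
$$

When b = 0, the DOR does not depend on the threshold and exp(a) is the common DOR.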
We modeled the intercept and slope of the model as fixed effects but included a random effect to allow for variation beyond chance among studies (19). We weighted studies by the inverse of the variance of the log DOR to allow for the precision with which each study measured the log DOR. This procedure gave more weight to larger studies.
In the multivariable quality-adjustment strategies, covariates representing quality items were added to the model; this step allowed the intercept and slope in the regression analysis to differ between subgroups of studies defined by the corresponding covariate. In all strategies, we estimated the summary DOR over all studies in the metaanalysis at the mean S value of these studies. Because the DOR cannot be calculated in 2 x 2 tables containing a zero, we added 0.5 to all 4 cells in these situations as a continuity correction (16)(20).
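The calculation described above can be illustrated with a minimal Python sketch. It is a simplification, not the model fitted in the study: it uses a fixed-effect, inverse-variance weighted regression of D on S and omits the random effect and the quality covariates. The function names, the (tp, fp, fn, tn) tuple layout, the large-sample variance formula for ln(DOR), and the use of the unweighted mean of S are assumptions made for illustration.

```python
import math


def log_dor_and_s(tp, fp, fn, tn):
    """Return (D, S, var_D) for one 2 x 2 table.

    D = ln(DOR); S = logit(sensitivity) + logit(1 - specificity).
    0.5 is added to all 4 cells only when the table contains a zero.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    logit_tpr = math.log(tp / fn)  # logit(sensitivity)
    logit_fpr = math.log(fp / tn)  # logit(1 - specificity)
    d = logit_tpr - logit_fpr
    s = logit_tpr + logit_fpr
    # Assumption: usual large-sample variance of ln(DOR).
    var_d = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return d, s, var_d


def summary_dor(tables):
    """Weighted regression D = a + b*S; summary DOR evaluated at the mean S.

    Fixed-effect sketch only: the between-study random effect is omitted.
    """
    rows = [log_dor_and_s(*t) for t in tables]
    weights = [1 / v for _, _, v in rows]
    total_w = sum(weights)
    s_w = sum(w * s for w, (_, s, _) in zip(weights, rows)) / total_w
    d_w = sum(w * d for w, (d, _, _) in zip(weights, rows)) / total_w
    sxx = sum(w * (s - s_w) ** 2 for w, (_, s, _) in zip(weights, rows))
    sxy = sum(w * (s - s_w) * (d - d_w) for w, (d, s, _) in zip(weights, rows))
    slope = sxy / sxx if sxx > 0 else 0.0  # no spread in S: intercept-only fit
    intercept = d_w - slope * s_w
    mean_s = sum(s for _, s, _ in rows) / len(rows)  # assumption: unweighted mean
    return math.exp(intercept + slope * mean_s)
```

For example, `summary_dor([(80, 20, 20, 80)] * 3)` pools 3 identical studies with 80% sensitivity and 80% specificity and returns their common DOR.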
strategies for incorporating quality
We compared the following 3 statistical approaches to account for quality in metaanalyses: (a) The "restrict" strategy applied to metaanalysis of high-quality studies only. Studies were regarded as "high-quality" when they fulfilled all quality criteria. (b) The "adjust all" strategy involved multivariable adjustment for all individual quality items by including all these items in a single multivariable model, irrespective of the strength of the association between these items and the DOR. (c) The "selective adjustment" strategy consisted of multivariable adjustment for only those quality items that were significantly associated with the DOR in a univariable analysis (P for entry <0.2) (21)(22).
These strategies were compared with a reference strategy in which all studies within the original metaanalysis were included, irrespective of their quality characteristics.
Differences in results between strategies may depend both on the definition of quality and on the statistical approach used. We therefore considered 2 different sets of quality items to define higher-quality studies. The first set was chosen because there is empirical evidence that they can lead to biased results (4)(5). This set, referred to as the "evidence-based" quality definition, includes QUADAS items 5, 6, 10, and 11 (Table 1). The second set of quality items (QUADAS items 1, 5, and 6) is referred to as the "common practice" quality definition and was selected because these 3 items are often applied in diagnostic reviews (5)(11). The restrict strategy and the adjust-all strategy were applied twice, once with the evidence-based definition of quality and once with the common-practice definition.
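The 2 quality definitions and the restrict strategy can be expressed as a short Python sketch. The data layout (a dictionary of QUADAS item numbers per study), the function names, and the `min_studies` parameter are hypothetical; the rule that an unreported item counts as deficient and the minimum of 3 high-quality studies for pooling follow the text.

```python
# QUADAS item numbers for the 2 quality definitions named in the text.
EVIDENCE_BASED = (5, 6, 10, 11)
COMMON_PRACTICE = (1, 5, 6)


def is_high_quality(scores, definition):
    """True only if every item in the definition is explicitly fulfilled.

    scores maps a QUADAS item number to True (fulfilled), False (not
    fulfilled), or None (not reported). Missing or unreported items are
    treated as deficient.
    """
    return all(scores.get(item) is True for item in definition)


def restrict(studies, definition, min_studies=3):
    """Return the high-quality studies, or None if too few remain to pool."""
    kept = [s for s in studies if is_high_quality(s["quadas"], definition)]
    return kept if len(kept) >= min_studies else None
```

A study with item 10 unreported would count as high quality under the common-practice definition but not under the evidence-based one, which illustrates how the 2 definitions can select different subsets.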
comparisons and analysis
We compared the summary DOR and its 95% confidence interval for the reference strategy, which included all studies, with the 3 quality-adjusting strategies in all 30 systematic reviews. Differences in results between strategies were analyzed within each systematic review with the Wilcoxon signed rank test to determine whether a strategy consistently led to higher or lower estimates of diagnostic accuracy. To investigate whether the strategies that adjusted for quality also resulted in more precise summary DOR estimates, we again used the Wilcoxon signed rank test statistic to compare the different approaches with respect to the absolute widths of the natural logarithm of the 95% confidence interval around the mean DOR.
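A paired comparison of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAS procedure used in the study: it computes the signed rank sums and a two-sided P value from the normal approximation, without tie or continuity corrections, whereas statistical packages often use an exact distribution for small samples. The function name and input layout (2 equally long lists of per-review values, such as log summary DORs under the reference strategy and under one adjustment strategy) are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Zero differences are dropped; tied absolute differences share the
    average rank. Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, p_value
```

The same function could be applied to the widths of the confidence intervals on the log scale, computed per review as ln(upper limit) − ln(lower limit).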
To determine whether the change in summary DOR would affect clinical decisions, we used 4 arbitrary categories, which were defined by the absolute size of the summary DOR. If a metaanalysis resulted in a point estimate of the DOR <16, the test was regarded as not useful. We regarded a test with a DOR of 16–81 as moderately useful, a test with a DOR of 81–361 as useful, and a test with a DOR >361 as very useful. The DOR values of 16, 81, and 361 correspond to sensitivity-specificity pairs of 80%/80%, 90%/90%, and 95%/95%, respectively.
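The correspondence between these cutoffs and the sensitivity-specificity pairs follows from DOR = [sensitivity/(1 − sensitivity)] × [specificity/(1 − specificity)]. A minimal Python sketch is given below; the function names are hypothetical, and the handling of a DOR exactly equal to 81 or 361 is an assumption, because the text does not say to which category a boundary value belongs.

```python
def dor_from_sens_spec(sensitivity, specificity):
    """Diagnostic odds ratio for a sensitivity-specificity pair (both < 1)."""
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))


def usefulness(dor):
    """Map a summary DOR to one of the 4 categories described in the text.

    Assumption: boundary values 81 and 361 fall in the lower category.
    """
    if dor < 16:
        return "not useful"
    if dor <= 81:
        return "moderately useful"
    if dor <= 361:
        return "useful"
    return "very useful"
```

For instance, a pair of 90% sensitivity and 90% specificity gives odds of 9 on each side, so the DOR is 9 × 9 = 81.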
We used SAS for Windows, version 9.1.3 (SAS Institute) for all analyses and the proc mixed procedure in SAS to fit all models.
| Results |
|---|
Studies of the case-control or 2-gate type were included in 9 of the 30 reviews. Whether all patients had received the reference standard and whether the reference standard was the same for each patient were well reported (99% of the studies). In 3 reviews, the primary studies used different reference standards to verify index test results.
Applying the evidence-based definition of quality (items 5, 6, 10, and 11 of the QUADAS checklist) identified 72 (15%) of the 487 primary studies as high quality. With this definition, 12 of the 30 systematic reviews had no high-quality studies, and 9 reviews included at least 3 high-quality studies.
Applying the common-practice definition identified 70 high-quality studies (14%). With this definition, 9 systematic reviews contained no high-quality studies, and 11 reviews had at least 3 high-quality studies. Use of both definitions yielded only 3 reviews that contained ≥3 high-quality studies.
comparing the pooled estimates of the various strategies
The summary DORs and the corresponding 95% confidence intervals were obtained for all 30 systematic reviews with the reference and 3 quality-adjustment strategies (Fig. 2).
The evidence-based restrict strategy, which analyzed only high-quality studies according to the evidence-based definition, could be applied in 9 reviews containing ≥3 high-quality studies. In 3 cases, the DOR for the high-quality studies was higher than the DOR obtained by ignoring quality and including all studies, whereas the opposite occurred in 5 cases (P = 0.64). In 1 review, the DOR did not change, because all studies were high-quality studies according to the evidence-based definition. We found only 2 or fewer high-quality studies in the other reviews, and we did not calculate a summary estimate based on these small numbers.
The restrict strategy with the common-practice definition could be used in 11 reviews. This restrict strategy produced a higher DOR in 4 metaanalyses and a lower estimate in 7 others. The mean odds ratio was not significantly higher or lower when quality was not incorporated, compared with the different restrictive strategies (Table 3).
When we included all the items of the evidence-based quality definition as covariates in the multivariable model, model building failed in 9 reviews. In these reviews, at least 1 of the quality criteria was not fulfilled by any of the included studies. In 9 of the other 21 reviews, the adjust-all strategy resulted in a DOR estimate that was higher than when quality was not incorporated; 11 times the estimate was lower. In 1 review, all of the original studies could be regarded as of high quality, so there was no change in the summary DOR.
With the common-practice definition, we were able to make a multivariable adjust-all model in 23 reviews. The estimated DOR was higher in 10 reviews and lower in 13. The differences between analyzing studies irrespective of their quality and analyses with the 2 multivariable strategies were not significant (Table 3).
The selective-adjustment strategy included only items that were significantly associated with accuracy in a univariable analysis (P <0.2). In 18 reviews, none of the QUADAS items was significantly associated with accuracy, so no covariates entered the model and the summary DOR was identical to that of the reference analysis of all original studies. In 5 reviews, a single QUADAS item had a significant effect; 2, 3, and 4 items were significant in a further 5, 1, and 1 metaanalyses, respectively. The selective-adjustment strategy led to a higher estimate in 5 cases and to a lower estimate in 7 cases, compared with the metaanalysis in which quality was not incorporated.
Fig. 3 shows the relative DORs (compared with not including quality in the analysis) for the various quality-adjustment strategies. The symmetrical distribution around unity illustrates that there is no systematic trend in underestimating or overestimating the DOR of a test. However, in 5 cases, the alternative strategy resulted in a DOR >5 times higher than when quality was disregarded; in 3 cases the relative DOR was <0.2.
None of the quality-adjustment strategies produced systematically narrower confidence intervals for the summary DOR than analyzing studies irrespective of their quality (Table 3). The confidence intervals were significantly wider with the restrict and adjust-all strategies (P <0.01) but did not significantly differ with the selective-adjustment method (P = 0.08).
Because differences between strategies can be due to both differences in quality definitions and differences in statistical methods, we compared the results between statistical methods within 1 definition. We also compared the results with 2 quality definitions within 1 strategy. We observed no systematic differences between the 2 approaches, either for the summary estimates or for their 95% confidence intervals.
The judgment about the usefulness of a test based on the magnitude of the summary DOR was not affected in 12 of the 30 reviews with any of the quality-adjustment strategies (Fig. 2). In 18 reviews, the quality-adjusted DOR obtained with 1 or more of the quality-adjustment strategies ended in a different category than the DOR obtained with all studies included. The DOR was higher in 14 cases and lower in 17 others (Fig. 2).
| Discussion |
|---|
A main problem that authors of systematic reviews encounter is poor reporting of study characteristics, and our study was no exception (23). We scored any study feature that was not reported as deficient. Dichotomizing QUADAS items into a simple "yes" or "no" can lead to loss of information, especially when many study characteristics are unreported. Some QUADAS items, such as the use of an adequate reference standard and the generalizability of the patient spectrum, could not be assessed at all in our data set. Both of these items can have a large effect on the performance of a test under study, and a proper incorporation of these characteristics could have resulted in a larger effect of the quality-adjustment strategies.
Because our analysis unit was the single metaanalysis, our sample size was only 30. Therefore, the power for detecting significant trends between strategies was limited, despite the inclusion of 487 individual studies. The 30 systematic reviews covered a wide range of clinical topics and diagnostic tests, with a wide variability in the magnitude of the DOR. Our primary outcome variable was the DOR, which is a single accuracy indicator that incorporates both the sensitivity and specificity of a test. Such a single indicator is convenient in the analysis, but it also means that any given summary DOR can be produced by innumerable sensitivity-specificity combinations. In practice, the value of 1 accuracy measure, say sensitivity, may be more critical than another if the implications of false-positive and false-negative test results differ in severity.
In our analysis, we refrained from calculating summary quality scores for studies and labeling any study that exceeded a certain threshold score as high quality. Such summary quality scores have been extensively studied, and criticized, in systematic reviews of intervention studies. Different shortcomings in study design may cause different forms of bias, making it almost impossible to determine the weight that should be given to each quality item in calculating such quality scores (24)(25). We also did not include a sequential analysis of the studies based on their quality ranking, which would have led to a quality-adjusted cumulative metaanalysis (26). This strategy also requires a hierarchical approach to study quality in that it assumes that some criteria are more important than others and that studies fulfilling more criteria are of higher quality.
Several previous studies have linked design features of diagnostic-accuracy studies to changes in accuracy estimates. One systematic review documented the theoretical and empirical evidence for several sources of bias (4)(5). Two publications, which examined these effects in a collection of systematic reviews, both reported significant effects for a number of features across metaanalyses (1)(2). We can only speculate why we failed to find any systematic differences from incorporating these study features in the metaanalysis process. These earlier studies analyzed the impact of deficiencies in quality in a large number of diagnostic-accuracy studies across a variety of systematic reviews, whereas our study assessed the impact of these quality items on estimates of diagnostic accuracy within systematic reviews. Furthermore, the number of studies with methodological deficiencies was small in a number of the systematic reviews included in our analysis, whereas other reviews contained only studies with deficiencies. Many of these studies with a deficient study design had a small sample size (27). Because the weight of an individual study depends on sample size, these studies had only a minor impact on the summary estimate of diagnostic accuracy. Furthermore, if 2 or more quality items influence accuracy but in opposing directions, the overall estimate obtained irrespective of quality may be similar to the estimate based on high-quality studies only. It is also possible that incomplete reporting has led to misclassification of design features in our project, which may have jeopardized our attempts to find differences in accuracy.
There are other potential explanations for our failed attempts at quality adjustment. The effects of several study-design features may not always be in the same predictable direction. Whether partial verification, for example, will lead to accuracy estimates that are unchanged, lower, or higher depends on the pattern of verification and the reference standards being used. The ratio of patients with unverified positive index test results to patients with unverified negative test results matters, in particular when being verified or not is related to the presence or absence of the target condition.
Similar remarks have been made in the field of intervention studies, where more metaepidemiologic studies like ours have been performed (28)(29). The aim in metaepidemiologic studies is to evaluate the importance of 1 or more design features across a substantial number of systematic reviews. These studies have shown that metaepidemiologic studies require substantial numbers of systematic reviews with sufficient differences in methodological quality among the included studies. Furthermore, if the effects of design features vary in direction among reviews or even among studies within a single review, metaepidemiologic studies may produce summary estimates that suggest no effect at all (30)(32). Although we have found no systematic trend in results among strategies, reviews in which adjusting for quality has led to substantially different results clearly exist. Because we do not know the true magnitude of accuracy, it is impossible to tell whether the adjusted estimates were closer to the truth.
Not only did we fail to find support for our hypothesis that adjusting for quality will result in less optimistic estimates of test accuracy, we also found no evidence for the hypothesis that adjusting for quality leads to less heterogeneity in results and therefore to smaller confidence intervals. On the contrary, the alternative analyses generally produced broader confidence limits. The main reason for this result is that the alternative strategies were based on fewer studies.
Our study did not produce evidence for the superiority of one type of adjustment over another. Low-quality studies can produce accuracy statistics that do not differ from those obtained in high-quality studies. Although methodological quality may influence the results of metaanalyses, a direct association with results is not necessarily present.
In any review, poor quality will affect the trustworthiness of the conclusions of that review. Our results indicate that the strategy used to correct for quality may affect the estimated accuracy, but not in a predictable way. Our results also indicate that measuring and incorporating quality in a diagnostic review is not a simple task of routinely scoring a few standard quality items and then adjusting for these variables in a multivariable model.
There may be good reasons to identify some quality criteria as crucial for the credibility and applicability of any systematic review. An example could be the selection of the reference standard (QUADAS item 3). These criteria may then be used as inclusion criteria for the review, and authors of systematic reviews might want to report how many studies had to be excluded based on that criterion.
Reporting the quality-assessment results of the studies included in a review remains a necessity because it informs readers about the overall quality of those studies and may point out differences in design that can help to explain some of the heterogeneity in results. The QUADAS instrument can be used for that purpose. We propose to score "not reported" as a separate category where applicable, and we hope that a more widespread implementation of the STARD statement will lead to better reporting in future reports of diagnostic-accuracy studies (33)(34).
We feel it necessary that quality-assessment results in a systematic review be summarized in a table or a figure. A table can list the extent to which each of the studies fulfilled the quality criteria. A figure, such as the stacked bar chart in Fig. 1, can then display the studies for which each of the respective criteria was fulfilled so that the reader can obtain an overview of the quality of the studies included in the review. Plotting results for all of the included studies in ROC space and coding individual studies by color or with symbols can help readers recognize the characteristics of individual studies.
In our view, whether quality is also to be incorporated in a metaanalysis depends on several factors. In the first place, analyzing quality is not even an option if the number of included studies is too low. If the results are very heterogeneous, quality differences can be used to search for an explanation for the heterogeneity, and such a search can be accommodated by stratification or, if appropriate, regression analysis. Caution is needed because it is not unusual for the potential explanations for observed differences to outnumber the studies in a systematic review. It is important to recognize the major limitations of metaepidemiologic approaches in metaanalysis.
Quality is a multidimensional concept, and the importance of individual quality items will vary from one research project to another. The goal of adjusting for quality differences in metaanalysis will remain attractive but elusive until we have large-scale systematic reviews and fully informative reporting in individual studies.