Opinion
1 Julius Center for Health Sciences and Primary Care, University Medical Center, Utrecht, The Netherlands
Address correspondence to this author at: Julius Center for Health Sciences and Primary Care, University Medical Center, P.O. Box 85500, 3508 GA Utrecht, The Netherlands. Fax 31-30-2505485; e-mail K.G.M.Moons@jc.azu.nl.
| Introduction |
|---|
Various reviews have demonstrated that the majority of published studies of diagnostic accuracy still have methodologic flaws in design or analysis or provide results with limited practical applicability (1)(2)(3). This has been attributed to the absence of a proper methodologic framework for diagnostic test evaluations as, for example, exists for studies of therapies and etiologic factors and has motivated various researchers to establish frameworks for studies of diagnostic accuracy, such as the recent STARD Initiative (4)(5)(6)(7)(8)(9)(10)(11)(12). In our view, an issue that has received too little attention in most of these methodologic essays is the difference between test research and diagnostic research.
By "test research" we refer to studies that follow a single-test or univariable approach, i.e., studies focusing on a particular test to quantify its sensitivity, specificity, likelihood ratio (LR), or area under the ROC curve (ROC area). We call this test research because it merely quantifies the "characteristics" of the test rather than the test's contribution to estimating the diagnostic probability of disease presence or absence. By "diagnostic research" we refer to studies that aim to quantify a test's added contribution, beyond test results readily available to the physician, in determining the presence or absence of a particular disease. Although the multivariable and probabilistic character of medical diagnosis is slowly gaining appreciation in medical research, the majority of studies on diagnostic accuracy may still be regarded as test research (2)(3)(8).
We believe that test research has limited applicability to clinical practice. Below we describe why we believe this is the case, provide a brief description of a better approach, and give two clinical examples illustrating the hazards of test research. Finally, we describe the few instances in which test research may be worthwhile.
| Why Does Test Research Have Limited Relevance to Practice? |
|---|
test characteristics are not fixed
The second reason that results from test research have limited relevance is that a test's sensitivity, specificity, LR, and ROC area tend to be taken as properties or characteristics of a test. This, however, is a misconception, as we discussed recently (13). It is widely accepted that the predictive values of a test vary across patient populations. However, several studies have empirically shown that the sensitivity, specificity, and LR of a test may vary markedly, not only across patient populations (14) but also within a particular study population (13)(15)(16)(17). Within different patient subgroups, defined by patient characteristics or other test results, a particular test may have different sensitivities and specificities. This is because all diagnostic results obtained from patient history, physical examination, and additional tests are to some extent related to the same underlying disorder. For example, immobility, gender, and use of oral contraceptives are associated with the development of, and thus the presence of, deep venous thrombosis (DVT). In turn, the presence of DVT determines the presence of symptoms and signs and also (the probability of finding) a positive d-dimer assay result. Accordingly, via the underlying disorder, all diagnostic results are somehow correlated and thus mutually determine each other's sensitivity, specificity, and LR to various extents (13)(15)(16)(17). A single value of a test's sensitivity, specificity, LR, ROC area, or predictive value that applies to all patients of a study sample does not exist. Hence, there are no fixed test characteristics.
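To make the point about subgroup variation concrete, the following minimal sketch computes sensitivity and specificity for the same test within two patient subgroups and in the pooled sample. All counts and subgroup labels are hypothetical, chosen only for illustration; they are not taken from any of the cited studies.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts (tp, fn, fp, tn) for one test, stratified by what
# history taking and physical examination had already shown.
subgroups = {
    "few clinical findings": (20, 10, 10, 160),
    "many clinical findings": (66, 4, 30, 50),
}

# Pool the subgroups cell by cell to obtain the overall 2x2 table.
pooled = tuple(sum(cells) for cells in zip(*subgroups.values()))

results = {name: sens_spec(*cells) for name, cells in subgroups.items()}
results["all patients (pooled)"] = sens_spec(*pooled)

for name, (sens, spec) in results.items():
    print(f"{name:24s} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this made-up sample the pooled sensitivity and specificity describe neither subgroup well, which is the sense in which a single "test characteristic" for all patients can mislead.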
selection bias
The most widely acknowledged limitation of test research is that studies often use improper patient recruitment and study designs (1)(2)(3)(7). Investigators often select study participants among those who underwent the reference test in routine practice, i.e., selection based on a "true" presence or absence of the disease. The results of the test(s) under study are retrieved from the medical records and then compared across those with and without the disease. Such a case-control design commonly leads to selection bias, known as verification, workup, or referral bias (9)(18)(19).
Although such patient recruitment methods and study designs have decreased in the past decade, test research is still frequently based on individuals selected based on their final diagnosis (1)(2)(3). The need for proper patient recruitment is extensively addressed in the STARD checklist (11)(12). Study participants should be selected in agreement with the indication for diagnostic testing in practice, i.e., on their suspicion of having a particular disease, rather than on the presence or absence of that disease. Such unbiased selection of study participants may indeed be problematic for diagnostic laboratories or imaging centers that do not have access to consecutive series of patients suspected of having the disease. Moreover, most hospital databases code patients according to their final diagnosis rather than by their presenting symptoms or signs. The use of a system to register patients not only on their final diagnosis but also on their clinical presentation would enhance the validity and clinical relevance of diagnostic accuracy research (20).
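The direction of the verification bias described above can be illustrated with simple arithmetic. The sketch below assumes a hypothetical cohort of 1000 suspected patients and assumed referral fractions (90% of test-positive and 20% of test-negative patients undergo the reference test); none of these numbers comes from the cited studies.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical full cohort of 1000 suspected patients, 200 with the disease.
# In this cohort the test has sensitivity 0.80 and specificity 0.90.
full_cohort = {"tp": 160, "fn": 40, "fp": 80, "tn": 720}

# Assumed referral pattern: patients with a positive index test are far more
# likely to undergo the reference test than patients with a negative one.
p_verify = {"tp": 0.9, "fp": 0.9, "fn": 0.2, "tn": 0.2}

# Expected 2x2 table when only verified patients are retrieved from records.
verified = {cell: n * p_verify[cell] for cell, n in full_cohort.items()}

true_sens, true_spec = sens_spec(**full_cohort)
obs_sens, obs_spec = sens_spec(**verified)

print(f"full cohort:   sensitivity={true_sens:.2f} specificity={true_spec:.2f}")
print(f"verified only: sensitivity={obs_sens:.2f} specificity={obs_spec:.2f}")
```

Under these assumptions, restricting the analysis to verified patients overstates sensitivity and understates specificity, even though the test itself is unchanged.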
| Proposed Approach for Diagnostic Accuracy Research |
|---|
Because the d-dimer assay will always be applied after history taking and physical examination, the statistical analysis requires a comparison of the (average) probability of disease presence without and with the d-dimer assay, overall or in subgroups. Such sequential modeling of the diagnostic probability as a function of different combinations of test results can be done using, e.g., multivariable logistic regression. Such multivariable analyses account for the mutual dependencies between different test results and thus indicate which tests truly do and which do not independently contribute to the estimation of the probability of disease presence. In addition, various orders of diagnostic testing can be analyzed. The result of such analysis is the definition of one or more diagnostic prediction models including only the relevant tests. If needed, such prediction models can be simplified to obtain readily applicable diagnostic decision rules for use in practice. Various authors have applied or described the details of such an analytical approach (20)(23)(24)(25)(26)(27).
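The sequential analysis described above can be sketched as follows: fit one logistic model with the clinical predictors only, fit a second model that adds the test result, and compare the ROC areas. This is a minimal illustration on simulated data, not the analysis used in any of the cited studies. The variable names, the simulation parameters, and the plain gradient-descent fitting routine are all assumptions made for the example; in practice a standard statistical package would be used.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, n_iter=800):
    """Fit a logistic model by batch gradient descent; returns [intercept, coefs...]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, X):
    """Predicted probability of disease for each patient."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def roc_area(scores, y):
    """ROC area as the probability that a diseased patient outranks a non-diseased one."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def simulate(n=400, seed=0):
    """Simulated patients: (history, exam, test) findings and disease status."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        history = int(rng.random() < 0.4)
        exam = int(rng.random() < 0.4)
        disease = int(rng.random() < sigmoid(-1.5 + history + exam))
        test = int(rng.random() < (0.9 if disease else 0.3))
        rows.append((history, exam, test))
        y.append(disease)
    return rows, y


rows, y = simulate()
X_clinical = [(h, e) for h, e, _ in rows]   # history and examination only
X_full = list(rows)                          # history, examination, and test

w_clinical = fit_logistic(X_clinical, y)
w_full = fit_logistic(X_full, y)

auc_clinical = roc_area(predict(w_clinical, X_clinical), y)
auc_full = roc_area(predict(w_full, X_full), y)
print(f"ROC area, clinical model:      {auc_clinical:.2f}")
print(f"ROC area, clinical + test:     {auc_full:.2f}")
print(f"added value of test (ROC area): {auc_full - auc_clinical:.2f}")
```

The ROC areas here are computed on the same data used for fitting, so they are apparent (optimistic) estimates; the difference between the two models, not either value alone, is the quantity of interest.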
Multivariable diagnostic prediction models or rules are not the solution to everything. They may have several drawbacks, such as overoptimism, although methods have been described to overcome some of these drawbacks (23). The need for multivariable modeling in diagnostic research, however, is not different from other types of medical research, such as etiologic, prognostic, and therapeutic research. It is not the singular association between a particular exposure or predictor and the outcome that is informative, but their association independent of other factors. For example, in etiologic research, investigators never publish the crude estimate between exposure and outcome only, but always the association in view of other risk factors (confounders), using a multivariable analysis as well (13). Similarly, in diagnostic accuracy research, multivariable modeling is necessary to estimate the value of a particular test in view of other test results. As in other types of research, such knowledge cannot be inferred from singular, univariable test parameters (7)(8)(13).
Fortunately, a multivariable approach in design and analysis aiming to quantify the independent value of diagnostic tests has gained approval (20)(23)(24)(25)(26)(27). In addition, the above study question on the added value of the d-dimer assay in diagnosing DVT has been evaluated in such a way. The d-dimer assay appeared to have an added predictive value to patient history and physical examination, particularly in patients who have a low clinical probability of DVT (27).
| Clinical Examples |
|---|
In an Australian study, 399 consecutive dyspeptic patients referred for endoscopy underwent two tests, the rapid urease test and the 13C breath test, for Helicobacter pylori (HP) with endoscopy as the reference test (28). The investigators found large differences in the test results between patients with a normal and abnormal endoscopy. The sensitivity and specificity were 96% and 67% for the rapid urease test and 91% and 82% for the 13C breath test. The authors concluded that the HP tests might have potential for the initial evaluation of dyspepsia and needed further evaluation in general practice. A second study was done by Weijnen et al. (26). Using a sequential multivariable approach, they found in a consecutive series of 565 dyspeptic patients referred for endoscopy that the HP test did not add diagnostic information to the predictors from history (i.e., history of ulcer, pain on empty stomach, and smoking). The ROC area of the model with only predictors from patient history was 0.71, which was increased to only 0.75 (P = 0.46) after addition of the HP test result. They concluded that HP testing in all dyspeptic patients has no value in addition to history taking.
Cowie et al. (29) studied a consecutive series of 122 patients suspected of heart failure. They measured in each patient the plasma concentrations of three natriuretic peptides, A-type natriuretic peptide (ANP), N-terminal ANP, and B-type natriuretic peptide (BNP), as well as the presence or absence of heart failure, using consensus diagnosis based on chest radiography and echocardiography as the reference test. They found that the mean concentration of each natriuretic peptide separately (single-test approach) was significantly greater in the patients with heart failure (all P < 0.001). They also evaluated all three together in a multivariable logistic prediction model. Only the BNP measurement remained significantly associated with heart failure presence, whereas the other two did not add any predictive information.
Both examples show that one may qualify a test differently (commonly more promisingly) when only the results of a univariable or single-test approach are considered. Evaluating a particular test in view of other test results and accounting for mutual dependencies may decrease or even eliminate its diagnostic contribution, simply because the information provided by that test is already provided by the other tests. Because in real life any test result is always considered in view of other patient characteristics and test results, diagnostic accuracy studies that address only a particular test and its characteristics have, in our view, limited relevance to practice. Indeed, as shown by Reid et al. (30), test characteristics are hardly ever actually used by practitioners.
| Is There a Place for Test Research? |
|---|
The second situation, as suggested previously, is in the initial phase of developing a new test or evaluating an existing test in a new context; single-test evaluations in these circumstances may be useful for efficiency reasons (4)(6)(7)(25). Such initial test research should apply a case-control approach, preferably starting with a sample of patients with the disease (cases) and a sample of healthy controls. If the test cannot differentiate between these two extreme or heterogeneous outcome categories, the test development process would likely be terminated. In such instances, it is unlikely that the test will show discriminative value in patients suspected of having the disease, i.e., the population for which the test is intended, because these patients present with similar disease profiles, leading to an even more homogeneous case mixture. However, once the test does yield "satisfactory" diagnostic indices in such an initial test research study, we believe that its independent predictive contribution to existing diagnostic information in a clinical context can and must still be quantified by the above proposed approach.