Evidence-based Laboratory Medicine and Test Utilization
1 Department of Clinical Chemistry, University of Szeged, Medical Faculty, Szeged, Hungary; 2 Laboratoire de Biologie Polyvalente, Hôpital Général, Rodez, France; 3 Department of Pathology and Laboratory Medicine, The Ottawa Hospital, Ottawa, Ontario, Canada; 4 Department of Clinical Chemistry, Atrium Medical Centre, Heerlen, The Netherlands; 5 Institute of Clinical Laboratory Diagnosis, Zagreb University School of Medicine, Clinical Hospital Center, Zagreb, Croatia; 6 Laboratory of Clinical Biochemistry, Haukeland University Hospital, Bergen, Norway; 7 Department of Medical Informatics, University of Szeged, Medical Faculty, Szeged, Hungary.
a Address correspondence to this author at: Department of Clinical Chemistry, University of Szeged, Medical Faculty, Somogyi Bela ter 1, Szeged, H-6725 Hungary. E-mail ahorvath{at}clab.szote.u-szeged.hu.
| Abstract |
|---|
Methods: After systematic searches of published and electronic resources dated between 1999 and 2007, 26 DM GLs, published in English, were selected and scored for methodological quality using the AGREE Instrument. Subgroup analyses were performed based on the source, scope, length, origin, and date and type of publication of GLs. Using a checklist, we collected laboratory-specific items within GLs thought to be important for interpretation of test results.
Results: The 26 diagnostic GLs had significant shortcomings in methodological quality according to the AGREE criteria. GLs from agencies that had clear procedures for GL development, were longer than 50 pages, or were published in electronic databases were of higher quality. Diagnostic GLs contained more preanalytical or analytical information than combined (i.e., diagnostic and therapeutic) recommendations, but the overall quality was not significantly different. The quality of GLs did not show much improvement over the time period investigated.
Conclusions: The methodological shortcomings of diagnostic GLs in DM raise questions regarding the validity of recommendations in these documents that may affect their implementation in practice. Our results suggest the need for standardization of GL terminology and for higher-quality, systematically developed recommendations based on explicit guideline development and reporting standards in laboratory medicine.
| Introduction |
|---|
| Materials and Methods |
|---|
selection of guidelines eligible for the study
Based on the titles and/or abstracts, references were screened for relevance (E. Nagy, J. Watine). Using this subset, 2 independent reviewers (E. Nagy, A.R. Horvath) applied the following inclusion criteria: the publication fulfilled the definition of GLs (1) and dealt with the use of laboratory tests for the diagnosis or monitoring of DM, and the GL was publicly available in a peer-reviewed journal and/or in nationally or internationally endorsed GL databases. If several updates of the GL were available during the studied time period, only the latest version was selected. All these criteria had to be met for enrollment into the study.
We excluded publications that contained therapeutic recommendations only, were primarily focused on technical/analytical/quality control/standardization/quality management issues, referred to special patient groups, or offered local protocols on best practice (i.e., restricted to 1 particular health care setting).
evaluation of the methodological quality of guidelines
It has been shown that compliance of most GL development programs with the AGREE criteria is high (8), and, therefore, we used AGREE, a standardized, generic, and validated checklist (4)(9)(10), along with its accompanying Training Manual, for the assessment of GLs. AGREE arranges 23 criteria, hereafter referred to as AGREE items I1 through I23, into 6 key domains (D1 through D6). Selected GLs were randomly allocated to 2 assessment teams with 4 reviewers per team. Reviewers were trained in the use of AGREE in a pilot study before conducting this larger survey (5). To make the appraisal process as objective as possible, reviewers were provided with all supplementary files referenced by each GL and found in the public domain (E. Nagy), including background supporting materials, technical papers, or general GL development manuals issued by the respective GL agency.
Reviewers independently assessed the fulfillment of the AGREE criteria on a 4-point Likert scale. Disagreements in 2 or more scores between appraisers were resolved by discussion and consensus. An independent reviewer's opinion (A.R. Horvath) was required in 1 case only for reaching consensus. Domain scores were expressed in percentages, and a final conclusion was reached about the acceptability of the GL according to the instructions of AGREE. A GL was "strongly recommended" if the majority of items scored 3 or 4 and most domain scores (i.e., at least 4 of 6) were >60%. A GL was "not recommended" if the majority of items rated 1 or 2 and most of the domain scores (i.e., 4 or more of 6) were <30%. Guidelines were "recommended with provisos or alterations" when the GL rated high (3 or 4) or low (1 or 2) on a similar number of items and most domain scores were between 30% and 60%. To investigate whether diagnostic GLs meet additional reporting standards (11) that are not covered in depth in AGREE, we assessed the presence of the following items: (a) an evidence table, (b) a description of the grading system, (c) graded recommendations, and (d) an expiry or review date. Additionally, we assessed whether the GL contained data thought to be important for test interpretation (3), such as (e) prevalence, (f) diagnostic accuracy of tests, (g) preanalytical, and (h) analytical specifications. All reviewers checked the availability of these items, and results were summarized by 1 independent assessor (E. Nagy).
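The scoring rules above can be illustrated with a minimal Python sketch. It assumes the standardized domain score defined in the AGREE Instrument, i.e., (obtained score - minimum possible score) / (maximum possible score - minimum possible score), expressed as a percentage. The function names, the use of 1 consensus score per item, and the handling of borderline cases are assumptions made for illustration only and do not reproduce the authors' actual workflow.

```python
from typing import Sequence


def domain_score(scores: Sequence[Sequence[int]], lo: int = 1, hi: int = 4) -> float:
    """Standardized AGREE domain score in percent.

    scores[r][i] is the rating (lo..hi) that reviewer r gave to item i
    of a single domain.
    """
    n_reviewers = len(scores)
    n_items = len(scores[0])
    obtained = sum(sum(row) for row in scores)
    minimum = lo * n_items * n_reviewers
    maximum = hi * n_items * n_reviewers
    return 100.0 * (obtained - minimum) / (maximum - minimum)


def overall_assessment(item_scores: Sequence[int],
                       domain_scores: Sequence[float]) -> str:
    """Apply the three-way rule described in the text.

    item_scores: one consensus score (1-4) per AGREE item (assumption).
    domain_scores: the 6 standardized domain scores in percent.
    """
    high_items = sum(1 for s in item_scores if s >= 3)   # items scored 3 or 4
    low_items = len(item_scores) - high_items            # items scored 1 or 2
    domains_above_60 = sum(1 for d in domain_scores if d > 60)
    domains_below_30 = sum(1 for d in domain_scores if d < 30)

    if high_items > low_items and domains_above_60 >= 4:
        return "strongly recommended"
    if low_items > high_items and domains_below_30 >= 4:
        return "not recommended"
    # Everything else is treated as the intermediate category (assumption).
    return "recommended with provisos or alterations"
```

For example, 4 reviewers who each give a score of 3 to all 3 items of a domain yield an obtained score of 36, against a minimum of 12 and a maximum of 48, i.e., a standardized domain score of about 67%.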
We created 5 subgroups of GLs based on their source, scope, length, and origin and whether they were supplemented with a guideline methods manual. We also investigated the quality of guidelines according to the date and type of publication. In the statistical analyses (K. Boda), the mean item and standardized domain scores of GL subgroups were compared by the Kruskal–Wallis test. Pair-wise comparisons were carried out using the Mann–Whitney U-test with Bonferroni correction. The frequency of reporting laboratory-specific information in different guideline subgroups was compared with the Fisher exact test. The level of significance was set at P < 0.01 because of multiple comparisons. All analyses were performed using SPSS for Windows, version 13.
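As an illustration of the frequency comparison, the following Python sketch computes a two-sided Fisher exact P value for a 2x2 table using only the standard library, together with a Bonferroni-adjusted significance threshold. The analyses in this study were run in SPSS; the function names and the example counts below are hypothetical and are not data from the study. The two-sided rule used here (summing all tables that are no more probable than the observed one) is a common convention and is an assumption about how the test was configured.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(x_min, x_max + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical counts: rows = GL subgroup, columns = item reported / not reported.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
is_significant = p_value < 0.01  # threshold stated in the text
```

The exact test is a natural choice with only 26 GLs, because expected cell counts in the subgroup tables are too small for a chi-square approximation to be reliable.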
| Results |
|---|
critical appraisal of guidelines
Based on the assessment of methodological quality, 22 GLs were recommended by reviewers, of which only 11 were strongly recommended and the rest "with provisos and alterations." Four GLs had 4 or 5 domains with scores <30%, and reviewers did not recommend their use (Table 1).
The domain and item scores of individual GLs are shown in Table 1 and online Data Supplement 2, respectively. Table 2 summarizes the mean item scores and the number and proportion of GLs scoring >3 on the 4-point Likert scale. Overall, the best-performing domain was D1, "scope and purpose" (77%; Table 1), with a high proportion of GLs scoring above 3 for all items (Table 2). Although D4, "clarity and presentation," scored highly (76%; Table 1), I18 within this domain performed poorly, as only 10 GLs (38%) were supported with tools for application (Table 2).
Domains 2 and 3, which explored the process of GL development, showed lower scores (Table 1). Nine GLs (35%) scored >60% in "stakeholder involvement" and 14 (54%) in "rigor of development" domains. Of individual items in D2, only a small proportion of GLs gave information about the composition and affiliations of the guideline development group, provided some information on patient involvement in the development process, defined their target users clearly, and pilot tested the GL by target users before publication (Table 2). In D3, there were notable shortcomings in using systematic methods for searching the evidence or at least giving some information about literature retrieval, describing clearly the criteria for selecting the evidence, indicating the methods used for formulating recommendations, and giving information on the peer review and updating process. The lowest scores were achieved with "applicability" (34%) and "editorial independence" (39%) domains, in which each item performed very poorly (Table 2 and online Data Supplement 2).
qualitative analysis of guidelines
Date of publication.
Quality of GLs was also investigated according to the date of publication to see whether any improvement can be observed over time. Only the highest scoring D1 and D4 showed some marginal development in quality over the time scale investigated (Table 1). GLs seem to have become more specific in stating their objectives (I1) and in creating more focused clinical questions (I2), and the recommendations in GLs have become more easily identifiable (I17) (online Data Supplement 2). However, the poor performance in D6 showed further deterioration from 2005 onward, with failures to report editorial independence and conflict of interest in the majority of GLs (Table 1).
Type of publication.
We investigated how GL developers defined the type of their publications and whether these reflected the methods used for their development. There was diversity in definitions: 19 publications were labeled as GLs or recommendations (7 of which stated that they were evidence-based), 4 as position statements or reports, and 3 as guidance documents (Table 3). Among the 7 evidence-based GLs, 5 had evidence summaries and 6 graded recommendations. Three GLs that had evidence tables did not define their publications as evidence-based GLs (22)(23)(25). More than two-thirds of GLs (n = 18, 69%) defined their grading system, but only 16 (62%) graded their final recommendations (Table 3).
Procedure for updating guidelines.
Item 14 investigates whether GL developers describe the procedures for updating recommendations, including the timescale, responsibilities, and methods used. Fifteen GLs (58%) gave a timescale or expiration date, of which 1 GL provided this information in a separate GL development manual of the issuing authority (Table 3). The most frequent review intervals were 3 and 4 years. Only 10 GLs (38%) provided adequate information on the updating process (Table 2).
Reporting of laboratory-specific information in diagnostic guidelines.
We investigated whether GLs covered essential laboratory-specific information, such as prevalence/pretest probability and diagnostic accuracy data or preanalytical and analytical factors critical for the correct interpretation and application of laboratory results in clinical practice (Table 3). About 60% of the GLs mentioned these factors. Reporting these pieces of information was more frequent in diagnostic than in combined GLs, but the difference was not statistically significant in the various GL subgroups, as discussed below (online Data Supplement 3).
subgroup analysis
GLs were grouped according to source, scope, length, origin, and availability of a guideline methods manual, to investigate whether there were statistically significant differences in GL quality in these subsets. Results are shown in Table 4 and online Data Supplements 3 and 4.
Subgrouping by source.
Grouping GLs by source of publication revealed that 1 GL was published in a peer-reviewed journal, 19 were available in electronic GL databases, and 6 in both sources. The GL that was published exclusively in a peer-reviewed journal (14) was not recommended for use by the assessors. None of the 6 GLs published in both peer-reviewed journals and GL databases were strongly recommended. GLs published in electronic guideline databases only received a more favorable overall assessment. A notable difference, at a level of significance of P < 0.05, could be observed in D5 only for the electronic GLs (Table 4).
Subgrouping by scope.
The rate of occurrence of strongly recommended GLs was higher for the combined (50%) than for the diagnostic (30%) GLs, but the rate of GLs not recommended was also higher in the combined group. The difference was moderate (P < 0.05) in D2 only, with combined GLs scoring higher (Table 4). Moderate differences were also found in 4 individual items (online Data Supplement 4). Diagnostic GLs defined their objectives better (I1) and considered the cost implications of the recommendations more frequently (I20), whereas combined GLs defined their target users (I6) and their updating processes more precisely (I14) than diagnostic ones.
Subgrouping by length.
A clear relationship could be demonstrated between GL length and methodological quality (Table 4). Most GLs that were not recommended were shorter, and all strongly recommended guidelines were longer than 50 pages. Significant differences between these subgroups could be found for most domains, with higher quality of the longer GLs. Moderate differences (P < 0.05) could be observed with the "applicability" and "clarity and presentation" domains. However, the best-performing GLs, scoring >50% in the "applicability" domain (21)(24)(27)(28)(35)(36)(37), were generally longer than 50 pages, and all were published in electronic databases (Table 1).
Subgrouping by origin.
The majority of the strongly recommended GLs (7 of 11) originated from the UK; the other 4 were from New Zealand, Australia, and the USA (Table 1). Significant differences (P < 0.01) could be observed in fulfilling the criteria of D2, with higher scores for the British GLs. In the "rigor of development" and "clarity and presentation" domains, the difference was moderate (P < 0.05) (Table 4).
Subgrouping by availability of guideline methods manual.
Two-thirds of GLs had some accompanying manuals describing the methods of their development in some form. All strongly recommended GLs had such a manual (Table 4). All domain scores were better in the subset where these manuals were available, and the differences were highly statistically significant (P < 0.01) in D4, D5, and D6. In D1, D2, and D3, the P values were also significant, but at values somewhat >0.01.
| Discussion |
|---|
In our study, the quality of purely diagnostic GLs was not significantly different from that of combined GLs (Table 4). Our additional evaluation in Table 3 showed that nearly half of all GLs do not report preanalytical, analytical, and diagnostic accuracy data (3), which may lead to inappropriate interpretation of test results in clinical practice (44). Fulfilling these criteria would be desirable in any GL that deals with laboratory testing–related issues, since it is expected that practice recommendations are developed in a multidisciplinary process (45). Unfortunately, this could not be confirmed by our study, as only 41% of the criteria were fulfilled in D2, which explored the involvement of all relevant stakeholders in the GL development process.
All GLs that scored better in the comparison by origin were from agencies with detailed GL manuals that provided a clear description and standards for the development process (Table 1). The availability of a GL manual, however, does not always guarantee that GL teams follow those processes consistently, and it has been shown that it is often not clear how decisions are made by the GL team when arriving at final recommendations (8). The substantial heterogeneity, in both how the type of publication is defined and the adherence to this definition in the final presentation of the GL, suggests that there is likely a disparity between the methodology GL developers describe and what is actually followed in practice (Table 3). We found several GLs that described a grading system but did not grade their final recommendations. The lack of evidence tables in GLs that claim to be evidence-based may also point to potential deviations from the processes set in GL manuals. Therefore, it is advisable that diagnostic GL development teams adhere to preset methodology and document the procedures followed explicitly.
We could not demonstrate major improvements in GL quality for most domains, and in the "editorial independence" domain, deterioration in scores was observed over time. We further evaluated the quality of GLs over time in some cases where the authorities issued several GLs [e.g., National Institute for Clinical Excellence (NICE), WHO, International Diabetes Federation (IDF)] within the time scale investigated (data not shown). The NICE GL in 2004 is of higher quality than the NICE 2002 version due to improvements in "applicability" and "editorial independence" domains. It is noteworthy that many international organizations have improved the rigor of their guideline development process and are moving toward international standardization (11)(46)(47)(48). Surprisingly, the international WHO and IDF GLs in 2006 and 2007 had lower scores in most domains than the 2003 and 2005 versions, despite the fact that both agencies released guideline development manuals in 2003 (http://whqlibdoc.who.int/hq/2003/EIP_GPE_EQC_2003_1.pdf, http://www.idf.org). Therefore, we assume that the lower AGREE scores are due to the lack of reporting some methodological details rather than the lack of following the methodology described in the manuals. Explicit reporting of methodology and adherence to that methodology is particularly important for influential agencies (e.g., American Diabetes Association and WHO) whose recommendations are followed or adapted worldwide.
There are several limitations in our study. By evaluating English publications only, our results may suffer from language bias. However, several publications, including our own review of the topic, confirm no significant differences in the quality of English vs non-English publications of guidelines or trials (2)(49)(50). Because most national DM GLs are based on or strongly influenced by international recommendations primarily published in English, we believe our results are likely to be generalizable.
Our study evaluated different publications that were defined in various ways by their authors. Such heterogeneity of definitions (such as guideline, guidance, protocols, position statement, recommendation and rationale statement, consensus report) may highlight different approaches in formulating recommendations for practice. We also found several GLs that, while having proof of using evidence-based methods, failed to define their publication as such (22)(23)(25). This suggests that the definitions used in the international guideline community may be confusing for both GL developers and users, and that simplification and standardization of terminology is needed. One may argue that AGREE can be used for assessing evidence-based GLs only. However, AGREE is a generic and widely accepted toolbox (8) that can investigate the GL development process irrespective of whether it applies evidence- or consensus-based methodology (4). In fact, most evidence-based GLs have a substantial element of consensus-based judgment, especially when evidence is conflicting or lacking. In the latter case, GL developers should still search for and appraise the "best available" evidence before they conclude that the best they can do is to reach consensus.
Our study does not determine whether there are relationships between the methodological quality of GLs and the validity of their content. The AGREE Instrument or other GL appraisal tools can investigate neither the accuracy of the content of recommendations nor their impact on patient outcomes (51)(52). Another shortcoming of all critical appraisal tools is that they do not differentiate between whether the publication fails certain criteria due to lack of reporting or to poor methodology and design. Therefore, our results should not be interpreted as criticisms of the truth of scientific statements or the validity of recommendations made in a given publication. However, the demonstrated shortcomings in reporting and/or the methodology applied by different GL developers could lead to distrust in and/or misuse of recommendations (53). With such shortcomings, the energy put into developing scientifically accurate but otherwise poorly presented GLs could end up being wasted, whereas inaccurate but otherwise nicely presented GLs might be promoted and used widely. This is why we advise that GLs be critically evaluated for both methodology and content before recommendations are used in clinical practice (38).
In conclusion, our results suggest the need for systematically developed, explicit recommendations based on evidence-based guideline development and reporting standards in laboratory medicine. Our study also highlights the need for simplification and standardization of GL terminology. Further studies are needed to explore in depth the relationship between the scientific validity and the methodological quality of diagnostic recommendations in DM.
| Acknowledgments |
|---|
Authors' Disclosures of Potential Conflicts of Interest: No authors declared any potential conflicts of interest.
Role of Sponsor: The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.
Acknowledgments: Several authors of this article (J. Watine, W. Oosterhuis, D. Rogic, S. Sandberg, P. S. Bunting, A.R. Horvath) are members of the Committee on Evidence-Based Laboratory Medicine of IFCC, and this work was carried out in collaboration with the IFCC Task Force on the Global Campaign for Diabetes Mellitus.
| Footnotes |
|---|
| References |
|---|