Facing Challenging Situations When Grading Strength of Evidence (Text Version)

On September 20, 2011, Holger Schünemann and Nancy Berkman made this presentation at the 2011 Annual Conference.

Slide 1

Facing Challenging Situations When Grading Strength of Evidence

Presenters:
Holger Schünemann, MD, PhD, McMaster University
Nancy Berkman, PhD, RTI International

Slide 2

Process Overview

Image: A screenshot of a figure showing the life cycle of CER activities, spanning topic generation, evidence synthesis and implementation, dissemination, and future research.

Slide 3

Steps in AHRQ EPC Approach to Grading SOE

  • Grade evidence for each outcome, aggregated across studies, separately for RCTs and observational studies.
  • Score 4 required domains:
    • Risk of bias.
    • Consistency.
    • Directness.
    • Precision.
  • Consider, and possibly score, 4 additional domains:
    • Dose-response association.
    • Plausible confounding.
    • Strength of association.
    • Publication bias.
  • Combine into a single SOE grade.

Slide 4

Risk of Bias Domain Score

  • Concerns both study design and study conduct for individual studies.
  • Assesses the aggregate quality or risk of bias of studies separately for RCTs and observational studies and integrates those assessments into an overall risk of bias score.
  • Scores: high, medium, or low:
    • High risk of bias lowers SOE grade.
    • Low risk of bias raises SOE grade.

Slide 5

Consistency Domain Score

  • Degree of similarity in the effect sizes of different studies within the evidence base.
  • Consistent: same direction of effect (same side of "no effect") and narrow range of effect sizes.
  • Inconsistent: non-overlapping confidence intervals, significant unexplained clinical or statistical heterogeneity, etc.
  • Unknown or not applicable: single study, so consistency cannot be assessed.
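As a rough screening aid for the consistency criteria above, the sketch below checks two mechanical signals: whether point estimates fall on the same side of "no effect" and whether the confidence intervals overlap. It is a minimal illustration under stated assumptions, not an AHRQ or GRADE tool: the `RatioEstimate` type and the labels it returns are hypothetical, it assumes ratio measures (RR, OR, HR) with a null value of 1.0, and it does not assess the "narrow range of effect sizes" or clinical heterogeneity, which remain reviewer judgments.

```python
from dataclasses import dataclass


@dataclass
class RatioEstimate:
    """Hypothetical container: a ratio effect estimate (RR, OR, or HR) and its 95% CI."""
    point: float
    lower: float
    upper: float


def same_side_of_no_effect(estimates, null=1.0):
    """True if every point estimate falls strictly on the same side of the null value."""
    return (all(e.point > null for e in estimates)
            or all(e.point < null for e in estimates))


def cis_overlap(estimates):
    """True if some value lies inside every CI (for intervals this equals pairwise overlap)."""
    return max(e.lower for e in estimates) <= min(e.upper for e in estimates)


def crude_consistency(estimates):
    """Screening label only; it does not check the range of effect sizes or heterogeneity."""
    if len(estimates) < 2:
        return "unknown (single study)"
    if not cis_overlap(estimates):
        return "inconsistent"
    if same_side_of_no_effect(estimates):
        return "likely consistent"
    return "judgment needed"


# The two retrospective cohort results from Challenge 4 (HR and RR mixed, as on the slides).
cohorts = [RatioEstimate(1.94, 1.32, 2.83), RatioEstimate(1.0, 0.6, 1.71)]
print(crude_consistency(cohorts))  # → judgment needed
```

The Challenge 4 cohorts have overlapping intervals but point estimates that do not share a side of the null, so the sketch defers to judgment; the presenters' table scores that evidence as Inconsistent.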

Slide 6

Directness Domain Score

  • Whether evidence reflects a single, direct link between the intervention of interest and the ultimate health outcome under consideration.
  • Direct: single direct link between the intervention and health outcome.
  • Indirect: evidence relies on:
    • Surrogate or proxy outcomes.
    • More than one body of evidence (no head-to-head studies).

Slide 7

Precision Domain Score

  • Degree of certainty surrounding the effect estimate for a specific outcome.
  • Precise: estimate allows a clinically useful decision.
  • Imprecise: confidence interval is so wide that it could include clinically distinct (even conflicting) conclusions.

Slide 8

Additional "Discretionary" Domains

  • Dose-response association (pattern of larger effect with greater exposure): present, not present, NA.
  • Plausible confounders (confounding that would work in the opposite direction, i.e., would "weaken" the observed effect): present, absent.
  • Strength of association (an effect so large that it cannot have occurred solely as a result of bias from confounders): strong, weak.
  • Publication bias: not formally scored.
  • Unlike GRADE, applicability is considered separately.

Slide 9

Integrating Domain Scores Into an SOE Grade

  • EPCs can use different approaches to incorporating multiple domains into an overall strength of evidence grade:
    • GRADE algorithm.
    • EPC's own weighting system.
    • A qualitative approach.
  • Evaluation needs to be made by (at least) 2 reviewers.
  • Must document approach used.
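To make the "GRADE algorithm" option above concrete, here is a minimal, simplified sketch of GRADE-style integration: RCT evidence starts at High, observational evidence starts at Low, each domain can rate down by 1 (serious) or 2 (very serious) levels, and factors such as a large effect can rate up. The function name, domain keys, and numeric encoding are illustrative assumptions, not part of GRADEpro or the AHRQ guidance, and real grading still requires the documented judgment of at least two reviewers.

```python
LEVELS = ["Very low", "Low", "Moderate", "High"]
START = {"rct": 3, "observational": 1}  # RCTs start at High, observational studies at Low


def grade_certainty(design, downgrades, upgrades=None):
    """Simplified GRADE-style integration of domain judgments (illustrative only).

    downgrades: domain name -> levels rated down (0, 1 = serious, 2 = very serious)
    upgrades:   domain name -> levels rated up (e.g., large effect, dose-response)
    """
    upgrades = upgrades or {}
    for judgments in (downgrades, upgrades):
        for name, steps in judgments.items():
            if steps not in (0, 1, 2):
                raise ValueError(f"{name}: expected 0, 1, or 2 levels, got {steps}")
    level = START[design] - sum(downgrades.values()) + sum(upgrades.values())
    # Clamp to the four available levels.
    return LEVELS[max(0, min(level, len(LEVELS) - 1))]


print(grade_certainty("rct", {"risk_of_bias": 1, "imprecision": 1}))  # → Low
print(grade_certainty("observational", {}, {"large_effect": 2}))    # → High
```

The sketch returns GRADE labels; AHRQ's lowest category, Insufficient, is defined differently from GRADE's Very low, so it is not a simple relabeling of the output.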

Slide 10

AHRQ and GRADE Grading Categories

  • High:
    • AHRQ (High): High confidence that the evidence reflects the true effect.
    • GRADE (High): Very confident that the true effect lies close to that of the estimate of the effect.
  • Moderate:
    • AHRQ (Moderate): Moderate confidence that the evidence reflects the true estimate of effect.
    • GRADE (Moderate): Moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect.
  • Low:
    • AHRQ (Low): Low confidence that the evidence reflects the true effect.
    • GRADE (Low): Limited confidence in the effect estimate: the true effect may be substantially different.
  • Lowest category:
    • AHRQ (Insufficient): Evidence either is unavailable or does not permit a conclusion.
    • GRADE (Very Low): Very little confidence in the effect estimate: the true effect is likely to be substantially different.

Slide 11

Challenge 1: CER of Benefits, 1 Study, No Meta-analysis or CIs

  • Topic: Antidepressant medication response in the elderly.
  • Evidence description: 1 fair quality RCT (N = 108). Outcome evaluated through 2 validated scales that are clinician administered.
  • Scale 1: Results reported in bar graph only (p = 0.03).
  • Scale 2: Results reported in bar graph only (p = 0.04).

Number of studies; subjects | Risk of bias | Consistency | Directness | Precision | SOE grade
RCT: 1; 108 | Medium | Unknown | Direct | ?? | ??

Slide 12

Challenge 1: Precision Score

  • AHRQ/GRADE approach: Precise.
  • AHRQ approach: Imprecise.
  • GRADE approach: Imprecision Serious (-1).
  • GRADE approach: Imprecision Very Serious (-2).

Slide 13

Challenge 1: Strength of Evidence Grade

  • AHRQ/GRADE approach: High.
  • AHRQ/GRADE approach: Moderate.
  • AHRQ/GRADE approach: Low.
  • AHRQ approach: Insufficient.
  • GRADE approach: Very low.

Slide 14

Challenge (1)-Response

  • Rules for precision:
    • Based on the CI, the number of events, and the effect size:
      • Not perfect, but good guides.
      • The judgment is simple and possible for this example.
      • With only 108 participants, downgrade for imprecision, possibly by two levels, unless the effect is huge (and the effect size is exactly what we would need for this evaluation).
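To see why 108 participants usually calls for downgrading unless the effect is large, the sketch below computes the smallest standardized mean difference such a trial could reliably detect. The assumptions are ours, not the slides': a continuous outcome (as with the two rating scales), two equal arms, a normal approximation, two-sided alpha of 0.05, and 80% power (the same alpha and beta used in Figure 4). The function name is hypothetical.

```python
import math
from statistics import NormalDist


def min_detectable_smd(total_n, alpha=0.05, power=0.80):
    """Smallest standardized mean difference (Cohen's d) a two-arm trial can detect.

    Assumes two equal arms, a continuous outcome, and a normal approximation.
    """
    n_per_arm = total_n / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * math.sqrt(2 / n_per_arm)


print(f"Minimum detectable SMD with N = 108: {min_detectable_smd(108):.2f}")
```

Under these assumptions the result is roughly 0.54, a moderate-to-large effect; smaller true differences would often be missed, which is consistent with the advice to downgrade for imprecision.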

Slide 15

Creating a New GRADEpro File

Image: GRADEprofiler software is shown.

Slide 16

GRADEprofiler software page: Defining a Health Care Question

Image: Screen shot of the GRADEprofiler software Defining a Health Care Question page is shown.

Slide 17

Optimal Information Size

  • We suggest the following: if the total number of patients included in a systematic review is less than the number of patients generated by a conventional sample size calculation for a single adequately powered trial, consider rating down for imprecision. Authors have referred to this threshold as the "optimal information size" (OIS).
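The "conventional sample size calculation" behind the OIS can be sketched for a dichotomous outcome as follows. This is a minimal illustration, assuming the standard two-proportion formula with two equal arms, a two-sided alpha of 0.05, and 80% power (beta of 0.2, as in Figure 4); the function names and the example control risk and relative risk reduction (RRR) are hypothetical inputs, not values from the slides.

```python
import math
from statistics import NormalDist


def ois_per_group(control_risk, rrr, alpha=0.05, power=0.80):
    """Patients per arm from a conventional two-proportion sample size formula."""
    p1 = control_risk              # control event rate
    p2 = control_risk * (1 - rrr)  # intervention event rate implied by the RRR
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    root = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(root ** 2 / (p1 - p2) ** 2)


def optimal_information_size(control_risk, rrr, alpha=0.05, power=0.80):
    """Total patients (both arms) for a single adequately powered trial."""
    return 2 * ois_per_group(control_risk, rrr, alpha, power)


def consider_rating_down(total_patients, control_risk, rrr):
    """True if the review's total sample falls short of the OIS."""
    return total_patients < optimal_information_size(control_risk, rrr)


# Hypothetical inputs: 25% control risk, 30% relative risk reduction.
print(optimal_information_size(0.25, 0.30))
```

With these inputs the OIS comes to roughly 930 patients; a review pooling fewer patients would prompt consideration of rating down.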

Slide 18

For Systematic Reviews

  • If the 95% CI excludes a relative risk (RR) of 1.0 and the total number of events or patients exceeds the OIS criterion, precision is adequate. If the 95% CI includes appreciable benefit or harm (we suggest an RR under 0.75 or over 1.25 as a rough guide), rating down for imprecision may be appropriate even if OIS criteria are met.
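The rule of thumb above can be written as a checklist that returns reasons to consider rating down. This is a sketch, not a GRADEpro feature: the function name is hypothetical, the OIS value is supplied by the caller, and the 0.75 and 1.25 bounds are the rough guide quoted above. An empty list corresponds to the "precision is adequate" case; any returned reason is a prompt for judgment rather than an automatic downgrade.

```python
def imprecision_concerns(rr_lower, rr_upper, total_patients, ois,
                         benefit_bound=0.75, harm_bound=1.25):
    """List reasons to consider rating down for imprecision.

    An empty list means: the CI excludes RR = 1.0, the OIS is met, and the CI
    stays between the appreciable-benefit and appreciable-harm bounds.
    """
    concerns = []
    if total_patients < ois:
        concerns.append("total sample below OIS")
    if rr_lower <= 1.0 <= rr_upper:
        concerns.append("95% CI includes RR = 1.0")
    if rr_lower < benefit_bound or rr_upper > harm_bound:
        concerns.append("95% CI includes appreciable benefit or harm")
    return concerns


# Hypothetical review: RR 95% CI 0.60 to 1.10, 2,000 patients, OIS of 932.
print(imprecision_concerns(0.60, 1.10, 2000, 932))
```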

Slide 19

Figure 4: Optimal Information Size Given Alpha of 0.05 and Beta of 0.2 for Varying Control Event Rates and Relative Risks

Image: A line graph shows optimal information size given alpha of 0.05 and beta of 0.2 for varying control event rates and relative risks.

For any chosen line, the evidence meets the optimal information size criterion if the sample size lies above the line.

Slide 20

Graph

Total number of events | Relative risk reduction | Implications for meeting OIS threshold
100 or fewer | ≤ 30% | Will almost never meet threshold, whatever the control event rate
200 | 30% | Will meet threshold for control group risks of ~25% or greater
200 | 25% | Will meet threshold for control group risks of ~50% or greater
200 | 20% | Will meet threshold only for control group risks of ~80% or greater
300 | ≥ 30% | Will meet threshold
300 | 25% | Will meet threshold for control group risks of ~25% or greater
300 | 20% | Will meet threshold for control group risks of ~60% or greater
400 or more | ≥ 25% | Will meet threshold for any control group risk
400 or more | 20% | Will meet threshold for control group risks of ~40% or greater
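The event counts in the table can be approximated by asking how many events a trial of exactly the OIS would be expected to produce. The sketch below assumes the same conventional two-proportion calculation (two equal arms, two-sided alpha 0.05, power 0.80) and that the assumed control and intervention risks hold; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def ois_per_group(control_risk, rrr, alpha=0.05, power=0.80):
    """Patients per arm from a conventional two-proportion sample size formula."""
    p1, p2 = control_risk, control_risk * (1 - rrr)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    root = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(root ** 2 / (p1 - p2) ** 2)


def expected_events_at_ois(control_risk, rrr, alpha=0.05, power=0.80):
    """Total events across both arms if the OIS were enrolled and the assumed risks held."""
    n = ois_per_group(control_risk, rrr, alpha, power)
    treated_risk = control_risk * (1 - rrr)
    return n * control_risk + n * treated_risk


for cer, rrr in [(0.25, 0.30), (0.50, 0.25), (0.80, 0.20), (0.40, 0.20)]:
    print(f"control risk {cer:.0%}, RRR {rrr:.0%}: ~{expected_events_at_ois(cer, rrr):.0f} events")
```

For example, the sketch gives roughly 200 events for a 25% control risk with a 30% RRR and roughly 400 events for a 40% control risk with a 20% RRR, in line with the corresponding rows of the table.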

Slide 21

Challenge 2: CER of Harms

Mixed Outcomes & Mixed Results from RCTs and Obs Studies

  • Topic: Risk of suicidality from antidepressants.
  • Evidence description:
    • RCT: 1 fair quality study:
      • Suicidal ideation worse with Drug B (p = 0.03).
    • Case control: 1 fair quality study (N = 1300):
      • Non-fatal suicidal behavior: Drug A (OR = 1.16); Drug B (OR = 1.29).
      • Overlapping confidence intervals comparing each with Drug C.
    • Nested case control: 1 good quality study (N = 10,000):
      • Completed suicides in adjusted analyses (p = NS).

Slide 22

Challenge 2: CER of Harms

Mixed Outcomes & Mixed Results From RCTs and Obs Studies

Number of studies; subjects | Risk of bias | Consistency | Directness | Precision | SOE grade
RCT: 1; 90 | Medium | NA | ?? | Imprecise | ??
Case control: 2; 11,400 | Medium | Consistent | ?? | Imprecise | (blank)

Slide 23

Challenge 2: Directness Score RCTs

  • AHRQ/GRADE approach: Direct.
  • AHRQ approach: Indirect.
  • GRADE approach: Serious indirectness (-1).
  • GRADE approach: Very serious indirectness (-2).

Slide 24

Challenge 2: Directness Score Observational Studies

  • AHRQ/GRADE approach: Direct.
  • AHRQ approach: Indirect.
  • GRADE approach: Serious indirectness (-1).
  • GRADE approach: Very serious indirectness (-2).

Slide 25

Challenge 2: Strength of Evidence Grade

  • AHRQ/GRADE approach: High.
  • AHRQ/GRADE approach: Moderate.
  • AHRQ/GRADE approach: Low.
  • AHRQ approach: Insufficient.
  • GRADE approach: Very low.

Slide 26

Challenge (2)-Response

  • Indirect comparison:
    • Downgrade:
      • An observational study can provide more direct evidence.
      • You need to work through the full framework to find that out.

Slide 27

Challenge 3: CER of Benefits, RCTs Found No Difference Between Treatments

  • Topic: Medication response.
  • Evidence description: 5 fair quality RCTs; the number of participants per study ranges from 90 to 200; in each study, p = NS.
  • Meta-analysis pooled risk ratio: 1.03 (95% CI, 0.92-1.16).

Number of studies; subjects | Risk of bias | Consistency | Directness | Precision | SOE grade
RCT: 5; 690 | Medium | Consistent | Direct | Precise | ??

Slide 28

Challenge 3: Strength of Evidence Grade

  • AHRQ/GRADE approach: High.
  • AHRQ/GRADE approach: Moderate.
  • AHRQ/GRADE approach: Low.
  • AHRQ approach: Insufficient.
  • GRADE approach: Very low.

Slide 29

Challenge 3: Are the treatments equivalent for this outcome?

  1. Yes.
  2. No.
  3. Don't know.

Slide 30

Challenge (3)-Response

  • Superiority, inferiority and non-inferiority depend on more than one outcome.
  • A threshold must be specified: if the threshold is met, the evidence is not imprecise; if it is not met, the evidence is imprecise.
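The threshold logic in this response can be sketched by comparing the pooled CI with a prespecified range of "no important difference". The threshold values below are hypothetical choices for illustration; the slides deliberately leave the threshold to the reviewers, and the function name is ours.

```python
def compare_with_thresholds(ci_lower, ci_upper, lower_threshold, upper_threshold):
    """Classify a ratio CI against a prespecified range of 'no important difference'."""
    if lower_threshold <= ci_lower and ci_upper <= upper_threshold:
        return "within"    # every value in the CI is unimportant: not imprecise
    if ci_upper < lower_threshold or ci_lower > upper_threshold:
        return "outside"   # every value in the CI is an important difference
    return "crosses"       # CI straddles a threshold: imprecise


pooled = (0.92, 1.16)  # Challenge 3 pooled risk ratio, 95% CI
print(compare_with_thresholds(*pooled, 0.75, 1.25))  # hypothetical wide margins → within
print(compare_with_thresholds(*pooled, 0.90, 1.10))  # hypothetical narrow margins → crosses
```

The same pooled estimate supports a conclusion of equivalence under one threshold and is imprecise under another, which is why the threshold has to be stated explicitly.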

Slide 31

Figure 1, Rating Down for Imprecision in Guidelines: Thresholds are Key

Image: A graph shows a mortality estimate with its confidence interval, together with decision thresholds that depend on side effects.

Slide 32

Challenge 4: CER of Serious Harms, Mixed Findings in RCTs and Observational Studies

  • Topic: Serious infection from rheumatoid arthritis treatments.
  • Evidence description:
    • RCTs: 4 fair quality studies. Number of participants ranges from 80 to 531. Number of serious infections reported for each treatment; very rare event. In each study, p = NS.
    • Retrospective cohort study 1: fair quality (N = 5,326). Hospitalization with a definite bacterial infection: higher for Treatment A. Adjusted HR = 1.94 (95% CI, 1.32 to 2.83).
    • Retrospective cohort study 2: good quality/low risk of bias (N = 2,369). Adjusted rate of serious bacterial infection: RR = 1.0 (95% CI, 0.6 to 1.71).

Slide 33

Challenge 4: CER of Serious Harms, Mixed Findings in RCTs and Observational Studies

Number of studies; subjects | Risk of bias | Consistency | Directness | Precision | SOE grade
RCT: 4; 1,215 | Medium | Consistent | Direct | Precise | ??
Retrospective cohort: 2; 7,695 | ?? | Inconsistent | Direct | Imprecise | (blank)

Slide 34

Challenge 4: Risk of Bias Score

  • AHRQ/GRADE approach: Low risk of bias.
  • AHRQ approach: Medium risk of bias.
  • AHRQ approach: High risk of bias.
  • GRADE approach: Serious risk of bias (-1).
  • GRADE approach: Very serious risk of bias (-2).

Slide 35

Challenge 4: Strength of Evidence Grade

  • AHRQ/GRADE approach: High.
  • AHRQ/GRADE approach: Moderate.
  • AHRQ/GRADE approach: Low.
  • AHRQ approach: Insufficient.
  • GRADE approach: Very low.

Slide 36

Challenge (4)-Response

  • Work sequentially:
    • Use the evidence that is of higher quality:
      • Mention the observational evidence in a footnote.

Slide 37

Challenge 5: Can You Use Less Stringent Criteria to Evaluate Risk of Bias If the Outcome Without Treatment is Likely to Result in Death?

  • Topic: use of hematopoietic stem cell transplantation (HSCT), also known as bone marrow transplantation.
  • Low risk of bias modified to mean: the natural history (or severity) of the disease made spontaneous remission highly unlikely or impossible.
  • Evidence description:
    • For single HSCT for Wolman's disease: in the natural history of this disease, death occurs by approximately 6 months of age. Of five cases reported in the evidence, three patients were alive at 4 to 11 years' follow-up, with normal function and attending school. The strength of the body of evidence is high.

Slide 38

Challenge 5: Do You Agree That It Would Be Appropriate to Use Less Stringent Criteria to Evaluate Risk of Bias Under These Circumstances?

  1. Yes.
  2. No.
  3. Don't know.

Slide 39

Challenge 5: Do You Agree That It Would Be Appropriate to Use Less Stringent Criteria to Evaluate Risk of Bias Under These Circumstances?

  • One reviewer commented that, rather than modifying Risk of Bias criteria, "the SOE system does allow consideration of other factors through the 'optional domains' if applied correctly." These optional domains are:
    • Dose-response association.
    • Plausible confounding that would decrease observed effect.
    • Strength of association (magnitude of effect).
    • Publication bias.
  • Do you agree?

Slide 40

Challenge (5)-Response

  • Particular design features of extremely rigorous well-conducted observational studies may warrant consideration for rating up quality of evidence. For instance, a case-control study found that sigmoidoscopy was associated with a reduction in colon cancer mortality for lesions in range of the sigmoidoscope (OR 0.30, 95% CI 0.19 to 0.48), but not beyond the range of the sigmoidoscope (OR 0.96, 95% CI 0.61 to 1.50). Possible bias because of unmeasured confounders should have been very similar if not identical in the two situations, considerably raising confidence in the causal effect of the sigmoidoscopy.

Slide 41

Challenge (5)-Response

  • Furthermore, when considering rating up the quality of evidence for magnitude of effect, relevant factors include the rapidity of treatment response and the previous underlying trajectory of the condition. For example, we feel confident that hip replacement has a large effect not only because of the size of the treatment response, but because the natural history of hip osteoarthritis is a progressive deterioration that surgery rapidly and uniformly reverses. The rapidity of response compared to the known trajectory of the condition can also be considered (and calculated) as a large effect size.
  • An additional factor mitigating the problem of rating up the quality because of a large effect is that indirect evidence usually provides further support for large treatment effects. For example, oral anticoagulation in mechanical heart valves has not been compared to placebo in an RCT, but evidence from observational studies suggests a large effect of oral anticoagulation in decreasing thromboembolic events. Supplementary indirect evidence from randomized trials that have demonstrated large reductions in the relative risk of thrombosis with anticoagulation in analogous conditions such as atrial fibrillation further increases our confidence in the beneficial effect of anticoagulation.
  • Similarly, the effectiveness of antibiotic prophylaxis in a variety of other situations supports observational studies that suggest that antibiotic prophylaxis results in an 89% relative risk reduction in meningococcal disease in contacts of patients who have suffered the illness.
  • Another situation allows an inference of a strong association without a formal comparative study. Consider the question of the impact of routine colonoscopy versus no screening for colon cancer on the rate of perforation associated with colonoscopy. Here, a large series of representative patients undergoing colonoscopy will provide high quality evidence on the risk of perforation associated with colonoscopy. When control rates are near 0 (i.e., we are certain that the incidence of spontaneous colon perforation in patients not undergoing colonoscopy is very low), case series of representative patients (one might call these cohort studies of affected patients if they include large numbers of patients) can provide high quality evidence of adverse effects associated with an intervention, thereby allowing us to infer a strong association from even a limited number of events. One should not confuse the situation highlighted in the previous example with isolated case reports of associations between exposures and rare adverse outcomes (as have, for instance, been reported with vaccine exposure).
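For rating up on magnitude of effect, GRADE guidance (not these slides) offers rough cut-offs: a "large" effect is an RR above 2 or below 0.5, and a "very large" effect is an RR above 5 or below 0.2, applied mainly to observational evidence without serious threats to validity. The sketch below encodes those cut-offs as an assumption; the function name is hypothetical, and the cut-offs are guides for judgment, not rules.

```python
def magnitude_of_effect(rr):
    """Label a relative effect using GRADE's rough rating-up cut-offs (assumed here)."""
    if rr <= 0:
        raise ValueError("relative risk must be positive")
    # Put benefit and harm on one scale: RR < 0.5 is equivalent to 1/RR > 2.
    ratio = rr if rr >= 1 else 1 / rr
    if ratio > 5:
        return "very large"   # RR > 5 or RR < 0.2
    if ratio > 2:
        return "large"        # RR > 2 or RR < 0.5
    return "not large"


rr_meningococcal = 1 - 0.89   # 89% relative risk reduction -> RR of about 0.11
print(magnitude_of_effect(rr_meningococcal))  # → very large
```

The 89% relative risk reduction for antibiotic prophylaxis corresponds to an RR of about 0.11, well past the "very large" cut-off, which is one reason the supporting indirect evidence matters so much in that example.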

Slide 42

Challenge 6: Challenges in Using GRADEpro

  • "I find it challenging to use GRADEpro to grade the body of evidence for non-RCTs and unpooled data."
  • Comments?

Slide 43

Challenge (6)

  • Response:
    • GRADEpro has been updated for observational studies.
    • Unpooled data: use a headcount only as a last resort; qualitative judgments (e.g., about inconsistency and imprecision) can still be made as long as they are transparent.
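When results cannot be pooled, a headcount can at least be made transparent by tallying where each study's confidence interval falls relative to no effect. This is a minimal sketch, assuming ratio measures with a null of 1.0; the function name and category labels are hypothetical. A headcount ignores effect size and study weight, so it should only support, not replace, the qualitative judgments about inconsistency and imprecision.

```python
from collections import Counter


def headcount(confidence_intervals, null=1.0):
    """Tally studies by where their ratio CI lies relative to the null value."""
    counts = Counter()
    for lower, upper in confidence_intervals:
        if upper < null:
            counts["below null"] += 1      # CI excludes the null, below it
        elif lower > null:
            counts["above null"] += 1      # CI excludes the null, above it
        else:
            counts["includes null"] += 1   # CI spans the null
    return counts


# Hypothetical unpooled studies, each given as (lower, upper) 95% CI bounds.
print(headcount([(0.5, 0.9), (1.1, 1.5), (0.8, 1.2), (0.7, 0.95)]))
```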

Slide 44

Challenge 7

Current grading schemes are not amenable to healthcare quality improvement (QI) studies because:

  • They may only distinguish between RCTs and "all other" types of studies.
  • They may not distinguish quality of studies within RCTs and other types of study designs.
  • They do not have a way to appropriately grade external validity, which is critically important in QI studies.
  • Comments?

Slide 45

Challenge 7-Response

  • They may only distinguish between RCTs and "all other" types of studies:
    • GRADE requires explicit judgments about the confidence in effect estimates for any study design. Randomization is just one criterion, applied early in the process, because it is the key method of protecting against bias.
  • They may not distinguish quality of studies within RCTs and other types of study designs:
    • GRADE's explicit judgments do make this distinction.
  • They do not have a way to appropriately grade external validity, which is critically important in QI studies:
    • Judgments about directness accomplish this through PICO, where P includes the setting.

Slide 46

More Information

Holger Schünemann
Chair and Professor
Department of Clinical Epidemiology and Biostatistics
McMaster University
schuneh@mcmaster.ca

Nancy Berkman
Senior Health Policy Research Analyst
Program on Healthcare Quality and Outcomes
berkmanschuenemann@rti.org

Page last reviewed March 2012
Internet Citation: Facing Challenging Situations When Grading Strength of Evidence (Text Version). March 2012. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/news/events/conference/2011/berkman-schunemann/index.html