Background Report for the Request for Public Comment on Initial, Recommended Core Set of Children's Healthcare Quality Measures for Voluntary Use by Medicaid and CHIP Programs
Appendix A-6. Description of the Modified Delphi Process to Rate and Select Valid, Feasible Quality Measures, June 30, 2009
Initial Subcommittee Ratings
A draft of the children's healthcare quality measures identified as being in use by State Medicaid and CHIP programs, a summary of the evidence supporting those measures, and measure rating sheets will be sent to all members of the expert panel by June 30 via e-mail. Subcommittee members are asked to return their scoring sheets by Friday, July 10. Each panelist is asked to rate each measure on a nine-point scale along two dimensions: validity and feasibility.
Validity is the degree to which a quality measure is associated with what it purports to measure (e.g., a clinical decision support system (CDS) is a measure of structure or capacity; prescribing is a measure of a clinical process; asthma exacerbations are a measure of health outcomes).
A quality measure should be considered valid if:
- There is adequate scientific evidence or, where evidence is insufficient, expert professional consensus to support the stated relationship1 between:
- Structure and process (e.g., that there is a demonstrated likelihood that a clinical decision support system (a structural or capacity measure) in a hospital or ambulatory office leads to increased rates of appropriate flu vaccination in the hospital or practice),
- Structure and outcome (e.g., higher continuity of care in the outpatient setting (influenced by how appointments are organized) is associated with fewer ambulatory care sensitive hospitalizations, such as hospitalizations for dehydration), or
- Process and outcome (e.g., that there is a demonstrated likelihood that prescribing inhaled corticosteroids (a clinical process) to specified patients with asthma will improve those patients' outcomes), and vice versa (e.g., that if quality is measured as a health outcome, there is sufficient demonstrated likelihood that the outcome can be attributed to health care delivery structures, clinical processes of care, or an explicit combination of both).
- The healthcare system can be said to be responsible for performance and/or the related health outcome. The majority of the factors that determine adherence to a measure are under the control of the clinician, clinic, hospital, health plan, or the Medicaid or CHIP program subject to measurement.
Ratings of 1 to 3 mean that the measure is not a valid criterion for evaluating quality; ratings of 4 to 6 mean that the measure is an uncertain or equivocal criterion for evaluating quality; and ratings of 7 to 9 mean that the measure is clearly a valid criterion for evaluating quality.
A quality measure should be considered feasible if:
- The information necessary to determine adherence to the measure is likely to be found in available data sources, e.g., administrative billing data, medical records, or routinely collected survey data;
- A proxy for the feasibility of implementing a given quality measure is the number of States routinely reporting performance on the measure (see Table 2), with highly feasible measures being reported by more States; and
- Estimates of adherence to the measure based on available data sources are likely to be reliable and unbiased.
- Reliability is the degree to which the measure is free from random error.
Ratings of 1 to 3 mean that it is not feasible to find the information necessary to reliably score the measure using the available data sources; ratings of 4 to 6 mean that there will be considerable variability in the feasibility of finding the necessary information to reliably score the measure; and ratings of 7 to 9 mean that it is clearly feasible to find the information necessary to reliably score the measure.
1 Structure of care is a feature of a healthcare organization or clinician relevant to its capacity to provide health care. A process of care is a health care service provided to, on behalf of, or by a patient appropriately based on scientific evidence of efficacy or effectiveness. An outcome of care is a health state of a person resulting from health care.
The Nine-Point Scale
The nine-point scale has been used for more than two decades at RAND in developing explicit measures for evaluating appropriateness and quality.i Essentially, these methods require individuals who rate quality measures to place them into one of three categories (valid criterion for quality, equivocal criterion for quality, invalid criterion for quality), and each category can be rated on a three-point scale to allow for some variation within category. The scale is ordinal, so a 9 is better than an 8, and so on. Because quantities (e.g., risk-benefit ratios) are not assigned to each number on the scale, the difference between an 8 and a 9 is not necessarily the same as the difference between a 5 and a 6. Explicit ratings are used because in small groups some members tend to dominate the discussion, and this can lead to a decision that does not reflect the sense of the group.ii
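The three-region structure of the scale can be sketched in code. This is an illustrative mapping only (the function name and error handling are our own), assuming the regions 1-3, 4-6, and 7-9 described above:

```python
def rating_category(rating):
    """Map a nine-point rating to its category.
    Regions from the text: 1-3 invalid, 4-6 equivocal, 7-9 valid."""
    if not 1 <= rating <= 9:
        raise ValueError("rating must be between 1 and 9")
    return ("invalid", "equivocal", "valid")[(rating - 1) // 3]
```

Because the scale is ordinal rather than interval, only the region boundaries matter here; no arithmetic on the ratings themselves is implied.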
As pre-work, Subcommittee members are asked to individually rate the measures in Table 1 for validity and feasibility and return this document to AHRQ in care of Denise Dougherty, Ph.D. (firstname.lastname@example.org), no later than Friday, July 10, 2009.
Once Dr. Dougherty receives your ratings, AHRQ staff will summarize the results. You are encouraged to comment on all of the measures. In general, we find two main reasons for disagreement among groups of people who rate quality measures using the RAND modified Delphi method: unclear measure language or unclear scientific evidence. If a measure is poorly written, group members may disagree on validity because they are thinking of different patient scenarios. In such circumstances, we will work during the Subcommittee meeting in July to revise the measure language to more clearly state the intent. If a measure is based on unclear or mixed scientific evidence it may not be an acceptable quality measure to include in the Core Set.
Subcommittee Meeting July 22-23, 2009
At the July meeting you will receive a summary of the validity and feasibility ratings of all Subcommittee members. You will receive an individualized summary printout of the ratings, with the distribution of all Subcommittee member ratings displayed above the rating line. Below the rating line, a caret (ˆ) will mark your own rating for each measure. For example, for Measure 7 in Figure 2, two individuals rated validity as "9," six individuals rated validity as "7," and one person gave a rating of "6." If this were your rating sheet, the rating you assigned for validity would have been "9," which is indicated by the caret (ˆ). Measures with multiple subparts (e.g., Measure 10 in Figure 2) receive a separate rating for each part.
We will discuss the validity and feasibility ratings for those measures having a large degree of disagreement among Subcommittee members. We will then review, discuss, and re-rate the measures as needed. Discussions will be organized by quality measure groups and will be designed to ensure that we have a common understanding of the content and application of each measure. The panel co-chairs, Drs. Mangione-Smith and Schiff, will lead this discussion of the measures.
Analysis of the Measure Ratings
The median is used to measure the central tendency of the Subcommittee members' ratings, and the mean absolute deviation from the median is used to measure the dispersion of the ratings. The final disposition of each measure is based on its median validity and feasibility scores. To be included in the final core set, a measure must have a median rating of 7 to 9 on validity and 4 to 9 on feasibility (Table 4).
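As a sketch, the summary statistics and the inclusion rule described above can be computed as follows. The function name is hypothetical; only the median rule (validity 7 to 9 and feasibility 4 to 9) and the choice of dispersion statistic are taken from the text:

```python
from statistics import median

def measure_disposition(validity_ratings, feasibility_ratings):
    """Summarize panel ratings for one measure.
    Central tendency: the median.  Dispersion: the mean absolute
    deviation from the median, as described in the text."""
    med_v = median(validity_ratings)
    med_f = median(feasibility_ratings)
    mad_v = sum(abs(r - med_v) for r in validity_ratings) / len(validity_ratings)
    # Inclusion rule: median validity 7-9 AND median feasibility 4-9.
    included = 7 <= med_v <= 9 and 4 <= med_f <= 9
    return med_v, mad_v, included
```

For example, validity ratings of [7, 8, 9, 7, 8] with feasibility ratings of [5, 6, 4, 5, 7] yield a median validity of 8, a mean absolute deviation of 0.6, and an included disposition.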
| Median Validity Rating | Median Feasibility Rating | Measure Disposition |
| --- | --- | --- |
| 7 to 9 | 4 to 9 | Included in the core set |
| Any other combination | Any other combination | Excluded from the core set |
To determine agreement and disagreement among panelists, we use a statistical definition that can be applied regardless of the number of ratings available. This approach frames the definitions of "agreement" and "disagreement" in terms of hypotheses about the distribution of ratings in a hypothetical population of repeated ratings by similarly selected individuals.
For agreement we test the hypothesis that 80 percent of the hypothetical population of repeated ratings are within the same region (1-3, 4-6, 7-9) as the observed median rating. If we are unable to reject that hypothesis on a binomial test at the 0.33 level, we say that the measure is rated "with agreement."
For disagreement, we test the hypothesis that 90 percent of the hypothetical population of repeated ratings are within one of two extra-wide regions (1-6 or 4-9). If we have to reject the hypothesis on a binomial test at the 0.10 level, we conclude that the measure is rated "with disagreement." Finally, if the ratings cannot be classified as "with agreement" or "with disagreement," then they are classified as "indeterminate."
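The agreement and disagreement tests above can be sketched in code. The following is an illustrative implementation using exact one-sided binomial tail probabilities; the function names, the ordering of the checks (disagreement first), and the handling of non-integer medians are our own assumptions, not part of the source method:

```python
from math import comb
from statistics import median

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(int(k) + 1))

def classify_agreement(ratings):
    """Classify one measure's ratings as 'agreement', 'disagreement',
    or 'indeterminate' per the statistical definitions in the text.

    Agreement: fail to reject H0 that 80% of repeated ratings fall in
    the same region (1-3, 4-6, 7-9) as the observed median (one-sided
    binomial test, alpha = 0.33).
    Disagreement: reject H0 that 90% of repeated ratings fall within
    one extra-wide region (1-6 or 4-9) (alpha = 0.10)."""
    n = len(ratings)
    med = median(ratings)
    same_region = sum(1 for r in ratings if (r - 1) // 3 == (med - 1) // 3)
    wide = max(sum(1 for r in ratings if 1 <= r <= 6),
               sum(1 for r in ratings if 4 <= r <= 9))
    if binom_cdf(wide, n, 0.9) < 0.10:
        return "disagreement"
    if binom_cdf(same_region, n, 0.8) >= 0.33:
        return "agreement"
    return "indeterminate"
```

For instance, nine ratings concentrated in the 7-9 region classify as agreement, while ratings split between the extremes (several 1s and several 9s) fall outside both extra-wide regions often enough to classify as disagreement.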
Importance and Impact of the Selected Quality Measures
Once this initial work of rating and selecting measures based on their validity and feasibility is complete, we will discuss the relative importance and impact of the selected measures.
i Brook RH. The RAND/UCLA appropriateness method. In: McCormack KA, Moore SR, Siegel RA, editors. Clinical Practice Guidelines Development: Methodology Perspectives. Rockville, MD: Agency for Health Care Policy and Research, 1994.
ii McGlynn EA, Kosecoff J, Brook RH. Format and conduct of consensus development conferences: a multi-nation comparison, in Goodman C and Baratz S (eds), Improving Consensus Development for Health Technology Assessment. Washington, D.C.: National Academy Press, 1990.