Selecting Quality and Resource Use Measures: A Decision Guide for Community Quality Collaboratives
Part II. Introduction to Measures of Quality
Data availability and validity are key elements to consider when selecting appropriate quality and resource use measures. It is also important to understand how measures are designed and constructed, how risk adjustment is performed, and how measure developers and endorsers are involved. Questions 7-13 introduce readers to a wide range of topics affecting measures of quality, as well as the national initiatives that are promoting the standardization of measurement.
Question 7. How are quality performance measures constructed, and what are the implications of how their numerators and denominators are specified?
Quality performance measures are constructed in a variety of ways, including proportions or percentages, ratios, means, medians, and counts. Each approach serves a purpose and is appropriate in specific circumstances. Whichever approach is used, the detailed specifications and inclusion and exclusion criteria are typically developed through a painstaking process of discussion with clinical experts and analyses of empirical data. Measures with the same title, but sponsored by different organizations, may have somewhat different properties, as was recently demonstrated for a measure of hospital outcomes known as "failure to rescue."52 Some indicators of hospital mortality exclude patients transferred in from other hospitals, whereas others include such patients.53
Minor but potentially confusing differences in the definitions of process-of-care measures should be reconciled, as The Joint Commission and the Centers for Medicare & Medicaid Services (CMS) have done for their Core Measures of hospital quality. Even the same measurement software, such as the AHRQ Patient Safety Indicators, can generate markedly different results depending on what may seem to be a minor choice, such as whether to turn on or off the option for using "present on admission" flags to identify events.54 Accordingly, community quality collaboratives should be cautious in comparing results over time and across settings as measure specifications change.
Proportions and Percentages
Most quality measures are constructed as proportions or percentages, where the denominator represents the number of persons treated by a health care provider during a defined time period who were at risk of, or eligible for, the numerator event. The numerator then represents the number of persons in the denominator who received the appropriate diagnostic test or treatment (e.g., aspirin for heart attack), or the number who experienced an adverse outcome (e.g., respiratory failure after surgery).
This method of constructing quality measures has several advantages, such as the fact that the range of performance is bounded between 0% and 100%, and the fact that multiple measures can easily be averaged to generate composite measures, as described in Question 10. This proportion/percentage method also facilitates comparison of performance across measures and sites. Its simplicity makes it understandable to consumers and actionable for health care providers; for example, most CAHPS® (Consumer Assessment of Healthcare Providers and Systems) survey questions on patients' experiences with care are transformed from their original form ("how often did your personal doctor...") to a dichotomous form ("always"/"usually" versus any other response), which can then be expressed as the percentage of patients with optimal or near-optimal experience.55
If multiple measures are presented side by side, then the polarity of some measures may need to be adjusted so that a higher percentage is always better (e.g., converting the percentage of patients who report a problem to the percentage who do not report that problem). The major drawback of the proportion/percentage approach is that it ignores interesting variation among those who are categorized as "yes" or "no," such as the relative severity of a complication (e.g., bloodstream infections with or without sepsis), the relative importance of a patient's negative experience, or the timeliness with which an appropriate therapy was provided.
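As a minimal sketch of the proportion/percentage construction and the polarity adjustment described above, the following Python example computes a percentage from a numerator and denominator and then converts a "lower is better" measure into a "higher is better" one. The function names and patient counts are hypothetical and serve only as an illustration.

```python
def proportion_measure(numerator: int, denominator: int) -> float:
    """Percentage of eligible persons (denominator) who had the numerator event."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    if not 0 <= numerator <= denominator:
        raise ValueError("numerator must lie between 0 and the denominator")
    return 100.0 * numerator / denominator


def flip_polarity(percentage: float) -> float:
    """Convert 'percent reporting a problem' to 'percent not reporting that problem'."""
    return 100.0 - percentage


# Hypothetical counts for one provider.
aspirin_rate = proportion_measure(numerator=188, denominator=200)  # higher is better
problem_rate = proportion_measure(numerator=30, denominator=200)   # lower is better
no_problem_rate = flip_polarity(problem_rate)                      # now higher is better

print(aspirin_rate, problem_rate, no_problem_rate)
```

After the flip, both measures can be shown side by side with the same direction of "better," although the dichotomy still hides variation within the "yes" and "no" groups.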
Ratios
A few quality measures are constructed as ratio measures, in which numerator cases may or may not be contained within the denominator. For these ratio measures, the denominator is viewed as the best available proxy for the true population at risk, because that population cannot be enumerated. For example, the AHRQ Prevention Quality Indicators (PQIs) are expressed as hospitalizations per 10,000 residents of the target area per year, but the number of residents of the target area is estimated from previous Census data, which do not fully account for recent in-migration and out-migration.56 The polarity of these measures must be clearly explained to consumers, because it is not immediately apparent whether lower values or higher values are better. In fact, cognitive testing has shown that consumers sometimes interpret higher asthma hospitalization rates as a sign of better care because they are concerned that aggressive health plans keep sick patients out of the hospital to save money.57
The major drawback of this ratio approach is that the denominator may be a poor proxy for the true population at risk. For example, only patients with diabetes are at risk for diabetes-related potentially preventable admissions, but the number of residents in the target area is a poor proxy for the number of diabetic patients. Given that the prevalence of diabetes varies across communities, PQI rates may vary for reasons unrelated to quality of care.
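The effect of a poor denominator proxy can be illustrated with a small Python sketch. It assumes two hypothetical areas in which people with diabetes have exactly the same admission risk but diabetes prevalence differs; all figures are invented for illustration and are not drawn from PQI data.

```python
def rate_per_10k(events: int, population: int) -> float:
    """Events per 10,000 persons in the chosen denominator population."""
    return 10_000 * events / population


# Hypothetical assumption: identical admission risk among people with diabetes.
ADMISSION_RISK_AMONG_DIABETICS = 0.02

areas = {
    "Area A": {"residents": 100_000, "diabetes_prevalence": 0.06},
    "Area B": {"residents": 100_000, "diabetes_prevalence": 0.12},
}

for name, area in areas.items():
    diabetics = round(area["residents"] * area["diabetes_prevalence"])
    admissions = round(diabetics * ADMISSION_RISK_AMONG_DIABETICS)
    # Ratio measure: all residents serve as a proxy for the population at risk.
    area["per_10k_residents"] = rate_per_10k(admissions, area["residents"])
    # Rate among the true population at risk (usually not observable).
    area["per_10k_diabetics"] = rate_per_10k(admissions, diabetics)
    print(name, area["per_10k_residents"], area["per_10k_diabetics"])
```

In this sketch the resident-based rate for Area B is twice that of Area A even though the rate among people with diabetes is identical, which is the kind of variation unrelated to quality that the text warns about.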
Means and Medians
A few quality measures are constructed as mean or median values. For example, one widely used measure of emergency department/hospital care for patients with heart attack is the median time from arrival to administration of fibrinolytic therapy in eligible patients with ST-segment elevation or left bundle branch block on the electrocardiogram performed closest to arrival in an emergency department. The directionality of these measures must be clearly explained to consumers, because it is not always apparent whether lower or higher values represent better care. The advantage of this approach is that mean or median values capture subtleties of care, such as the timeliness of treatment, better than proportion or percentage measures. It may be possible to distinguish differences in performance using mean or median values that could not be distinguished using proportion or percentage measures. However, the drawback of this approach is that it makes the data more difficult to analyze and present, and it is not applicable to most quality measures.
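To show how a median can distinguish performance that a proportion cannot, the following sketch compares two hypothetical hospitals. The treatment times and the 30-minute threshold are assumptions chosen for illustration only.

```python
from statistics import median

THRESHOLD_MINUTES = 30  # hypothetical timeliness target

# Hypothetical minutes from arrival to therapy for five eligible patients each.
minutes_to_therapy = {
    "Hospital A": [12, 15, 18, 20, 45],
    "Hospital B": [25, 27, 28, 29, 45],
}

summary = {}
for hospital, times in minutes_to_therapy.items():
    treated_in_time = sum(t <= THRESHOLD_MINUTES for t in times)
    summary[hospital] = {
        "pct_within_threshold": 100.0 * treated_in_time / len(times),
        "median_minutes": median(times),
    }

for hospital, values in summary.items():
    print(hospital, values)
```

Both hospitals treat 80 percent of patients within the threshold, so the percentage measure cannot separate them, while the medians (18 versus 28 minutes) show that Hospital A is typically faster.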
Counts
Finally, a few quality indicators are reported simply as counts (i.e., number) of adverse outcomes, without any specification of the population at risk. For example, the AHRQ Patient Safety Indicators for "Foreign Body Left In" and "Transfusion Reaction" are tabulated as counts at the hospital or area level, because they are extremely rare and every reported event merits investigation. These indicators are intended for surveillance purposes and not to compare performance across providers. Indicators of this type should not be used in public reporting or pay-for-performance programs, except for the limited purpose of promoting transparency.
Question 8. What specific measures can be used to calculate physician performance at the individual or organization level?
Some physician leaders and physician organizations have long been vocal advocates of measuring physician performance. In the United States, Dr. Ernest Codman at Massachusetts General Hospital was a pioneer of this movement in the first two decades of the 20th century. Organizations such as the Society of Thoracic Surgeons and the American College of Surgeons have long maintained clinical registries to which physicians contribute data on their patient outcomes, and from which they can obtain reports comparing their outcomes with external benchmarks. In 1991, the Department of Veterans Affairs launched its pioneering National VA Surgical Risk Study (NVASRS) in 44 VA medical centers, which evolved into the comprehensive program now known as the National Surgical Quality Improvement Program (NSQIP).58 However, all of these efforts focused on confidential sharing of data through peer review mechanisms, which has been shown to improve patient outcomes but does not inform the market.59
Hurdles to Physician Performance Measurement
Measuring physician performance for public reporting has been slow to take off due to concerns about variation in patient risk at the physician level,60 poor measure reliability, and limited or incomplete data for risk adjustment at the physician level. For example, Scholle, et al., report that the denominator of eligible patients for a single physician from a single data source (i.e., health plan) is generally so small that results are unreliable.61 In previous research, this problem applied even to one of the most common diseases in primary care, diabetes.62 Awareness of these reliability issues is particularly important for pay-for-performance and public reporting programs, which are based on the hypothesis that performance varies meaningfully across physicians.62
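The small-denominator problem can be made concrete with a short sketch. It uses a Wilson score interval, which is an assumption chosen here for illustration and not a method named in the studies cited above, to compare the uncertainty around the same observed rate at a hypothetical single-physician denominator and a hypothetical group-level denominator.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed rate (80%), hypothetical denominators.
physician_low, physician_high = wilson_interval(successes=16, n=20)
group_low, group_high = wilson_interval(successes=320, n=400)

print(f"single physician (n=20): {physician_low:.2f} to {physician_high:.2f}")
print(f"physician group (n=400): {group_low:.2f} to {group_high:.2f}")
```

With only 20 eligible patients the interval is several times wider than with 400, so apparent differences between individual physicians may reflect chance rather than quality.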
Improving Completeness and Reliability
Several options exist to address the small denominator dilemma: using composite measures; using group-level reporting; combining multiple years of data; and combining data sources (e.g., Medicare data and commercial carrier data). Some organizations are experimenting with reporting composite measures to enhance reliability, including HealthPlus in Michigan, which produces a public report of CAHPS® (Consumer Assessment of Healthcare Providers and Systems) and clinical measure composites by physician name. Kaplan and colleagues recently studied a national sample of 210 physicians with 7,574 diabetic patients participating in the NCQA-American Diabetes Association's Diabetes Provider Recognition Program. They reported that process and intermediate outcome measures with a substantial "physician thumbprint" could be aggregated into a composite quality score with high reliability and excellent discrimination of physicians based on the quality of their diabetes-related care.63 Additional research efforts in this area are now underway and will likely bear fruit in the next few years.
Physician group-level reporting is currently the most common approach to this problem. Several collaboratives (e.g., California's Integrated Healthcare Association, Wisconsin Healthcare Value Exchange, Massachusetts Chartered Value Exchange, and Washington-Puget Sound Health Alliance) publicly report the results of approximately 12 Healthcare Effectiveness Data and Information Set (HEDIS)-based measures related to diabetes, heart disease, asthma, preventive care, pediatric care, and depression, along with selected other measures. These measures were chosen because of the frequency of the underlying condition, the availability of national benchmarks, and the potential of the data to be available through the Centers for Medicare & Medicaid Services (CMS) Generating Medicare Physician Quality Performance Measurement Results (GEM) project. This is also the approach that has been adopted for Clinician and Group (C/G) CAHPS® reporting. However, physician group-level reporting suffers from numerous implementation problems, including the difficulty of assigning physicians who belong to multiple groups or who change groups, mandatory exclusion of physicians in solo or very small group practices, fluid group structures that may differ for different payers in the same market, and poor identification of consumers with physician groups (especially groups that have multiple sites).
Although combining multiple years of data may be an attractive option to improve the reliability of physician-level reporting, there is a serious tradeoff involved. As one reaches farther back in time to obtain sufficient data, one also loses the ability to make inferences about current or future performance. Most users find it untenable to reach back more than 3 years for historic data on quality. Such an ascertainment period may be sufficient for hospital-level measures,64, 65 and has recently been adopted by CMS for reporting on 30-day outcomes after hospitalization for myocardial infarction, heart failure, or pneumonia, but even 3 years of data are often insufficient for physician-level measures.
The final alternative to solving the small denominator dilemma involves combining health care claims data from multiple payers. This is perhaps the most attractive method for boosting reliability, because it eliminates the possibility of confusing consumers with conflicting information about the same care provided by the same physician during the same time period. As regional coordinators, chartered value exchanges (CVEs) and other collaboratives have the opportunity to drive this data collection process forward. At the national level, the Better Quality Information (BQI) project (discussed in Question 13) addressed the challenges with aggregating Medicare physician claims data and private payer data. This effort is now being moved forward by the Quality Alliance Steering Committee's Measure Implementation Work Group, which has selected Colorado and Florida as pilot sites for implementing "a nationally consistent data aggregation methodology" through a hub established by America's Health Insurance Plans Foundation. At the local level, the Wisconsin Collaborative for Healthcare Quality (which also served as one of the BQI pilot sites) has been a pioneer in collecting quality-related data from physician organizations.
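A minimal sketch of pooling results across payers follows. It assumes that every payer applies identical measure specifications and that no patient is counted by more than one payer; the payer names, physician identifiers, and counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (payer, physician, numerator, denominator).
payer_results = [
    ("payer_1", "dr_a", 8, 10),
    ("payer_2", "dr_a", 15, 20),
    ("payer_3", "dr_a", 21, 30),
    ("payer_1", "dr_b", 9, 12),
    ("payer_2", "dr_b", 18, 28),
]

# Sum numerators and denominators per physician before computing a rate.
totals = defaultdict(lambda: [0, 0])
for _payer, physician, numerator, denominator in payer_results:
    totals[physician][0] += numerator
    totals[physician][1] += denominator

pooled = {
    physician: {"numerator": num, "denominator": den, "rate": num / den}
    for physician, (num, den) in totals.items()
}

for physician, result in pooled.items():
    print(physician, result)
```

Pooling yields one rate per physician based on a larger denominator, rather than several payer-specific rates that may conflict with one another.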
Community Collaborative Example
The Wisconsin Collaborative for Healthcare Quality (WCHQ) provides a unique example of physician organization measurement. Initiated 5 years ago by physicians who were motivated to improve the measurement system used by health plans, this collaborative includes most large medical groups and is starting to engage midsized groups as well. About 40% of physicians in Wisconsin submit clinical data from electronic records and chart review on 15 "home-grown" measures (derived from HEDIS measures) related to diabetes, hypertension, cardiovascular disease, postpartum care, and preventive screening. Their model embraces an all-patient, all-payer philosophy. The denominator is derived through a three-question algorithm: "According to measure specifications, does the patient have the condition? Is this a patient who is managed by the group? Is this patient currently in the system?"
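The three-question denominator algorithm quoted above can be sketched as a simple filter. The record layout and field names below are hypothetical; how each question is actually answered from clinical data depends on WCHQ's measure specifications, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    patient_id: str
    has_condition: bool        # Q1: meets the measure's condition definition
    managed_by_group: bool     # Q2: managed by the physician group
    currently_in_system: bool  # Q3: currently in the system


def in_denominator(record: PatientRecord) -> bool:
    """A patient counts only if all three questions are answered 'yes'."""
    return (
        record.has_condition
        and record.managed_by_group
        and record.currently_in_system
    )


# Hypothetical records for illustration.
records = [
    PatientRecord("p1", True, True, True),
    PatientRecord("p2", True, False, True),
    PatientRecord("p3", True, True, False),
    PatientRecord("p4", False, True, True),
]

denominator_ids = [r.patient_id for r in records if in_denominator(r)]
print(denominator_ids)
```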
WCHQ also is piloting a registry-based submission system (RBS) for both WCHQ and CMS Physician Quality Reporting Initiative (PQRI) measures to make the data collection and validation process more efficient by allowing a few global patient files per reporting period to be submitted for aggregation. This also will expedite data validation. A revised business associate agreement permits secured patient-level data exchange for the transactions conducted by WCHQ and its associates.
Sources of Measures
Combining the criteria outlined in Question 22 with the supporting information here and suggested measure sources (Table 4) can provide a basis for selecting specific measures of physician performance. The purpose of Table 4 is to provide an overview of how the currently available measures are distributed across Institute of Medicine (IOM) domains and major source (also known as "developer" or "sponsor") organizations. However, the number of available measures changes weekly, and the same measure may qualify for two or more domains. For example, almost any measure of effectiveness can serve as a measure of equity if it is used to compare performance across populations. There is overlap between measures of patient centeredness and timeliness, in that patients expect and deserve timely care. Therefore, the numbers in this table are presented to provide an overall view of current opportunities and challenges in physician performance measurement. Community quality collaboratives should take note of each measure's specification (e.g., age range, time period), which can differ between seemingly similar measures and substantially affect results.
The answer to Question 22 lists several repositories that can be searched to identify physician performance measures in specific clinical domains, across multiple developer or sponsor organizations. The most widely used of these repositories are the AHRQ's National Quality Measures Clearinghouse and the National Quality Forum's list of endorsed standards. In addition, the AQA Alliance (described under Questions 19 and 21) offers a searchable compendium of approved performance measures that were submitted by at least five separate organizations, including some of those shown in Table 4.
Question 9. What specific measures can be used to calculate hospital performance regionally or nationally?
Hospital performance measurement for public reporting has a longer history than physician performance measurement, with more established methods. Several organizations have been involved in developing and refining hospital performance measures over the last decade, including AHRQ (HCAHPS® [Hospital Consumer Assessment of Healthcare Providers and Systems] and Quality Indicators), the Centers for Medicare & Medicaid Services (CMS) (Quality Measures Management Information System), The Joint Commission, and the Leapfrog Group.
Combining the criteria outlined in Question 22 with the supporting information here and suggested measure sources (Table 5) can provide a basis for selecting specific measures of hospital performance. The purpose of Table 5 is to provide an overview of how the currently available measures are distributed across Institute of Medicine (IOM) domains and major source (also known as "developer" or "sponsor") organizations. However, the number of available measures changes weekly, and the same measure may qualify for two or more domains. For example, almost any measure of effectiveness can serve as a measure of equity if it is used to compare performance across populations. There is overlap between measures of patient centeredness and timeliness, in that patients expect and deserve timely care. Therefore, the numbers in this table are presented to provide an overall view of current opportunities and challenges in hospital performance measurement. Community quality collaboratives should take note of each measure's specification (e.g., age range, time period), which can differ between seemingly similar measures and substantially affect results.
The answer to Question 22 lists several repositories that can be searched to identify hospital performance measures in specific clinical domains, across multiple developer or sponsor organizations. The most widely used of these repositories are AHRQ's National Quality Measures Clearinghouse and the National Quality Forum's list of endorsed standards. In addition, the Hospital Quality Alliance (described under Questions 19 and 21) notes adopted measures that were submitted by different organizations, including some of those shown in Table 5.
Question 10. What is the role and value of composite measures, and what are the most common approaches to constructing composites?
Composite measures, also known as summary or "roll-up" measures, combine individual measures into a single measure to summarize the overall quality of care delivered. AHRQ's Talking Quality Web site (www.talkingquality.gov/) defines a composite measure as "condensing a number of quality measures into a single piece of information."67 This section presents the advantages and disadvantages of composites, background on composite construction, and considerations for scoring or weighting measures. Because different composite constructs and methods are appropriate for different purposes, the specific choices that community quality collaboratives make are less important than simply describing and providing some reasonable rationale for those choices.
Advantages and Disadvantages of Composite Measures
Composite measures offer several important advantages over standalone measures, especially for public reporting and pay for performance:
- They reduce cognitive burden for consumers, making it easier for sponsors to rank provider performance and for consumers to identify high-quality providers. They also minimize the danger of "cognitive shortcuts" that sometimes lead data users to make incorrect decisions when they are trying to interpret conflicting information from different measures.68 For example, consumers may focus on one measure that they think is the most important, even if it is less informative than other measures.
- They enhance the reliability of quality measures, which is especially important at the individual physician level, because it is typically very difficult to discriminate high-performing from low-performing physicians. This feature is also important for relatively rare outcomes, such as mortality from low-risk procedures.
- They fit well conceptually with pay-for-performance programs, because the size of a provider's financial reward can be viewed as a composite measure of quality. Pay-for-performance rewards are typically based on multiple measures that are weighted and translated into dollar values. By creating and reporting composite measures, community collaboratives make this translation more explicit and set priorities to which providers can respond. However, when composites are used, providers often request to see their performance on each component of the composite so that they can decide how to concentrate their improvement efforts.22
Some potential disadvantages to using composite measures include:
- Difficulty achieving consensus on composite design and scoring. Collaboratives may choose to use nationally vetted composites, such as those used in CAHPS® (Consumer Assessment of Healthcare Providers and Systems), to shortcut the laborious design process. Aside from these unusual examples, there is no professional consensus about how to construct and score composites.
- Loss of important information if the composite combines unrelated metrics, thereby washing out meaningful differences on individual indicators (e.g., a hospital's performance on one or more specific indicators or procedures is significantly better or worse than its composite performance). Consumers may actually make the wrong decision if they use a multiprocedure composite measure to guide their choice of provider for a specific surgical procedure.
Producing Composite Measures
Two different conceptual approaches or perspectives underlie most composite measures; each approach has its advocates and detractors.69 The psychometric perspective is that an underlying, unmeasured factor, which we might call "quality," is the cause of what we observe with individual indicators. This approach is known as reflective because the observed data reflect this underlying, unmeasured factor, just as someone's IQ supposedly reflects his or her underlying intelligence. This approach requires a correlation among the measures included in the composite, because different measures can only reflect the same latent factor (i.e., quality) if they are correlated with each other. However, a problem with this approach is that different quality measures are often, in fact, uncorrelated or only weakly correlated with each other.70,71
The clinimetric perspective is not concerned about this lack of correlation; it uses clinical judgment rather than empirical analysis,72 and it is intended to guide decisionmaking rather than to measure a mysterious, latent factor.73 This approach is known as formative because the composite is formed from or defined by specific indicators, through averaging. For example, the Dow Jones Industrial Average is formed from market assessments of the value of 30 large corporations. This approach does not require any correlation among component measures. Although some authors still argue for testing composites to demonstrate their "internal consistency"74 (avoiding what some have described as combining "apples and airplanes"23), others emphasize that "both approaches have a useful role to play."69 Later in this section, we discuss how a user's perspective affects his or her choice of a weighting method.
Steps for Constructing Composite Measures
If a community quality collaborative chooses to construct composite measures, the following steps may be useful. These steps are further described in a recent report from the National Quality Forum on composite measure evaluation:
- "Identify the purpose (e.g., comprehensive assessment of adult cardiac surgery quality of care) and delineate the quality construct to be measured (e.g., four domains of cardiac surgery quality include perioperative medical care, operative care, operative mortality, and postoperative morbidity).
- Select the individual measures and/or subcomposite measures to be combined in the composite measure. This step may entail "standardizing" measures to have similar distributional properties so that they can be combined more easily.
- Ensure that the weighting and scoring of the components supports the goal that is articulated for the measure. (Should the component scores be given equal weight or different weights based on some prioritization?)
- Combine the component scores, using a specified method, into one composite (e.g., sum, average, weighted average, patient-level all-or-none scoring, etc.).
- Finally, as with all measures, the composite requires testing to determine if it is a reliable and valid indicator of quality health care."
Using these criteria,75 the National Quality Forum (NQF) endorsed three of AHRQ's four Quality Indicator composite measures: Mortality for Selected Conditions, Pediatric Patient Safety for Selected Indicators, and Patient Safety for Selected Indicators.
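The construction steps listed above (standardize the components, choose weights, and combine) can be sketched as follows. This is only one possible approach, assuming z-score standardization, a weighted average, and components that all share the same polarity (higher is better); the component names, provider values, and weights are hypothetical and are not taken from the NQF report.

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Convert raw scores to z-scores so components share a common scale."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def weighted_composite(
    components: dict[str, list[float]], weights: dict[str, float]
) -> list[float]:
    """Weighted average of standardized component scores, one value per provider."""
    z_scores = {name: standardize(vals) for name, vals in components.items()}
    total_weight = sum(weights[name] for name in components)
    n_providers = len(next(iter(components.values())))
    return [
        sum(weights[name] * z_scores[name][i] for name in components) / total_weight
        for i in range(n_providers)
    ]


# Hypothetical component scores for three providers (higher is better).
components = {
    "process_adherence_pct": [80.0, 90.0, 100.0],
    "survival_pct": [95.0, 97.0, 99.0],
}
equal_weights = {"process_adherence_pct": 1.0, "survival_pct": 1.0}

composite = weighted_composite(components, equal_weights)
print(composite)
```

The final step named in the list, testing the composite for reliability and validity, is not shown and would require real performance data.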
Reporting and Describing Composite Measures
For community quality collaboratives developing their own composite measures, Kaplan and Normand provide specific recommendations about how to report and describe composite measures23:
- "A clear and concise description of the intended use of the composite should be provided (including)... specific details regarding how the composite will be used to quantitatively measure provider performance."
- "A rationale should be provided regarding the choice of the individual performance measures that comprise the composite (for example)... they are attributable to a provider; they vary across providers; they are mutable; and they are appropriate for measurement."
- "A clear description of how the items will be aggregated is necessary... A method for handling missing data also needs to be articulated and justified."
- "Justification and definition of any case-mix variables should be described."
The practical challenge in creating summary or "roll-up" measures is to decide how much weight to put on each component measure. Under the reflective approach, composite developers typically use empirical (psychometric) methods such as factor analysis and principal components analysis. Although such techniques are complex and require statistical expertise, they have the advantage of automatically generating weights that can be used to score composites. These weights usually reflect either the degree to which an individual measure explains an unmeasured or latent factor (e.g., quality) or its measurement reliability. Using the latter approach, more reliable measures with less random error are weighted more heavily because they are presumed to provide more valuable information.76 In some cases, as for the Clinician and Group (C/G) CAHPS® survey,77 sophisticated empirical methods have been used to validate relatively simple composites with equally weighted measures.78,79
Community quality collaboratives that apply the formative approach must adopt their own weighting scheme for scoring composite measures. Reeves, et al., reported that five commonly suggested methods for calculating composite measure scores, described in Table 6, resulted in very different scores for providers. They concluded that different methods are better suited to different types of applications.80 For example, the "all-or-none" approach implicitly puts the most weight on the indicator with the poorest overall performance, so it is only appropriate when that indicator is actually the best "signal" of provider quality (or when overall performance is similar across indicators). The best candidates for all-or-none scoring are process measures "thought of as the indisputable basics of care for a given condition,"81 because "the desired outcome depends upon completion of a full set of tasks" (i.e., partial execution is simply unacceptable).80 For most applications, these assumptions do not hold and equal opportunity or equal indicator weighting is more appropriate.
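The three scoring approaches named above can be compared on the same data with a short sketch. The function definitions reflect one common reading of all-or-none, equal opportunity, and equal indicator weighting and may differ in detail from the definitions in Table 6; the patients and indicator names are hypothetical.

```python
# Each patient maps the indicators for which he or she was eligible to
# True (care delivered) or False (care not delivered). Hypothetical data;
# every patient is assumed to be eligible for at least one indicator.
patients = [
    {"a1c_test": True, "eye_exam": True, "ldl_test": True},
    {"a1c_test": True, "eye_exam": False, "ldl_test": True},
    {"a1c_test": True, "eye_exam": False},
    {"a1c_test": False, "eye_exam": True, "ldl_test": True},
]


def all_or_none(patients: list[dict[str, bool]]) -> float:
    """Share of patients who received every indicated element of care."""
    return sum(all(p.values()) for p in patients) / len(patients)


def equal_opportunity(patients: list[dict[str, bool]]) -> float:
    """Care delivered divided by total opportunities, pooled over indicators."""
    delivered = sum(sum(p.values()) for p in patients)
    opportunities = sum(len(p) for p in patients)
    return delivered / opportunities


def equal_indicator(patients: list[dict[str, bool]]) -> float:
    """Unweighted mean of the indicator-level rates."""
    indicators = sorted({name for p in patients for name in p})
    rates = []
    for name in indicators:
        eligible = [p[name] for p in patients if name in p]
        rates.append(sum(eligible) / len(eligible))
    return sum(rates) / len(rates)


scores = {
    "all_or_none": all_or_none(patients),
    "equal_opportunity": equal_opportunity(patients),
    "equal_indicator": equal_indicator(patients),
}
print(scores)
```

In this example the all-or-none score (0.25) falls below the rate for the poorest-performing indicator (0.50 for the eye exam), while the other two methods give scores near 0.73 and 0.75, which illustrates how strongly the choice of method can affect a provider's result.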
The best approach, from the social perspective, might be to weight individual measures based on their impact on population health. For example, the AHRQ composite of "Patient Safety for Selected Indicators" is currently based on equal weighting of complications, although weights based on factor analysis (i.e., shared variance related to an underlying, unmeasured factor that might be called "quality") have also been published. A future approach to this composite might assign weights based on the expected "return on investment" from preventing complications; for example, the marginal average impact of each type of complication on quality-adjusted life years,82 hospital length of stay, or costs.83,84 An alternative weighting scheme for the CMS process measures for heart attack and heart failure has recently been proposed, in which each measure's weight is based on the product of its factor loading (to reflect its correlation with an underlying construct of quality) and its population standard deviation (to reflect its range for improvement). Compared with equal-opportunity weighting, this alternative scheme generates a composite that is more strongly associated with patient outcomes (i.e., inpatient survival).85
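The alternative weighting scheme described above, in which each measure's weight is the product of its factor loading and its population standard deviation, can be sketched as follows. The factor loadings, standard deviations, and hospital rates are hypothetical, and rescaling the weights to sum to 1 is an assumption made here so that the composite stays on the same scale as the component rates.

```python
# Hypothetical inputs; none of these values come from the cited study.
measures = {
    "measure_a": {"factor_loading": 0.8, "population_sd": 0.10},
    "measure_b": {"factor_loading": 0.5, "population_sd": 0.20},
    "measure_c": {"factor_loading": 0.4, "population_sd": 0.05},
}

# Raw weight = factor loading x population standard deviation.
raw_weights = {
    name: m["factor_loading"] * m["population_sd"] for name, m in measures.items()
}

# Assumption: rescale so that the weights sum to 1.
total = sum(raw_weights.values())
weights = {name: w / total for name, w in raw_weights.items()}

# Hypothetical performance rates for one hospital.
hospital_rates = {"measure_a": 0.90, "measure_b": 0.70, "measure_c": 0.80}

weighted_score = sum(weights[name] * hospital_rates[name] for name in measures)
equal_weight_score = sum(hospital_rates.values()) / len(hospital_rates)

print(weights)
print(weighted_score, equal_weight_score)
```

Under these hypothetical inputs, the measure with the widest spread across the population receives the largest weight, so a hospital's result on that measure moves the composite more than it would under equal weighting.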
Community Collaborative Example
The California Office of the Patient Advocate (OPA) used much of the above information to redesign how it publicly reports health plan performance data. For example, in addition to reporting the seven individual HEDIS diabetes measures, OPA created its own diabetes composite that rolled all seven indicators into a "topic score" called "Diabetes Care." Similar composites were created for "Checking for Cancer," "Chlamydia Screening," "Treating Children," "Maternity Care," "Asthma Care," "Mental Health," "Heart Care," and "Treating Adults: Right Care." The design of these "topic scores" was informed both by empirical analyses of internal consistency and by structured input from consumers and other stakeholders. OPA selected "equal indicator weighting" and implemented an innovative approach to handle missing data, as explained in a technical document available on its Web site.