Categorizing High and Low Performance in Quality Reports

Some sponsors worry about creating a report which clearly signals levels of performance across plans or providers. They fear that by doing this they are “interpreting” the data rather than simply presenting it. This kind of sponsor typically just presents the absolute data on all entities included in the report, without giving any signals about differences among the scores.

This approach leaves consumers on their own, with no guidance on how large a difference must be to matter. Some may think any difference, even a tiny one, is important (especially if the only differences in performance are tiny). Others may think they only need to pay attention to really big differences.

The goal of reports is to help consumers, providers, and policymakers make fair and meaningful comparisons. It is more difficult than one might think to do that without extra help. Here are two alternative strategies to consider:

Using statistical tests to identify “significantly different” performance.
Going beyond statistics to identify differences in performance that are clinically or otherwise substantively meaningful.

Using Statistical Tests to Categorize Performance

One solution is to use statistical tests to identify statistically significant differences in the performance of health care organizations and sort the organizations into performance categories that are fair and meaningful. These categories are especially important if you are showing “relative” data, i.e., showing differences in performance through some kind of symbol, like stars or check marks. Learn more about displaying relative data in Make Graphics Self-Explanatory.

Many health care professionals believe that only statistical tests can tell you which differences really matter. These tests consider the size of the differences among pairs of scores as well as the amount of variation within the group and the number of entities being compared, both of which can have a big impact.
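As a rough illustration of how such a test can sort entities into categories, here is a minimal sketch in Python. It assumes each plan reports a simple pass rate and compares that rate to the pooled rate with a two-sided z-test. The plan names, counts, function name, and 0.05 threshold are all hypothetical, and a real report would likely need a more careful test (for example, one that adjusts for case mix or for the number of comparisons being made).

```python
from math import sqrt
from statistics import NormalDist

def categorize_by_test(successes, n, benchmark, alpha=0.05):
    """Label one entity relative to a benchmark rate using a two-sided z-test."""
    rate = successes / n
    # Standard error of a proportion if the entity truly performed at the benchmark.
    std_error = sqrt(benchmark * (1 - benchmark) / n)
    z = (rate - benchmark) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return "average"
    return "above" if z > 0 else "below"

# Hypothetical data: (patients meeting the measure, patients sampled)
plans = {"Plan A": (450, 500), "Plan B": (80, 100), "Plan C": (1500, 2000)}

# Assumption: the benchmark is the pooled rate across all plans in the report.
benchmark = sum(s for s, _ in plans.values()) / sum(n for _, n in plans.values())

labels = {name: categorize_by_test(s, n, benchmark) for name, (s, n) in plans.items()}
for name, label in labels.items():
    print(name, label)
```

In this made-up example, Plan B scores 80 percent, slightly above the pooled rate, yet with only 100 patients the test cannot distinguish it from average. Plan C, at 75 percent with 2,000 patients, is flagged as below average. Sample size matters as much as the size of the gap.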

The Challenge of Statistical Tests

If you use this approach, providers will want assurance that they deserve to be in the category where you have placed them, and not in some other (better) category. Standard statistical tests are fine for most purposes, but you may need to take extra steps to confirm that your categorization process is appropriate. Options include:

Hierarchical modeling. If some of your entities are very small compared to others, consider the use of a more sophisticated statistical technique called hierarchical modeling or “smoothing” that increases the validity of categorization for these very small entities.^[1]
Tests of misclassification. Another sophisticated approach is to conduct tests to identify the possibility that a particular provider or plan has been misclassified. When the Massachusetts Health Quality Partners were preparing to publicly report data on patients’ experiences with physician practices for the first time, researchers assuaged the providers’ concerns by conducting tests that confirmed the accuracy of the performance categories for the physician practices.^[2]
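To give a feel for what “smoothing” does, the sketch below pulls each entity's observed rate toward the overall rate, with small entities pulled harder than large ones. This is a deliberately simplified stand-in for hierarchical modeling, not the method used in the cited studies. The function name and all of the numbers are hypothetical, and the between-entity variance is simply assumed here, whereas a real hierarchical model would estimate it from the data.

```python
def shrink_rate(rate, n, overall_rate, between_var):
    """Pull an observed rate toward the overall rate based on sample size.

    between_var is the assumed variance of true rates across entities
    (an input in this sketch; a full model would estimate it).
    """
    # Sampling variance of the observed rate: larger for small entities.
    sampling_var = overall_rate * (1 - overall_rate) / n
    # Weight on the entity's own data: near 1 for large n, near 0 for small n.
    weight = between_var / (between_var + sampling_var)
    return weight * rate + (1 - weight) * overall_rate

overall = 0.80        # hypothetical overall rate
between_var = 0.0025  # hypothetical: true rates vary with a standard deviation of 5 points

# Two hypothetical practices with the same observed rate but very different sizes.
small = shrink_rate(0.60, 20, overall, between_var)
large = shrink_rate(0.60, 2000, overall, between_var)
print(round(small, 3), round(large, 3))
```

Both practices observed a 60 percent rate, but the small practice's smoothed score moves most of the way back toward the overall rate, while the large practice's score barely changes. That is why smoothing makes it less likely that a very small entity lands in a low (or high) category by chance.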

Going Beyond Statistical Tests

Sometimes, when the sample size is large enough, very small differences in performance will be found statistically significant. In those cases, clinicians and others may believe that the statistical test alone does not suffice to categorize providers, in part because the differences may not have much clinical meaning. From the consumer perspective, focusing on differences that are meaningful substantively as well as statistically is also important.

The solution is to specify a minimum numerical difference in scores that must be achieved, e.g., three percentage points. This judgment-based approach is built into the standard technique for categorizing performance on CAHPS surveys of patient experience.^[3]
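A minimal sketch of such a two-part rule appears below. It assumes scores on a 0 to 100 scale, a known standard error for each entity, and a three-point minimum difference. It is an illustration of the idea only, not the official CAHPS procedure, and the function name, threshold values, and example scores are hypothetical.

```python
from statistics import NormalDist

def categorize_with_minimum(score, std_error, benchmark, min_diff=3.0, alpha=0.05):
    """Label an entity only if it differs from the benchmark both
    statistically (p < alpha) and substantively (by at least min_diff points)."""
    diff = score - benchmark
    z = diff / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    significant = p_value < alpha
    substantive = abs(diff) >= min_diff
    if significant and substantive:
        return "above average" if diff > 0 else "below average"
    return "average"

# Hypothetical entities compared with a benchmark score of 80.
print(categorize_with_minimum(81.0, 0.4, 80.0))  # significant, but only 1 point apart
print(categorize_with_minimum(84.0, 1.0, 80.0))  # significant and 4 points apart
print(categorize_with_minimum(75.0, 3.0, 80.0))  # 5 points apart, but not significant
```

The first entity shows the large-sample case described above: its one-point gap passes the statistical test yet falls short of the minimum difference, so it stays in the “average” category.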

What If Everyone Gets Categorized as “Average”?

Some report sponsors worry that if statistical tests or more judgment-based tests of clinical significance are used, all the entities in the report will end up looking about the same, or “average.” This would reinforce the idea that there is no meaningful variation across health plans and providers, so there is no need to look at or use quality data. It is also difficult to promote your report if it essentially says that everyone’s about the same. If there is little or no difference in a small proportion of the measures in your report, the lack of variation is not a big problem—but if it is true of more than half the measures, this issue requires some serious consideration.

One approach that immediately comes to mind, but may be ill-advised, is to use a less stringent test of statistical or clinical significance, such as simply ordering the scores and putting the top quarter into a “high” category, the bottom quarter into a “low” category and everyone else in the middle. Providers are likely to be very resistant to this approach, which can also mislead people.

To genuinely address this issue, sponsors have to go back to their decision about what measures to include in a report. One criterion for selecting measures should be their ability to discriminate levels of performance. A report in which everyone looks the same on many of the measures is simply not informative or useful.
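To see why the quartile approach mentioned above can mislead, consider this small sketch. It ranks a set of hypothetical providers whose scores all fall within one point of each other and still assigns “high” and “low” labels. The provider names, scores, and function name are invented for illustration.

```python
def quartile_labels(scores):
    """Put the top quarter in 'high', the bottom quarter in 'low', the rest in 'middle'."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cut = len(ranked) // 4
    labels = {name: "middle" for name in ranked}
    for name in ranked[:cut]:
        labels[name] = "high"
    # Slicing from len(ranked) - cut yields an empty list when cut is 0.
    for name in ranked[len(ranked) - cut:]:
        labels[name] = "low"
    return labels

# Hypothetical scores on a 0 to 100 scale, all within 0.7 points of each other.
scores = {"P1": 80.7, "P2": 80.6, "P3": 80.5, "P4": 80.4,
          "P5": 80.3, "P6": 80.2, "P7": 80.1, "P8": 80.0}

labels = quartile_labels(scores)
spread = max(scores.values()) - min(scores.values())
print(labels)
print(round(spread, 1))
```

The ranking always produces a “high” and a “low” group, no matter how close the scores are. Here, two providers are labeled “low” even though they trail the “high” providers by less than a point, a gap that neither a statistical test nor a minimum-difference rule would be likely to treat as meaningful.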

^[1] Shahian DM, Normand SLT. Comparison of “Risk-Adjusted” Hospital Outcomes. Circulation 2008. 117: 1955-1963.
^[2] Safran DG, Karp M, Coltin K, Chang H, Ogren J, Rogers WH. Measuring patients’ experiences with individual primary care physicians. Results of a statewide demonstration program. Journal of General Internal Medicine 2006. 21(1): 13-21.
^[3] Elliott MN, Zaslavsky AM, Goldstein E, Lehrman W, Hambarsoomians K, Beckett MK, Giordano L. Effects of Survey Mode, Patient Mix, and Nonresponse on CAHPS Hospital Survey Scores. Health Services Research 2009. 44(2p1): 501-508. Also see: Cohen, J. Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. 1988.
Farley DO, Hays RD, Elliott MN. When Does A Plan Differ Significantly from Other Plans? Statistical and Practical Significance Criteria of Star Ratings for the New Jersey CAHPS Survey. RAND Corporation unrestricted draft 1997. Document Number: DRU-1885-AHCPR. Available at http://www.rand.org:80/pubs/drafts/DRU1885/.
