Overarching Methodological Issue: Performance Misclassification
Methodological Considerations in Generating Provider Performance Scores
A. What is performance misclassification?
The misclassification of provider performance is an overarching methodological issue in creating performance reports. Performance misclassification refers to reporting a provider's performance in a way that does not reflect the provider's true performance. For example, a report may contain three performance categories (e.g., bottom quartile, middle two quartiles, and top quartile), and for a given provider, performance may be reported as being in category 1 when true performance is in category 2.
Misclassification is a familiar concept in legal proceedings. Courts are imperfect: they sometimes convict the innocent and acquit the guilty. However, despite the presence of this misclassification, courts are generally believed to serve a useful social function.
Misclassifying providers is distinct from displaying performance results in a way that is difficult to understand and that confuses patients and providers. Even if a report is perfectly clear, and each patient and provider thoroughly understands its contents, performance misclassification can still lead to suboptimal results. For example, patients may go to truly low-performing providers, thinking they are high performing (as shown in Figure 3). Performance misclassification also may lead some low-performing providers to falsely believe that they have high performance, discouraging efforts to improve.
B. Why is performance misclassification important?
In a sense, all reports of provider performance classify the providers. For example, providers can be classified relative to each other or relative to a specified level of performance (e.g., above or below national average performance). Provider rankings are also a kind of classification system, since each rank is a class. Even reports that show performance scores and confidence intervals enable users to classify providers by comparing their performance. For example, report users might be able to see whether a provider's performance is different from the average performance. Alternatively, report users might just rank providers' performance scores, ignoring the confidence intervals.
Fundamentally, reports that misclassify the performance of too many providers (and misclassify them by too great an amount) may fall short of their best possible impact on the health care received by the local patient population of a Chartered Value Exchange (CVE). For example, as shown in Figure 3, if greater shares of providers are misclassified, then more patients may choose low-performing providers, erroneously believing that they are high performing. Performance misclassification may even cause patients to leave high-performing providers, disrupting clinical relationships.
Higher rates of misclassification also will lead more providers to receive the wrong messages from performance reports. More low-performing providers may not attempt to improve, mistakenly believing that they are high performing. Providers also may prioritize the wrong areas for improvement, devoting scarce resources to areas in which they are truly doing fine and ignoring areas in which they could truly improve—again, because the report has misclassified their performance. Finally, high degrees of performance misclassification may threaten the stakeholder coalitions that are central to the success of a CVE.
C. What causes performance misclassification?
There are two general types of performance misclassification, and they have different causes. The first type is systematic performance misclassification. This kind of misclassification occurs when, for example, provider performance ratings are influenced by something beyond the provider's control (such as unusually high numbers of older patients). If providers are measured on the mortality rates of their patients, then the measured performance of providers with older patient populations will systematically look worse than their true performance. This is because older patients tend to have higher mortality rates than younger patients, all other things being equal. Most CVE stakeholders will agree that this kind of systematic performance misclassification is undesirable.
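The mortality example can be made concrete with a small numerical sketch. Everything in it is hypothetical: the two providers, the age groups, the mortality rates, and the patient counts are invented for illustration. The adjustment shown (direct standardization to a shared reference population) is only one of several possible approaches and is not presented as the method this report recommends. Both providers are given identical age-specific mortality rates, so their true performance is the same by construction.

```python
# Illustrative sketch only: all rates and patient counts below are hypothetical.

# Both providers have the SAME age-specific mortality rates, so their true
# performance is identical by construction.
AGE_SPECIFIC_MORTALITY = {"under_65": 0.02, "65_and_over": 0.10}

# Provider B treats a much older patient population than provider A.
PATIENT_COUNTS = {
    "A": {"under_65": 800, "65_and_over": 200},
    "B": {"under_65": 200, "65_and_over": 800},
}

# A shared (hypothetical) reference population used for direct standardization.
REFERENCE_POPULATION = {"under_65": 1000, "65_and_over": 1000}


def crude_mortality(counts, rates):
    """Expected deaths divided by total patients, ignoring age mix."""
    expected_deaths = sum(n * rates[group] for group, n in counts.items())
    return expected_deaths / sum(counts.values())


def standardized_mortality(rates, reference):
    """Apply a provider's age-specific rates to the reference population."""
    expected_deaths = sum(n * rates[group] for group, n in reference.items())
    return expected_deaths / sum(reference.values())


crude = {name: crude_mortality(counts, AGE_SPECIFIC_MORTALITY)
         for name, counts in PATIENT_COUNTS.items()}
adjusted = {name: standardized_mortality(AGE_SPECIFIC_MORTALITY, REFERENCE_POPULATION)
            for name in PATIENT_COUNTS}

for name in PATIENT_COUNTS:
    print(f"Provider {name}: crude {crude[name]:.3f}, "
          f"age-standardized {adjusted[name]:.3f}")
```

Under these assumptions, provider B's crude mortality rate (about 8.4 percent) is more than double provider A's (about 3.6 percent), even though the two providers perform identically within each age group. Once both are compared against the same reference population, the difference disappears.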
Baseball presents a useful analogy for thinking about systematic misclassification. Batters in an unusually competitive part of the league may face unusually skilled pitchers. If we only look at their batting averages, without paying attention to the pitchers they faced, the measured performance of these batters will systematically look worse than their true performance. Similarly, batters in less competitive parts of the league may face relatively unskilled pitchers (so getting a hit is relatively easy), and their measured performance will systematically look better than their true performance.
The second type of performance misclassification is misclassification due to chance. This kind of misclassification occurs because any time performance is measured, there will always be some amount of random measurement error. Because measurement error is unavoidable, any report that contains more than one category (i.e., a report that enables any kind of comparison between providers) carries some risk of misclassification due to chance for every provider it includes.
Many CVE stakeholders may already have discussed misclassification due to chance without realizing it. Debates over "minimum sample sizes" (or how many patients need to be included in a performance measure before it can be reported) are an intuitive way to think about misclassification due to chance. Stakeholders would be right to wonder, "If a given provider has only had a handful of patients, how can we really know anything about the provider's true performance?"
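The link between sample size and misclassification due to chance can be sketched with a simple calculation. The numbers are assumptions chosen for illustration: a provider whose true success rate on some measure is 70 percent, and a report that labels a provider "high performing" when the observed rate is at least 75 percent. Treating each patient as an independent trial (a simplifying assumption), the binomial distribution gives the probability that this provider is reported as high performing purely by chance.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_coefficient = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1))
    return math.exp(log_coefficient + k * math.log(p) + (n - k) * math.log(1 - p))


def chance_of_looking_high(n, true_rate, cutoff):
    """Probability that the observed rate reaches the cutoff purely by chance."""
    min_successes = math.ceil(cutoff * n)
    return sum(binomial_pmf(k, n, true_rate) for k in range(min_successes, n + 1))


TRUE_RATE = 0.70  # hypothetical true performance
CUTOFF = 0.75     # hypothetical threshold for the "high performing" category

risk_by_sample_size = {n: chance_of_looking_high(n, TRUE_RATE, CUTOFF)
                       for n in (20, 100, 500)}

for n, risk in risk_by_sample_size.items():
    print(f"{n:>4} patients: {risk:.1%} chance of being reported as high performing")
```

Under these assumptions, the risk of misclassification is substantial with 20 patients and falls steadily as the number of patients grows, which is the intuition behind setting a minimum sample size.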
Baseball also presents a useful analogy for thinking about misclassification due to chance. Suppose a rookie has an especially good first game, getting a hit in 3 of 4 times at bat. Based on this first game, his or her batting average will be 0.750. Are we then to assume this player is the greatest hitter of all time? If we looked only at the batting average, without paying attention to sample size (n = 4), we would have no other choice.
Most people will intuitively agree that induction into the Baseball Hall of Fame would be premature. Regardless of skill, the rookie was probably lucky in the first game. Baseball players' batting performance can vary dramatically from game to game, and the observed batting average of 0.750 far exceeds the full range of batting averages normally seen, even among great players. Therefore, classifying the rookie as "greatest ever" would run a very high risk of misclassification due to chance.
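One way to quantify how little a single game reveals is to put a confidence interval around the rookie's average. The sketch below uses the Wilson score interval, a standard interval for a proportion, with z = 1.96 for roughly 95 percent confidence. The choice of interval and the 400 at-bat comparison are assumptions made for illustration, and at-bats are treated as independent trials.

```python
import math


def wilson_interval(hits, at_bats, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    observed = hits / at_bats
    denominator = 1 + z**2 / at_bats
    center = (observed + z**2 / (2 * at_bats)) / denominator
    half_width = (z / denominator) * math.sqrt(
        observed * (1 - observed) / at_bats + z**2 / (4 * at_bats**2)
    )
    return center - half_width, center + half_width


# The rookie's first game: 3 hits in 4 at-bats.
low, high = wilson_interval(hits=3, at_bats=4)
print(f"After 4 at-bats: observed 0.750, interval {low:.3f} to {high:.3f}")

# Hypothetical comparison: the same 0.750 average sustained over 400 at-bats.
low_400, high_400 = wilson_interval(hits=300, at_bats=400)
print(f"After 400 at-bats: observed 0.750, interval {low_400:.3f} to {high_400:.3f}")
```

With only four at-bats, the interval runs from roughly 0.30 to 0.95. It is so wide that the data cannot distinguish an ordinary hitter from an extraordinary one. The same average observed over 400 at-bats would produce a far narrower interval.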
It is fair to ask: "How many times at bat would be needed before we know how good a baseball player really is?" In other words, how many observations are needed before we feel reasonably confident about predicting the player's future performance? Thirty times at bat? One hundred times? A season? Multiple seasons? The best answer may depend on the purposes for which the performance data will be used and on the player's calculated risk of performance misclassification.
If performance data will be used to decide whether to include a player in a team's starting lineup for just a few games, then a relatively high risk of misclassification may be tolerable. On the other hand, if performance data will be used to offer the player a multiyear contract, then team managers may be willing to accept only a small risk of misclassification. When millions of dollars and multiple seasons are on the line, they will probably want as much certainty about future performance as possible.
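The tradeoff between purpose and required certainty can be sketched with the usual normal-approximation sample size formula for a proportion, n = z^2 * p * (1 - p) / E^2, where E is the tolerated margin of error. The assumed true average (0.270) and the two margins below are hypothetical, and the margin of error is used here only as a rough stand-in for the tolerated risk of misclassification.

```python
import math


def at_bats_needed(margin, assumed_average=0.270, z=1.96):
    """At-bats needed for a ~95% interval of about +/- margin (normal approximation)."""
    variance = assumed_average * (1 - assumed_average)
    return math.ceil(z**2 * variance / margin**2)


# Hypothetical tolerances: loose for a short-term lineup decision,
# tight for a multiyear contract.
lineup_decision = at_bats_needed(margin=0.10)
contract_decision = at_bats_needed(margin=0.02)

print(f"Margin of +/- 0.100: about {lineup_decision} at-bats")
print(f"Margin of +/- 0.020: about {contract_decision} at-bats")
```

Under these assumptions, a loose tolerance is met after about 76 at-bats, while a tight one requires about 1,893, which is more than a single season of play. Tightening the margin by a factor of five raises the required number of observations by a factor of about twenty-five.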
The statistical issues in this baseball example are similar to the methodological issues facing CVEs, community collaboratives, and other organizations interested in creating public reports of performance of health care providers. Readers interested in more detailed information about the ways performance can be misclassified are encouraged to consult two appendixes to the report: Appendix 1: Systematic performance misclassification, and Appendix 2: Performance misclassification due to chance.