
Evaluating the Impact of Value-Based Purchasing: A Guide for Purchasers

Example: Pre-Test/Post-Test

An Evaluation of a Type II Diabetes Disease Management Program in an HMO Setting

Health Care Provider. Geisinger Health System, a large mixed-model HMO in Pennsylvania.

Description of Research Activity. In 1996, a steering group of primary care physicians and endocrinologists, clinical nurse specialists, dieticians certified in diabetes, and HMO representatives initiated a diabetes disease management program aimed at better outcomes for patients with diabetes mellitus. This program consisted of several components, including self-management education, coverage for glucometers, extensive database linkage and management, and strong leadership commitment. As part of the program, the group formulated practice guidelines based on widely published professional literature and trained physicians in adopting and following the guidelines. Highly trained and specialized education nurses taught self-management to patients. Program participation was voluntary, but the HMO offered coverage for glucose meters and strips as an incentive for enrollees. The purpose of the evaluation was to assess the impact of the diabetes management program on relevant outcomes.

Evaluators. Initially, the HMO analyzed the data internally, but the process was not as formal as the plan leadership desired. Subsequently, the plan consulted with Pennsylvania State University researchers to carry out an outcomes evaluation of the program.

Research Design. Both the internal analysts and the consultants used a pre-test/post-test design to track changes in clinical, cost, and health status measures over time.

Data and Measures. The group collected an extensive set of clinical, administrative, and patient quality-of-life data at inception and at periodic followup points for enrolled patients and later linked the data to allow for statistical analysis. The internal analysis compared mean HbA1c levels and diabetes expenditures before and after the intervention. The steering committee received these reports on a monthly basis.

The more formal analysis examined diabetes outcome-related variables such as HbA1c levels and cardiovascular clinical variables such as HDL and LDL, self-reported health status (based on responses to the SF-36® survey), and degree of compliance with the guidelines as evaluated by the specialized nurse's review of the patients' medical charts.

Methods. For continuous variables, the evaluators used t-tests to detect differences in the means of the outcomes variables of interest (e.g., HbA1c levels) from pre-intervention to post-intervention. They used chi-square tests for categorical variables, such as the occurrence of episodes of hypoglycemia.
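As a rough illustration of these two tests, the sketch below uses made-up numbers rather than the Geisinger data; the values are purely hypothetical.

```python
# Minimal sketch of the pre-test/post-test comparisons described above,
# using scipy. All numbers are hypothetical, for illustration only.
import numpy as np
from scipy import stats

# Paired HbA1c levels (%) for the same patients before and after the program
hba1c_pre = np.array([8.9, 9.4, 7.8, 10.1, 8.2, 9.0])
hba1c_post = np.array([8.1, 8.8, 7.5, 9.2, 8.0, 8.4])
t_stat, p_value = stats.ttest_rel(hba1c_pre, hba1c_post)
print(f"Paired t-test on HbA1c: t = {t_stat:.2f}, p = {p_value:.3f}")

# Patients with/without a hypoglycemia episode, pre- vs. post-intervention
table = np.array([[30, 170],    # pre:  episode, no episode
                  [18, 182]])   # post: episode, no episode
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square test: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```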

Results. The most salient finding so far is the significant improvement in the clinical indicators of the program (such as HbA1c, HDL, and LDL). This does not appear to translate into patient-perceived improvements in health status, physically or emotionally, although there have been some improvements on the mental health and vitality scores, as illustrated in the data below.

SF-36 Domains                 Time 1 Mean   Time 1 S.D.   Time 2 Mean   Time 2 S.D.
Physical Functioning          67.95         27.2          68.97         28.55
Role-Physical                 59.71         41.49         60.81         42.62
Body Pain Index               65.27         26.78         63.78         26.44
General Health Perceptions    60.19         19.95         59.68         21.04
Vitality                      53.34         20.24         56.11         19.97¹
Social Functioning            81.69         22.34         82.23         23.37
Role-Emotional                72.86         38.92         74.63         37.83
Mental Health Index           69.95         18.34         72.99         16.73¹

¹ p < .05 for paired t-tests.

Advantages and Disadvantages of the Evaluation Strategy. A comparison of clinical indicators over time was a clear and easy way to assess the results of the intervention. However, because the project design lacked a comparison group, it is not certain that the results are attributable to the intervention (i.e., the disease management program). Although a comparison group would be desirable, ideally a randomly assigned one, this would require that some patients not be eligible for the diabetes management program; for a number of reasons, the health plan would not consider this option.

Source: Geisinger Health System.

Cross-Sectional Design With Comparison Group (or Static Group Comparison). This design is similar to the cross-sectional design discussed earlier in that observations are made only after the intervention has been implemented. However, in this variation, a comparison group is introduced. That is, evaluators identify and observe a comparison group that is similar to the group or population under study, but has not received the VBP intervention. The assumption is that what is observed for the comparison group is what would have been observed in the intervention group in the absence of the intervention. In this sense, the comparison group provides a measure of what was "expected" in the absence of an intervention, which can be compared with what was actually observed for the intervention group.

[Figure 4]

To implement this design, researchers gather observations at the same point in time for the treatment and comparison group, using the same measurement approaches and variable definitions. They can make one observation after the intervention, or multiple observations over time. As with the pre-test/post-test design, evaluators would use multivariate analysis to test for statistically significant differences in outcomes between the intervention group and the comparison group. If data are not available at the individual level (e.g., data only exist at the hospital level), there may be another level of observation that will permit multivariate analysis. For example, in a situation where an intervention occurred at 100 hospitals and the comparison group was composed of 100 hospitals that did not receive the intervention, you could conduct the analysis using the hospital rather than the individual as the unit of observation as long as you can control for hospital characteristics and hospital-level measures of casemix.

When this is not possible, statistical testing may still be feasible if the evaluators know the appropriate sample sizes, so that they can construct standard errors around the estimated means. For example, suppose that the outcome of interest is mortality per 1,000 admissions and only aggregate data are available (e.g., mortality rates for all hospitals in the intervention and comparison groups). The evaluators can test whether mortality per 1,000 admissions in the intervention group differs from that in the comparison group (assuming a similar casemix) if they have access to data on the number of admissions or deaths for hospitals in the intervention and comparison groups.
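A minimal sketch of such an aggregate-level test, using hypothetical counts, appears below; the two-proportion z-test is one reasonable choice, though the guide does not prescribe a specific test.

```python
# Test whether mortality per 1,000 admissions differs between intervention
# and comparison hospitals using only aggregate counts. Numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

deaths = [460, 520]            # total deaths: intervention, comparison
admissions = [25000, 24000]    # total admissions: intervention, comparison

z_stat, p_value = proportions_ztest(count=deaths, nobs=admissions)
print(f"Rates: {1000 * deaths[0] / admissions[0]:.1f} vs. "
      f"{1000 * deaths[1] / admissions[1]:.1f} deaths per 1,000 admissions")
print(f"Two-proportion z-test: z = {z_stat:.2f}, p = {p_value:.3f}")
```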

In still other cases, the individual data may not be available and may not even be meaningful to consider; so statistical tests are not possible. For example, suppose that you want to know the impact of a VBP activity on premiums; the intervention group consists of one employer offering one plan and the comparison group consists of another employer offering another plan. You can observe if the price of the plan offered to the VBP employer is lower than the price in the comparison group, but you cannot statistically test this proposition. A statistical analysis can only be done using the plan as the unit of observation if there are a sufficient number of plans (and perhaps employers) in the treatment and comparison group so that the analysis can use aggregated data at the level of the plans, or even the employers.

Selecting a Comparison Group

Evaluators can use any of several methods to select a comparison, or control, group. In a case control approach, the comparison group is chosen to match the intervention group on specific characteristics thought to be important. Another approach is to pick a population that is thought to be similar to the intervention group and for which data are available for comparison purposes. A third approach is to use national or regionally available statistics as standards for comparison (e.g., NCQA's Quality Compass database of HEDIS® measures).
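As a toy illustration of the first (case control) approach, the sketch below matches each intervention member to the most similar candidate in a comparison pool. The matching variables and data are hypothetical, and a real evaluation would match on many more characteristics.

```python
# Nearest-neighbor matching: exact on gender, closest on age.
# All data are hypothetical; matching is with replacement for simplicity.
import pandas as pd

treated = pd.DataFrame({"id": [1, 2], "age": [45, 62], "female": [1, 0]})
pool = pd.DataFrame({"id": [10, 11, 12, 13],
                     "age": [44, 63, 50, 61],
                     "female": [1, 0, 1, 0]})

matches = []
for _, t in treated.iterrows():
    candidates = pool[pool.female == t.female]          # exact match on gender
    best = candidates.iloc[(candidates.age - t.age).abs().argmin()]
    matches.append((int(t.id), int(best.id)))
print(matches)  # [(1, 10), (2, 11)]
```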

Sometimes groups within a population are randomly assigned either to receive the intervention or to be in a comparison group. Although random assignment to groups provides the highest level of control and strength regarding the ability to establish causal relationships, it is difficult to use this approach with VBP activities since the randomization would likely have to be at the level of a clinic, hospital, or subset of employees. For both business and political reasons, it is rarely feasible to treat these organizations or individuals differently.

Advantages of This Approach. In contrast to designs that do not use a comparison group, this design allows the evaluator to draw stronger inferences regarding the impact of the VBP activity. In addition, since this design does not involve observations made before the intervention is implemented, only post-intervention data are required. As a result, this design can be used in cases where a purchaser did not consider conducting an evaluation until after the intervention had already been implemented.

Drawbacks of This Approach. The weaknesses in this design relate to the extent to which the comparison group can be assumed to be just like the intervention group except for exposure to the intervention. Any observed difference between the treatment and comparison groups could represent an intervention effect; yet observed differences are also cause for concern because they might reflect group rather than intervention effects. In some cases, these concerns may be addressed statistically. For example, if the intervention group and comparison group each consist of many hospitals, the analysis might control for hospital characteristics (although this requires a sufficient number of hospitals).

Casemix adjustment is a prime example of the kind of statistical controls that may be necessary. Yet, even after casemix adjustment, there may be other differences between the two groups that are not observable to researchers. The literature on small area variations suggests that these differences could be very substantial (Wennberg and Gittelsohn, 1973; Feldstein, 1993). Thus, any differences in observed outcomes may in fact be due to selection bias (i.e., the comparison and intervention groups differ on variables not observed by the researcher) or to differences in exposures other than the VBP activities.

Options Available for Casemix Adjustment

There are a variety of systems available today to meet the diverse needs for casemix adjustment. Purchasers should be aware that the adjustments for costs may differ from the adjustments for quality outcomes. Also, different quality measures may require different types of casemix adjustment.

The most straightforward casemix adjustment is for age and gender. Other adjustments include comorbidity indices; several systems exist at the general population level, such as the "Johns Hopkins ACG (Adjusted Clinical Groups) Case-Mix System" (Johns Hopkins University, 2001). Other indices exist for casemix adjustments in specific clinical areas. This step can be performed at the analysis phase or when constructing the variables.
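As a simple illustration of the most straightforward adjustment, the regression sketch below compares costs across groups while controlling for age and gender; the variable names and data are hypothetical.

```python
# Age- and gender-adjusted group comparison via ordinary least squares.
# The coefficient on 'intervention' is the group difference net of age/gender.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "cost":         [4200, 3900, 5100, 4800, 3600, 4400, 5300, 4100],
    "intervention": [1, 1, 1, 1, 0, 0, 0, 0],
    "age":          [54, 61, 70, 48, 66, 59, 72, 50],
    "female":       [1, 0, 1, 0, 1, 1, 0, 0],
})
model = smf.ols("cost ~ intervention + age + female", data=df).fit()
print(model.params)
```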

Another concern is spillover effects (or contamination of the comparison group). If the treatment and comparison groups are in close proximity, the VBP activity may affect both groups. For example, in a situation where salaried workers receive health plan performance information while hourly workers serve as the comparison group, the information might spread to the hourly workers, potentially muting the effects of disseminating the report card. Similarly, if an intervention involves changing provider behavior with respect to HMO patients, the providers may change how they handle all patients, including non-HMO patients. If these providers treat patients in both the intervention group and comparison group, an evaluation could underestimate the effects of the VBP activity.

A related concern arises when activities similar in spirit are occurring in the comparison group. If this is the case, this research design will capture the extent to which the effects of specific VBP activities differ from those of ongoing activities in the comparison group. But this is different from asking how the VBP activity altered outcomes relative to a scenario in which no activities occurred.

Example: Cross-Sectional Design With Comparison Group

An Evaluation of the Impact of a CAHPS® Report Card on Medicaid Enrollment

Purchaser. New Jersey Medicaid Office of Managed Care.

Description of the Research Activity. In the summer and fall of 1997, the New Jersey Medicaid Office of Managed Care conducted its first CAHPS® survey of enrollees in its mandatory Medicaid managed care program. The office subsequently published a seven-page brochure, "Choosing an HMO," that compared available Medicaid HMOs based on the collected CAHPS® data. The brochure was included with the enrollment materials sent to half of the newly eligible Medicaid cases during a four-week period in the spring of 1998. All newly eligible cases received the standard enrollment materials, but randomization was used to determine which new cases received the CAHPS® material; the experimental group consisted of 2,649 cases and the comparison group consisted of 2,568 cases. An evaluation was done to determine the impact of the CAHPS® report on the enrollment decisions of newly eligible Medicaid cases.

Evaluators. The evaluation was conducted by academic researchers from the RAND Corporation and Pennsylvania State University.

Research Design. The researchers used a cross-sectional design with a random comparison group to determine whether availability of the CAHPS® report affected enrollees' decisions.

Data and Measures. To examine the impact of the CAHPS® report on plan enrollment and the utility of the CAHPS® report in plan selection, the researchers relied on data from two sources. Plan selection and demographic data for the comparison and experimental groups came from the New Jersey Medicaid Office. Survey data came from a post-enrollment survey of a random sample of newly eligible Medicaid cases in the comparison and experimental groups.

Methods. After the enrollment process was completed, the analysts drew a sample of 2,550 cases from the experimental and comparison groups and surveyed these cases about the CAHPS® report and their enrollment decisions. The followup survey was specifically designed to assess whether the experimental group received the CAHPS® report and incorporated it into the plan enrollment process. Because the experimental design randomly assigned new Medicaid cases to the experimental and comparison groups, t-tests could be used to test for statistically significant differences in the mean values of outcomes for the two groups. The analysts also employed multivariate logistic regression to examine the probability of having seen and used the CAHPS® reports, and to assess why one plan that scored relatively low on the CAHPS® report achieved a high level of enrollment.
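The sketch below shows the general shape of these two analyses on simulated data; it is not the study's actual model, and all variable names are invented for illustration.

```python
# t-test across randomized groups plus a logistic regression for the
# probability of having seen and used the report. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "got_report": rng.integers(0, 2, n),   # randomized assignment
    "age": rng.integers(18, 65, n),
})
# Simulated outcomes: report use is far more likely in the experimental
# group; plan choice is simulated with no true group effect.
df["used_report"] = (rng.random(n) <
                     np.where(df.got_report == 1, 0.5, 0.05)).astype(int)
df["chose_top_plan"] = rng.integers(0, 2, n)

exp = df.loc[df.got_report == 1, "chose_top_plan"]
ctl = df.loc[df.got_report == 0, "chose_top_plan"]
print(stats.ttest_ind(exp, ctl))           # difference in mean outcomes

logit = smf.logit("used_report ~ got_report + age", data=df).fit(disp=False)
print(logit.params)                        # probability of using the report
```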

Results. Only half of the cases that received the CAHPS® report reported having looked at it. There were no statistical differences in the pattern of plan enrollment between those who received the CAHPS® report and those who did not. One plan, referred to as the dominant plan, had relatively low ratings on the CAHPS® report but still achieved significant enrollment; this suggests that something about this HMO that was not evident to the evaluators was appealing to Medicaid beneficiaries. When the analysts examined enrollment patterns for the subset of the sample that did not choose the dominant HMO and reported looking at the CAHPS® report, they found that these individuals chose better plans on average than did the comparison group. The results suggest that for report cards to be effective at changing plan enrollment, considerable effort is needed to make sure that consumers receive and read these reports.

Advantages and Disadvantages of the Evaluation Strategy. The primary advantage of this evaluation and research design was the ability to randomly distribute the CAHPS® report to a subset of new Medicaid cases. Randomization controlled for differences in important individual characteristics and allowed the researchers to focus on the effect of the report card.

A major disadvantage was that, despite randomization, the study design could not guarantee that all members of the experimental group looked at or even received the CAHPS® report. Since only half of the experimental group reported examining the report, it would have been difficult for the evaluation to detect an effect in the intervention group. This prompted the analysts to focus the analysis on the non-random subset of the experimental group that reported examining the report. An additional limitation was the inability of the design to control for and explain the finding that one dominant plan received significant enrollment despite poor performance on the CAHPS® reports.

Source: Farley DO, Short PF, Elliot MN, et al. Effects of CAHPS® Health Plan Performance Information on Plan Choices by New Jersey Medicaid Beneficiaries. Health Services Research 2002 (in press).

Nonequivalent Comparison Group. This approach combines the strengths of the pre-test/post-test design with those of the cross-sectional design with comparison group. In the nonequivalent comparison group design, analysts make both pre-intervention and post-intervention observations for the intervention group as well as for a comparison group that is not receiving the intervention. This design uses the comparison group to control for factors that threaten the validity of the pre-test/post-test design. Similarly, it uses differences between the comparison and intervention groups prior to the intervention to control for unobserved factors that would have confounded the cross-sectional design with comparison group.

[Figure 5]

Advantages of This Approach. The primary benefit of this design is that it controls for several of the "rival hypotheses" that threaten the other designs described earlier. To the extent that the two groups are the same except for the experience of the intervention, this design controls for trend effects and for simultaneous historical events or exposures (i.e., the possibility that something else occurring at the same time as the intervention is responsible for any observed changes).

Drawbacks of This Approach. This design is subject to the threat of spillover or contamination effects, which could cause analysts to underestimate any effects from the intervention. In addition, in the absence of random assignment to the intervention and comparison group, it is possible that the two groups are not identical (or "nonequivalent"), leaving aside their exposure to the intervention. However, unlike the cross-sectional comparison group design, differences between the groups are only a threat to validity if they vary over time. The multivariate longitudinal analysis will adjust for any difference between groups that is constant over time. For example, if the intervention group were in an urban area and the comparison group in a rural area, one would expect health care costs to differ between the groups. But as long as the differences are reasonably constant over time, they will not bias the analysis.
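What this paragraph describes is, in essence, a difference-in-differences adjustment (our label, not the guide's). A minimal sketch with hypothetical cost data:

```python
# The group term absorbs any time-constant difference between groups (e.g.,
# urban vs. rural cost levels); the group-by-post interaction is then the
# estimated intervention effect. All numbers are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "cost":  [410, 395, 430, 505, 300, 310, 295, 305],
    "group": [1, 1, 1, 1, 0, 0, 0, 0],   # 1 = intervention group
    "post":  [0, 0, 1, 1, 0, 0, 1, 1],   # 1 = post-intervention period
})
did = smf.ols("cost ~ group + post + group:post", data=df).fit()
print(did.params["group:post"])  # estimated intervention effect (70.0 here)
```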

This design may also suffer from the threat of selection bias, which occurs when researchers cannot observe all differences between groups or fully understand if those differences are constant or variable over time. However, if random assignment to groups is possible, this design becomes a randomized controlled trial, which is considered the gold standard design for establishing cause-effect relationships in intervention research because it controls for selection bias and all other threats to internal validity.

Randomization could occur at the individual level or at the facility/site level. For example, to facilitate evaluation, individuals could be randomly selected to participate in a disease management program that is a component of a VBP activity, or employers with multiple locations may opt to implement VBP activities in only selected sites. However, for most types of VBP activities and interventions, randomization will not be possible.

Example: Nonequivalent Comparison Group

An Evaluation of the Impact of an HMO Report Card

Purchaser. General Motors Corporation (GM).

Description of Research Activity. During the fall 1996 open-enrollment period (for 1997 enrollment), GM issued its first health plan performance report card to active salaried employees. The report card contained ratings on eight dimensions for each HMO available to active employees: NCQA accreditation status, benchmark HMO, patient satisfaction, medical-surgical care, women's health, preventive care, access to care, and operational performance. For the five dimensions based on HEDIS data, each plan received a designation of one to three diamonds, signifying "below expected performance," "average performance," or "superior performance." Some plans that could not provide HEDIS data received a "no data" designation. Because of the terms of GM's contract with the United Auto Workers (UAW), the company did not provide the report card to active hourly workers. An evaluation was conducted to measure the impact of the report card on enrollment while controlling for other important factors that might affect employees' decisions, such as out-of-pocket price.

Evaluators. The evaluation was conducted by researchers affiliated with Pennsylvania State University and the University of Michigan.

Research Design. The researchers used a nonequivalent comparison group design.

Data and Measures. GM's benefit consultant provided enrollment data files, including plan offerings by ZIP Code of residence and out-of-pocket prices by coverage category, for the period before the release of the report card (1996) and after the release (1997). Employee identification data were encrypted to protect confidentiality.

Methods. The analysts constructed regression models to predict the probability that an employee would enroll in one of the available plans as a function of the out-of-pocket price of the plan and the report card ratings. For statistical reasons, observations on individual employees were aggregated to calculate health plan market shares. The evaluators also performed a regression analysis to see whether and how health plan market share was related to out-of-pocket price and the report card rating variables. Additionally, since hourly employees did not receive the report cards but had access to the same plans with no out-of-pocket cost, the regression analysis included the market share of plans for hourly employees in order to control for important time-varying information unobserved by the researchers.
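A rough sketch of a market-share regression of this general form appears below; the plan-level data and variable names are hypothetical, and the published analysis was more elaborate.

```python
# Plan market share among salaried employees modeled as a function of
# out-of-pocket price and ratings, with the hourly-employee share of the
# same plan included to control for unobserved plan attributes. Data invented.
import pandas as pd
import statsmodels.formula.api as smf

plans = pd.DataFrame({
    "salaried_share": [0.32, 0.18, 0.25, 0.10, 0.15, 0.28, 0.22, 0.09],
    "oop_price":      [25.0, 42.0, 30.0, 55.0, 48.0, 28.0, 35.0, 60.0],
    "n_below_avg":    [0, 2, 1, 4, 3, 0, 1, 5],  # below-average rating count
    "hourly_share":   [0.30, 0.20, 0.24, 0.12, 0.16, 0.26, 0.23, 0.11],
})
model = smf.ols("salaried_share ~ oop_price + n_below_avg + hourly_share",
                data=plans).fit()
print(model.params)
```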

Results. The results indicate that out-of-pocket price is a significant predictor of the health plans that employees select. The results also suggest that, although employees did not appear to gravitate toward plans rated highly by the report card, they did seem to avoid plans with many below-average ratings, although the effect was not large. The primary implication for purchasers is that report card efforts can influence health plan choices and that employees may be more sensitive to negative ratings than to positive ratings.

Advantages and Disadvantages of Evaluation Strategy. The primary advantage of this analytic approach is the ability to isolate the separate effects of price and report card ratings on the probability of enrollment. Less rigorous methods may have improperly attributed plan switching to the report card. The primary disadvantages are the technical sophistication and time involved in performing such an analysis and the assumption that hourly and salaried employees have similar plan preferences.

Source: Scanlon DP, Chernew M, McLaughlin C et al., The Impact of Health Plan Report Cards on Managed Care Enrollment. Journal of Health Economics 2002;21(1):19-42.

Time Series. The time series design addresses the important issue of underlying trends. In this approach, evaluators capture information on trends underway by making multiple observations before the intervention is implemented. They then make one or more observations after the intervention is implemented, and conduct an analysis to establish the trend and test whether the VBP activity caused a deviation from the trend.

[Figure 6]

Most time series analyses compare aggregate data (usually in the form of some proportion or rate) over time. The unit of time represented by each observation will vary across evaluations, depending upon available data and the type of intervention being evaluated. Within a single evaluation, however, the units of time (whether years, quarters, months, or weeks) should be the same for all observation points in the time series.

The basic specification for this analytical approach assumes that the VBP activity affects the level of the outcome and that this effect persists over time. It also assumes that the VBP activity does not alter the trend. Modified specifications could allow the VBP activity to affect the trend and the level, and even more complex specifications could test for persistence of the effect.
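Written out, the basic and modified specifications might take the following form (the notation is ours; the guide does not give formulas). Here Y_t is the outcome at time t, I_t equals 1 in post-intervention periods and 0 otherwise, and t_0 is the intervention date.

```latex
% Basic specification: the VBP activity shifts the level but not the trend.
Y_t = \beta_0 + \beta_1 t + \beta_2 I_t + \varepsilon_t

% Modified specification: the activity may shift both the level and the
% trend, via an interaction with time elapsed since the intervention date.
Y_t = \beta_0 + \beta_1 t + \beta_2 I_t + \beta_3 (t - t_0) I_t + \varepsilon_t
```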

Advantages of This Approach. The strength of this design is its ability to establish whether or not a change in the outcomes being measured is the result of a trend already underway or the intervention under investigation. This approach can be contrasted with a pre-test/post-test design, which cannot reveal whether the single observation after an intervention is the continuation of a trend.

Drawbacks of This Approach. The extent to which this design adequately controls for external time trends depends on the number of periods observed prior to the intervention (and to a lesser extent after the intervention) and the stability of the trend. In addition, any external historical factor or exposure that occurs contemporaneously with the VBP intervention will confound the results. Another weakness of this design is that it requires that data be gathered or estimated in the same way and available over multiple periods of time, including a significant time period before the intervention.

Example: Time Series

Use of a Control Chart To Begin and Maintain an Asthma Disease Management Initiative

Health Care Provider. Allegiance L.L.C., a physician-hospital organization (PHO) in Ann Arbor, MI.

Description of the Research Activity. This PHO contracted with two different HMOs to assume complete financial risk for the expenditures of a managed care population. As part of its effort to improve care management, the PHO established a goal of reducing hospitalizations and emergency room visits due to asthma. Beginning in 1995, it initiated a number of interventions, including grand rounds, newsletters, semi-annual feedback reports to primary care physicians listing their patients who might benefit from use of a steroid inhaler (based on pharmacy refill data), and peer pressure from physician leaders on colleagues with low rates of steroid inhaler utilization.

Despite these initiatives, little consistent progress was made on any asthma-related metric. Still concerned about improving asthma care, the physicians and hospital partners approved funding for an asthma nurse position (one FTE) that began in June 1999. The justification was that some of the $500,000 spent annually on asthma-related hospitalizations could be reduced through increased use of steroid inhalers (Donohue et al., 1997), which would result from patient and physician detailing by the asthma nurse. Other interventions initiated around June 1999 included a monthly feedback report to physicians and supplemental academic detailing by several utilization management nurses. The purpose of the evaluation was to assess the impact of this care improvement initiative.

Evaluators. Analysis of the asthma initiative is done internally. The executive committee of the PHO, subsequently referred to as the "decisionmakers," examines the analytic evidence when approving each year's budget.

Research Design. The evaluators used a time series design relying on control charts to track relevant outcomes in the months before and after the new initiative.

Data and Measures. The measures used were:

  1. The percentage of bronchodilator patients (those taking three or more canisters in a 6-month period) who were also taking a steroid inhaler (according to pharmacy claims data).
  2. The percentage of asthma patients visiting an emergency room or hospitalized (according to medical claims data).

Methods. The evaluators used control charts to track progress on the measures. The mean of the data points and the upper and lower control limits (3 sigma) are represented by horizontal lines on each chart.
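For a measure expressed as a monthly proportion, the center line and 3-sigma limits can be computed as in the sketch below (a standard p-chart; the counts are hypothetical, and the PHO's actual charting method may differ).

```python
# p-chart: flag months whose proportion falls above mean + 3 sigma.
import numpy as np

# Monthly counts: bronchodilator patients also on a steroid inhaler / total
on_steroid = np.array([52, 49, 55, 60, 63, 66, 70, 72])
total = np.array([140, 138, 142, 141, 139, 140, 143, 142])
p = on_steroid / total

p_bar = on_steroid.sum() / total.sum()         # center line
sigma = np.sqrt(p_bar * (1 - p_bar) / total)   # per-month standard error
ucl = p_bar + 3 * sigma                        # upper control limit
lcl = np.clip(p_bar - 3 * sigma, 0, None)      # lower control limit

for month, (pi, u, l) in enumerate(zip(p, ucl, lcl), start=1):
    flag = "  <-- above UCL" if pi > u else ""
    print(f"Month {month}: p = {pi:.3f}  (limits {l:.3f}-{u:.3f}){flag}")
```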

Results. Thus far, there is no demonstrable improvement in the number of hospitalizations or emergency room visits. However, the proportion of bronchodilator patients on steroid inhalers increased, coincident with the staffing of the asthma nurse and other interventions begun in June 1999. (The control chart below displays a shift in data points, including several above the upper control limit, which indicates statistically significant changes in the delivery process.) This temporal improvement was sufficient to convince the decisionmakers to continue funding the nurse position despite considerable downsizing in the organization.

Advantages and Disadvantages of the Evaluation Strategy. Use of control charts permitted simple yet frequent assessments. But the lack of a concurrent comparison group weakened the argument for causality. This method also makes it difficult to determine which of several simultaneous interventions had the biggest impact.

Source: Allegiance L.L.C.



Current as of May 2002
Internet Citation: Evaluating the Impact of Value-Based Purchasing: A Guide for Purchasers. May 2002. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/professionals/quality-patient-safety/quality-resources/value/valuebased/evalvbp4.html