Monitoring and Evaluating Medicaid Fee-for-Service Care Management Programs

Chapter 3. Evaluation

The choice of evaluation design has implications for many aspects of the evaluation and should be considered carefully.28-30 The following paragraphs discuss the basic components of the evaluation.

Evaluation Action Steps

  1. Select a reference group or groups.
  2. Structure the evaluation.
  3. Select analytic methods.
  4. Identify and address potential confounding factors.
  5. Select measures.
  6. Identify and address data issues.
  7. Consider sample size.

Action Step 1: Select a Reference Group or Groups

A reference (or control) group is an equivalent comparison group that was not subject to CM (the intervention). It provides a basis for comparison with the intervention group (people receiving CM) and allows for an assessment of what the effect of a program has been. The use of an appropriate reference group is an essential part of a credible evaluation. Table 2 provides a description of three different types of reference groups.

How a reference group is selected will depend on the design of the CM program. For example, it is much harder to select a reference group when CM is applied to a select group of individuals who volunteer for CM (e.g., in an opt-in program). This is true because people who volunteer may be different in important ways from people who do not volunteer.

Another factor to consider is the availability of certain types of data for analysis, which may be influenced by the type of reference group selected.

Reference groups can be selected purposefully before the CM program is implemented (prospective controls), or the evaluation may have to look back and try to find an appropriate reference group after the program is implemented (retrospective controls). The reference group needs to be very similar or equivalent to the intervention group. This reference group can either be pulled from the potential CM population, or a separate group can be identified that is as similar to the CM group as possible. In the absence of an independent reference group, a pre/post analysis can be used that compares outcomes for a group during CM with outcomes for the same group in the period before CM. This approach is valid only if the before period, when projected forward, is an accurate prediction of what would have happened in the absence of CM.31

Table 2. Types of reference groups

Reference groups | Description | Advantages | Challenges
Randomized control group | Participants randomly assigned to treatment and reference groups | Gold standard; most rigorous | May pose ethical and/or political concerns; does not protect against the impact of changes in provider practice
Staged implementation | Program rolled out in certain areas before others | May be more feasible than a randomized control group; some differences can be controlled for through statistical analyses | Must ensure the population groups in the roll-out areas are similar to the groups in other areas, and that certain policies do not affect one area differently from another
Matched control | Reference group selected that is as similar as possible to treatment group | May be more feasible than a randomized control group; some differences can be controlled for through statistical analyses | Must consider all factors that may affect the outcome independent of the intervention

Although a reference group is essential for evaluating the impact of the CM program, it may not be necessary for program monitoring. Because of the expense and difficulty of maintaining a reference group for long periods of time, before/after comparisons may be better suited to assess process measures related directly to the intervention as part of program monitoring; however, these comparisons can be misleading because of regression to the mean, selection bias, and other factors discussed below.32

Randomized Controlled Trials

The randomized controlled trial (RCT) is considered the gold standard of evaluation designs, and it provides the most definitive results. However, an RCT may not be feasible or practical in all instances.

In an RCT, participants are randomly assigned to treatment and reference or control groups. The random assignment helps decrease the possibility that the treatment and reference groups are systematically different or nonequivalent. Since the treatment and reference groups are pulled at random from the same population, any differences in outcomes between the groups are assumed to be due to differences in the receipt of the intervention (in this case, CM). In addition, since the treatment and reference groups are operating concurrently, the design protects against external factors or evolution of treatments that translate into differences in treatment patterns over time (i.e., these "co-interventions" become available to both treatment and reference groups equally).

A randomized control group can be used in voluntary CM programs, if people are randomized to CM and reference groups after they volunteer. This eliminates any selection bias caused by differences between volunteers and those who don't volunteer for CM, but this method is not without political risk.

For programs that are designed to change physician behavior, RCTs do not automatically protect against the influence on care that may occur as physicians and other providers learn from the intervention and begin to offer a different (higher) level of care, not only to participants in the treatment group, but also to control patients. This problem can be reduced by assigning participants to the treatment and reference groups in clusters, based on the identity of their primary care physician, so that all of a physician's patients fall in the same group. Subtleties in the design and evaluation of RCTs are typically unfamiliar to Medicaid personnel, so consultation with regional (often university-based) experts is important to consider prior to deciding on a particular study design.
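
The physician-level assignment described above can be sketched in a few lines. This is only an illustration, assuming a simple mapping from member IDs to primary care physicians; the function name, the member and physician identifiers, and the 50/50 split are hypothetical and not drawn from any State's evaluation.

```python
import random

def assign_by_physician(member_to_pcp, seed=0):
    """Randomize whole physician panels so that every patient of a given
    physician lands in the same arm (hypothetical helper)."""
    rng = random.Random(seed)                      # fixed seed keeps the assignment reproducible
    physicians = sorted(set(member_to_pcp.values()))
    rng.shuffle(physicians)
    treated = set(physicians[: len(physicians) // 2])
    return {
        member: ("CM" if pcp in treated else "reference")
        for member, pcp in member_to_pcp.items()
    }

# Hypothetical members and physicians, for illustration only.
members = {"m1": "dr_a", "m2": "dr_a", "m3": "dr_b",
           "m4": "dr_b", "m5": "dr_c", "m6": "dr_d"}
arms = assign_by_physician(members, seed=42)
for member, arm in sorted(arms.items()):
    print(member, members[member], arm)
```

Because the unit of randomization is the physician rather than the patient, the analysis would also need to account for clustering, which is one of the design subtleties where expert consultation is useful.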

As previously noted, many Medicaid programs worry that an RCT may pose ethical or political problems because some of the potentially eligible population would receive the treatment sooner than others. (The reference group may have volunteered, but will not receive the CM services.) For many Medicaid programs, this will require a CMS waiver of State-wideness that will allow them to implement a program for only a subset of eligible people. It will also require the political will to ask for participation in a program where only half of the subjects will receive the intervention. It is important to remember, however, that the effectiveness of CM is still unproven, and the potential ethical concerns are not the same as those raised by withholding a service with established efficacy.

Conversely, a decision to use a less rigorous evaluation approach could be considered more of an ethical dilemma if it leads to inaccurate conclusions that a program is saving Medicaid dollars when, in fact, it increased spending and had only modest effects on quality of care and the health of members. If such a finding diminishes enthusiasm to look for opportunities to improve CM further, it could be a detriment to the future health of the State. Moreover, controls in an RCT are typically offered the current standard of care delivery (i.e., no ongoing care is withheld) until some future point in time after which the State will plan to offer CM to all eligible and interested members, should it prove effective in enhancing care. On a more practical level, some programs have managed to solve this problem by explaining to potential participants that there is insufficient capacity to take all interested parties at once, and the program will be phased-in over time.

Lessons from the Field

Indiana Medicaid chose to use an RCT in conjunction with an observational analysis of staggered implementation with repeated measures. Most CM evaluations have used observational designs in which the evaluators have no control over who receives the CM intervention and who does not. An observational design was used in the central, northern, and southern regions of the State. The RCT was used in two large urban group practices where enrollees' start dates were randomly staggered according to clinic sites. The RCT was intended to help identify and measure potential biases that might have impacted results obtained from the observational design.16

Alternatives to RCTs

Even when an RCT is not practical, there are other effective means (e.g., staged implementation and matched controls) of including reference groups that, if appropriately designed, can strengthen your evaluation and be more feasible to implement. Although these alternatives are more subject to selection bias and other limitations than RCTs, they still can be helpful in identifying and isolating program effects and offer advantages over actuarial adjustment alone.

In a staged implementation, as in Indiana, where a program rolls out in some communities before it goes Statewide, it may be possible to use individuals as a reference group who may be eligible for the CM program but are located in an area where it is not yet available. Ideally, there will be no important differences between this reference population and the population receiving CM services other than the geographic area. However, it will be necessary to look at prior year trends to make sure the two groups were similar before the CM program began. You also will need to carefully consider whether there are any policies in effect during the implementation period that could impact costs and quality differently for the reference group. Some differences can be controlled for via statistical analysis.

In other cases, it may be necessary to identify and select a separate reference group. A matched control is selected to be similar to the group receiving CM services in as many ways as possible. Matched controls may be selected to be similar on the basis of demographic characteristics (age, sex, socioeconomic status) and disease state. In particular, factors such as use of health care services in the prior year and health habits are important to consider. In general, all factors that may affect outcomes independent of CM should be considered in selecting the reference group. As in a staged implementation, if the sample cannot be matched on all characteristics, some differences can be controlled for in a statistical analysis. However, the less reliance there is on statistical modeling of this type, the more compelling and robust your estimates will be.
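
One way to picture matched-control selection is a greedy nearest-neighbor match, sketched here under stated assumptions: exact matching on sex and condition, closeness on age and prior-year cost, and scaling constants chosen arbitrarily. The field names, the scaling values, and the sample records are all hypothetical; real evaluations often use more formal methods chosen with a statistician.

```python
def match_controls(cm_members, candidates, age_scale=10.0, cost_scale=1000.0):
    """Greedy 1:1 matching without replacement (illustrative only).

    Requires the same sex and condition, then picks the candidate closest
    on age and prior-year cost. The scales are arbitrary assumptions.
    """
    available = list(candidates)
    pairs = []
    for person in cm_members:
        eligible = [c for c in available
                    if c["sex"] == person["sex"]
                    and c["condition"] == person["condition"]]
        if not eligible:
            continue  # no acceptable match; the CM member is left unmatched

        def distance(c):
            return (abs(c["age"] - person["age"]) / age_scale
                    + abs(c["prior_cost"] - person["prior_cost"]) / cost_scale)

        best = min(eligible, key=distance)
        available.remove(best)          # each reference member is used once
        pairs.append((person["id"], best["id"]))
    return pairs

# Hypothetical records, for illustration only.
cm_group = [
    {"id": "cm1", "sex": "F", "condition": "asthma", "age": 34, "prior_cost": 5200},
    {"id": "cm2", "sex": "M", "condition": "diabetes", "age": 58, "prior_cost": 9100},
]
pool = [
    {"id": "r1", "sex": "F", "condition": "asthma", "age": 36, "prior_cost": 4800},
    {"id": "r2", "sex": "F", "condition": "asthma", "age": 61, "prior_cost": 5100},
    {"id": "r3", "sex": "M", "condition": "diabetes", "age": 55, "prior_cost": 9900},
    {"id": "r4", "sex": "M", "condition": "asthma", "age": 58, "prior_cost": 9000},
]
pairs = match_controls(cm_group, pool)
print(pairs)
```

Any characteristics that cannot be matched this way would still need to be handled in the statistical analysis, as the text notes.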

Lessons from the Field

North Carolina Medicaid, with the Cecil G. Sheps Center for Health Services Research at the University of North Carolina, conducted an evaluation using a matched control. The study compared the costs and utilization of Medicaid enrollees with asthma and diabetes in the CM program to enrollees with the same conditions in ACCESS, the State's traditional PCCM program. Because there were significant differences in the ages of enrollees with asthma between the two programs, the evaluators used age-adjustment throughout.33

Action Step 2: Structure the Evaluation

There are two main structures you may consider when designing your evaluation: cross-sectional evaluations and longitudinal evaluations.

Cross-Sectional Evaluation

A cross-sectional evaluation is done at a single point in time, presumably after the CM program has been implemented. In order to make the case that the CM program had an impact, it is necessary to compare those who received CM with a group of people who did not receive CM (i.e., the reference group). Ideally, the reference group will be as similar as possible to the CM (or intervention) group. It is often a challenge in cross-sectional evaluation designs to make sure that the reference group is comparable, and analyses often need to statistically control for potential sources of differences.

Your evaluator may look to how programs were implemented: if a program was implemented partially (in only one region of the State) or in a staged manner, the evaluator may have the convenience of a ready-made reference group, provided that beneficiaries in the program and reference group had similar prior utilization and cost patterns. In fact, whenever claims data are available for the period of time prior to the implementation of CM, it is advisable to test out the analytic approach and model using these data. You may want to raise this idea with evaluators to determine if it is feasible, as it will strengthen confidence in the final analytic results.

Another challenge in cross-sectional evaluation designs is to make sure that the data to be used in the evaluation are comparable between the CM and reference groups. Since program data are often used as part of the evaluation, obtaining comparable data on the reference group may be a challenge. Remember to take care to ensure that differences in findings are not due to data differences. This is a critical, and difficult to address, issue.

Lessons from the Field

The North Carolina evaluation conducted by the Sheps Center at the University of North Carolina used a cross-sectional evaluation design. The evaluation compared enrollees with asthma and diabetes receiving CM to a similar group of enrollees in the State's PCCM program who were not receiving CM. The two groups were compared over a single period (2000-2002). The evaluators underscored the importance of adjusting for differences in the enrolled populations.33

Regression to the Mean
The phenomenon known as "regression to the mean" is a particular challenge to comparability in pre-post designs (before/after comparisons) when the criterion for eligibility for CM is high medical costs. Since a group of people who all have high medical costs in one year will tend to have average costs that are considerably lower in the following year, it is often a challenge to separate out differences that may have been caused by the CM program from other factors—such as high quality health care in the community, the natural history of the disease, or random fluctuation between one year and another—that together are often called "regression to the mean." The use of an equivalent reference group can help you separate regression to the mean from true program effects.
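
A small simulation can make regression to the mean concrete. The sketch below assumes each person has a stable "usual" annual cost plus independent year-to-year noise, and that nobody receives any intervention; all dollar figures and the 10 percent selection threshold are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(2007)  # fixed seed so the illustration is reproducible

# Each simulated person has a stable usual cost plus random yearly variation.
# There is NO intervention anywhere in this simulation.
people = []
for _ in range(10_000):
    usual = rng.gauss(5000, 1500)
    year1 = usual + rng.gauss(0, 3000)
    year2 = usual + rng.gauss(0, 3000)
    people.append((year1, year2))

# Select the top 10 percent by year-1 cost, as a high-cost eligibility rule would.
people.sort(key=lambda p: p[0], reverse=True)
high_cost = people[:1000]

year1_avg = mean(p[0] for p in high_cost)
year2_avg = mean(p[1] for p in high_cost)
print(f"Year 1 average for selected group: {year1_avg:,.0f}")
print(f"Year 2 average for selected group: {year2_avg:,.0f}")
```

The selected group's average falls in the second year even though nothing was done for them, which is why a pre-post comparison on a high-cost group can overstate savings and why an equivalent reference group selected the same way is valuable.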

Longitudinal Evaluations

Longitudinal evaluations evaluate differences before and after implementation of a CM program. The evaluation may include data from only one point in time before the implementation and one point after implementation, or it may follow participants through several stages post-implementation.

Longitudinal evaluations also look to compare the CM group with a reference group. In a simple pre-post design, the sample is used as its own reference group, and characteristics of the sample before and after are compared. While a pre-post design ensures similarity between the treatment and reference groups, there are potential confounding factors that must be considered. For example, if the standards of care have changed since the CM program was implemented, it will be important to separate out changes that may be due to CM from changes that may be due to changing standards in care (such as new guidelines or the introduction of new drugs or treatments), or simply the aging of the population.

Lessons from the Field

The Disease Management Association of America (DMAA) evaluation guidelines reinforce the importance of transparency in evaluation methodologies. CM program evaluators should be able to clearly explain not only the methods used but also the impact these methods have on the interpretation of results. For example, including a discussion of the limitations of a pre-post design will help others to interpret the findings with regard to both the strengths and weaknesses of the study. The DMAA acknowledges the challenge of striking a balance between rigorous methods and a feasible evaluation design. Acknowledging how you have dealt with this challenge will help in understanding the evaluation results.34

Washington and Pennsylvania used a longitudinal design (pre-post analysis) to evaluate cost savings in their CM programs. They compared the costs associated with the population targeted for CM during the measurement year to a baseline reference group of individuals who met the criteria for CM in the year prior to implementation.12,22

In a cohort design, both reference and intervention groups are followed over time. Any differences between the two groups are presumed to be a result of CM, since both groups may be subject to the same environmental pressures, such as changes in the standards of care over time and the possibility of regression to the mean. Also, inherent differences between the two groups that persist can be eliminated by comparing changes rather than absolute levels. Table 3 provides a comparison of different evaluation design options.

Table 3. Comparing evaluation designs

Design | Reference group | Bias | Confounding | Validity | Ability to generalize
RCT | Randomly selected "eligibles" | Very low | Very low | Very high | Low to high
Quasi-RCT | Nonrandomly selected "eligibles" | Low | Low | High | Low to high
Cohort | Naturally excluded "chronics" | Moderate | Moderate | Moderate | High
Cohort | Naturally excluded "nonchronics" | High | High | Low to moderate | High
Pre-Post | Intervention group in an earlier time period | Low | Moderate | Moderate | Moderate
Actuarial | Predicted cost trends of "nonchronics" | Very high | Very high | Low | High

Source: Adapted from a presentation by Ackerman RT, to the AHRQ Learning Workshop, Nov 2, 2006.
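
The cohort design's idea of comparing changes rather than absolute levels is often called a difference-in-differences comparison. The following is a minimal sketch with made-up per-member monthly costs; the function name and all figures are hypothetical.

```python
from statistics import mean

def diff_in_diff(cm_pre, cm_post, ref_pre, ref_post):
    """Change in the CM group minus change in the reference group."""
    cm_change = mean(cm_post) - mean(cm_pre)
    ref_change = mean(ref_post) - mean(ref_pre)
    return cm_change - ref_change

# Hypothetical monthly costs per member, before and after implementation.
estimate = diff_in_diff(
    cm_pre=[900, 1100, 1000],
    cm_post=[800, 950, 950],
    ref_pre=[700, 800, 750],
    ref_post=[720, 790, 740],
)
print(estimate)  # negative values mean costs fell more in the CM group
```

In this invented example the CM group starts at a higher cost level than the reference group, but that persistent difference drops out because only the changes are compared.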

Action Step 3: Select Analytic Methods

The analytic methods used to conduct an evaluation are also important and should be considered up front. Most economic evaluations use statistical methods to assess differences between CM and reference groups on measures of interest. These statistical methods estimate the costs of individuals receiving CM compared with individuals in the reference group, controlling for other factors that are considered important to the outcome. An alternative is to use actuarial methods, which project expenditures for groups of individuals adjusting for factors that might impact cost trends.

The evaluation strategy should specify whether total costs or disease-specific costs are being addressed. Because of the challenges associated with parsing out health care costs associated with a particular disease, and because of a high prevalence of comorbidities in the CM population, most evaluations examine total costs.

When estimating effects on costs, the unit of analysis should be the individual's cost per month observed throughout the year ("per member per month" analysis). The cost per month should be weighted by the proportion of the year they are observed (i.e., enrolled in Medicaid). This will allow the evaluator to adjust for differences in the CM and reference groups in the proportion of people who leave the program during the year, including those who die or are otherwise lost to observation. This is especially important in a Medicaid population, where people move in and out of Medicaid eligibility. If there appear to be differences in mortality between the CM and reference groups, this may suggest that the two groups are not very comparable. It is highly unlikely that CM would have a significant effect on mortality within a year or two. In addition, since people who die are much more expensive during their last year of life, a difference in mortality between the two groups could lead to significant differences in costs.
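
As a rough illustration of the weighting described above, the sketch below computes a per member per month (PMPM) cost in which each person's monthly cost is weighted by the share of the year they were enrolled. The field names and dollar amounts are hypothetical.

```python
def weighted_pmpm(members):
    """Average monthly cost, weighting each member by the proportion of the
    year observed (months enrolled / 12)."""
    weights = [m["months_enrolled"] / 12 for m in members]
    monthly = [m["total_cost"] / m["months_enrolled"] for m in members]
    return sum(w * c for w, c in zip(weights, monthly)) / sum(weights)

# Hypothetical members: one enrolled all year, two enrolled for part of it.
cm_group = [
    {"total_cost": 12000, "months_enrolled": 12},
    {"total_cost": 3000, "months_enrolled": 3},
    {"total_cost": 2400, "months_enrolled": 6},
]

pmpm = weighted_pmpm(cm_group)
unweighted = sum(m["total_cost"] / m["months_enrolled"] for m in cm_group) / len(cm_group)
print(f"Weighted PMPM:   {pmpm:.2f}")
print(f"Unweighted mean: {unweighted:.2f}")
```

The weighted figure is the same as total dollars divided by total member months, so people observed for only a few months do not count as much as people observed all year.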

Another issue to consider is how outliers will be handled. Outliers are extremely expensive cases that have the potential to skew the results; in some cases it may be reasonable to truncate the expenditures for outliers. However, outliers do impact the total costs to Medicaid, and it is possible that a CM program could impact the number of outliers. For example, if CM is expected to reduce the number of outlier cases by rationalizing care, then truncating outliers may mask the potential impact of CM.
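
The tradeoff in truncating outliers can be seen with a toy example. This sketch caps each person's annual cost at a threshold and also reports how many cases exceed it, since that count is itself something CM might change. The cap of 100,000 and the cost values are arbitrary assumptions.

```python
from statistics import mean

def truncate(costs, cap):
    """Replace any cost above the cap with the cap itself."""
    return [min(c, cap) for c in costs]

# Hypothetical annual costs; one extremely expensive case.
costs = [1200, 900, 1500, 800, 250000]
cap = 100000  # arbitrary threshold chosen for illustration

raw_mean = mean(costs)
capped_mean = mean(truncate(costs, cap))
outlier_count = sum(c > cap for c in costs)

print(raw_mean, capped_mean, outlier_count)
```

Reporting results both with and without truncation, along with the number of outlier cases in each group, is one way to avoid masking an effect of CM on the outliers themselves.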

Action Step 4: Identify and Address Potential Confounding

In spite of the best efforts to select reference groups that are similar, sometimes there are confounding factors that may impact the outcomes. Confounding factors can be differences between the CM group and the reference group, environmental factors, or other obstacles. It is important to select a reference group that minimizes the potential for confounding and to identify any potential confounding factors and control for them in statistical analyses whenever possible.

Your statistical analyses could include risk stratification, where the CM and reference groups are further divided according to disease status, and separate comparisons are made for people in different disease states. Other multivariate statistical methods, such as multivariate regression analysis, could also be used to control for the effect of differences in population characteristics or to estimate the impact of environmental factors that could affect outcomes. Remember, most efforts to control for potential confounders rely on the availability of data about these other factors. A careful evaluation should consider its limitations, which will include the extent to which possible unmeasured confounders may have impacted the results.
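
Risk stratification, as described above, amounts to making the CM versus reference comparison separately within each disease-severity group. A minimal sketch follows; the stratum labels, field names, and costs are hypothetical, and a real analysis would also report uncertainty around each difference.

```python
from collections import defaultdict
from statistics import mean

def stratified_differences(records):
    """Mean cost difference (CM minus reference) within each stratum."""
    costs = defaultdict(list)
    for r in records:
        costs[(r["stratum"], r["group"])].append(r["cost"])

    differences = {}
    for stratum in sorted({s for s, _ in costs}):
        cm = costs.get((stratum, "CM"))
        ref = costs.get((stratum, "reference"))
        if cm and ref:  # skip strata that lack one of the two groups
            differences[stratum] = mean(cm) - mean(ref)
    return differences

# Hypothetical monthly costs by disease severity.
records = [
    {"stratum": "mild", "group": "CM", "cost": 400},
    {"stratum": "mild", "group": "CM", "cost": 600},
    {"stratum": "mild", "group": "reference", "cost": 550},
    {"stratum": "mild", "group": "reference", "cost": 650},
    {"stratum": "severe", "group": "CM", "cost": 2000},
    {"stratum": "severe", "group": "reference", "cost": 2500},
    {"stratum": "severe", "group": "reference", "cost": 2300},
]
differences = stratified_differences(records)
print(differences)
```

Each additional stratum splits the sample further, which is one reason stratified analyses need larger sample sizes.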

Action Step 5: Select Measures

CM is thought to achieve its effects through two mechanisms: improving or rationalizing the use of health care services, and reducing the likelihood of adverse events or preventing further decline in health (thus reducing the need for additional health care services). Your evaluation should consider both mechanisms, and measures should be chosen that link the goals and objectives of CM to potential outcomes. For example, quality measures should align with quality objectives and include both intermediate and long-term impacts, and financial measures should align with fiscal objectives.

Measures should also have the potential for change in the timeframe selected for the evaluation. Another important consideration in selecting measures is the availability of data to support the measures. The choice of evaluation design may influence the availability of data. (Go to Action Step 6.) Finally, feasibility is an important consideration. You should select measures that can be calculated with existing data and have demonstrated reliability and validity. In addition, measures that have been used in other studies and other populations, for which benchmarks exist, can add credibility to evaluation findings. Appendix 1 summarizes examples of different types of measures that you may want to consider.

Quality Measures

Quality measures can include measures of access, outcomes, patient experience (satisfaction), processes of care, and/or the structure of the care environment. In evaluating CM programs, it is best to include a mix of measures as there may be many factors other than CM that impact outcomes. In terms of timeline, changes in access, structure, and process measures may be easier to detect in the shorter term and may be easier to link back to the CM program.

AHRQ maintains a Web site with links to a large number of quality measures, many of which are appropriate for assessing the impact of CM (visit AHRQ's Quality Measures Clearinghouse at http://www.qualitymeasures.ahrq.gov/). In addition, CMS has recently released The Guide to Quality Measures: A Compendium.35 Some of these measures are readily available from administrative data. However, many involve new data collection or review of medical records, which can be expensive. It may be possible, however, to collect these data on a random subsample and obtain results that are likely to be very similar to those for the full population, resulting in substantial savings. Since new data collection is so expensive, it is unlikely that collecting baseline or prior-history data will be affordable or feasible.

Financial and Administrative Measures

Another set of goals for CM is the rationalization or reduction of expensive services and the reduction in health care spending. Appropriate measures include those of use (numbers of hospitalizations and lengths of stay, numbers of emergency room visits, numbers of physician visits) and expenditures for care. Many of these measures are readily available from claims data and other administrative databases. However, administrative and other claims data, like any data, are subject to issues regarding data reliability, quality, and completeness. Administrative data may not reliably capture the information of interest or may capture that information only for subsets of the population. In addition, data may not be available on a timely basis. Claims data in particular are subject to time lags, and data for more complicated care are often subject to a greater time lag in availability.

In assessing the financial impact of CM, it is important to identify financial expenditures for one-time program start-up costs, ongoing administrative costs, and medical costs. All of these costs are legitimate financial expenditures associated with CM. However, the decision to include or exclude some or all of these costs may vary depending on the questions being addressed. An alternative way to consider costs is to separate out fixed costs from variable costs. While fixed costs must be allocated across program participants, it may also be possible to allocate them across the expected life of the program (rather than in a single year) to more realistically distribute these costs.

In Indiana's evaluation, the State distinguished between one-time start-up costs (e.g., office equipment) and ongoing operational costs. The State divided ongoing operational costs into those that change with patient volume (variable costs like nurses' salaries and benefits) and those that do not change with patient volume (fixed costs like insurance).16 This allowed the State to frame results for policymakers that included traditional estimates based on total expenditures, as well as estimates that excluded one-time start-up costs, in the event that the State might perceive past one-time expenditures as "sunk" and, therefore, less germane to a decision about continuing to fund the program. Moreover, categorizing ongoing costs as fixed or variable allowed the State to project the impact of hypothetical changes in member volume (i.e., expanded reach of the program beyond the ramp-up period) on future cost-effectiveness estimates.
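
The kind of projection that a fixed/variable split makes possible can be sketched as follows. This is not Indiana's model; the function, the cost figures, and the choice to spread start-up costs evenly over an assumed program life are all hypothetical.

```python
def program_cost(members, fixed_cost, variable_cost_per_member,
                 startup_cost=0.0, program_years=1):
    """Annual program cost and cost per member (illustrative only).

    Start-up costs are spread evenly over the assumed life of the program;
    pass startup_cost=0 to treat them as sunk.
    """
    total = (fixed_cost
             + variable_cost_per_member * members
             + startup_cost / program_years)
    return total, total / members

# Hypothetical figures: 200,000 fixed, 300 per member, 100,000 start-up over 5 years.
current_total, current_per_member = program_cost(1000, 200000, 300, 100000, 5)
expanded_total, expanded_per_member = program_cost(2000, 200000, 300, 100000, 5)
sunk_total, sunk_per_member = program_cost(1000, 200000, 300)  # start-up excluded

print(current_per_member, expanded_per_member, sunk_per_member)
```

In this invented example, doubling member volume lowers the cost per member because the fixed and start-up costs are shared across more people, while the variable cost per member stays the same.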

You also may identify measures to evaluate the process of implementing and operating CM programs. These program process measures may include number of clients per care manager or number of contacts per patient.

Lessons from the Field

States often begin their selection of measures by looking to nationally accepted measures and metrics that other States have used. Texas identified a group of core measures and associated performance corridors. In particular, they identified a group of measures from which they felt they could measure cost-savings. They also developed a supplemental list of measures they believed would be good for additional benchmarking and program monitoring.13

North Carolina partnered with a mini-collaborative of clinicians to select their measures.21 They began by reviewing national clinical practice guidelines, particularly the National Institutes of Health (NIH) asthma guidelines and the American Diabetes Association (ADA) clinical practice recommendations.37 Once they had reviewed clinical guidelines and national measures, North Carolina used several important criteria to guide their measure selection process, namely:

  • Identify measures associated with evidence-based best practices.
  • Measure interventions that have a clinical impact.
  • Choose measures for which data are available.
  • Ensure measures are appropriate for the population (e.g., consider tailoring measures for a pediatric population, identify which continuous eligibility criteria are appropriate).
  • Coordinate measure selection with measurement by other purchasers in the market.

Pennsylvania operates ACCESS Plus, an enhanced PCCM program with a CM component, in rural regions of the State, and HealthChoices, a mandatory managed care program, in urban areas of the State. Pennsylvania officials chose a group of Health Plan Employer Data and Information Set (HEDIS) measures that they were already using in their HealthChoices program so that they could draw comparisons between the two programs.22

Action Step 6: Identify and Address Data Issues

Identifying and obtaining reliable and valid data may be one of the most challenging aspects of evaluation and one of the reasons why it is so important to plan the evaluation early—while it is possible to identify and tailor data for the analyses. While administrative data sources are often preferred because of their low cost and availability, there still are challenges to consider.

Data Reliability

Ensuring data reliability is important to the validity of the evaluation results. Once you have taken inventory of available data, you will need to examine whether the data are reliable. In particular, you will need to consider the following questions with respect to the data:

  • Are the data complete?
  • How much data run-out time is needed to ensure you have received a complete data set?
  • How much "data scrubbing" is needed to clean up the data file(s)?
  • Have the data been validated and/or reconciled?
  • Do you need to merge certain data sets to give a complete picture?
  • Are there artifacts in the data related to program modifications, policy changes, or data reporting anomalies?

Baseline data. In cases where CM evaluations are designed after the intervention has been implemented, obtaining adequate and comparable data for the time period before the intervention can be a challenge. In many cases, the intervention results in the capture of new data. However, the lack of these data in the pre-implementation phase can be problematic, since it will not be possible to determine whether changes occurred as a result of CM, unless an RCT design is being used. (In that case, any difference in outcomes between the two groups is assumed to be due to the intervention.)

Comparable data. Obtaining adequate and comparable data for the reference group can also be a challenge. For example, States that compare their CM population to a reference group in another delivery system (e.g., managed care) must ensure that coding practices in their fee-for-service claims data are comparable to coding in their managed care encounter data. Comparable data are critical; without comparability it will not be possible to attribute observed differences to the CM program.

Intention to treat analysis. If you use a controlled design, it will be necessary to use the same measures to compare the reference group to the intervention group. For this reason, an intention to treat analysis should also be used when defining the comparison groups. Intention to treat means all target enrollees in the intervention period are included regardless of whether or not they received the complete intervention. The same criteria are also applied to the reference group. By using this analysis, you can also account for the extent to which the CM program was successful in attracting and retaining participants.
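
To illustrate the difference an intention to treat analysis makes, the sketch below summarizes everyone who was targeted for CM, whether or not they engaged, and contrasts that with a summary limited to those who engaged. The field names and values are hypothetical.

```python
from statistics import mean

def itt_summary(targeted):
    """Summarize all targeted enrollees, regardless of engagement."""
    engaged = [p for p in targeted if p["engaged"]]
    return {
        "itt_mean_cost": mean(p["cost"] for p in targeted),
        "engaged_only_mean_cost": mean(p["cost"] for p in engaged),
        "engagement_rate": len(engaged) / len(targeted),
    }

# Hypothetical enrollees targeted for CM; half never engaged with the program.
targeted = [
    {"cost": 1000, "engaged": True},
    {"cost": 1200, "engaged": True},
    {"cost": 2000, "engaged": False},
    {"cost": 1800, "engaged": False},
]
summary = itt_summary(targeted)
print(summary)
```

In this invented example, looking only at engaged members would make costs appear lower than they are for the targeted population as a whole, and the engagement rate shows how much of the target group the program actually reached.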

Action Step 7: Consider Sample Size

It is important to conduct an evaluation that has the potential to provide credible evidence of results. An important factor in assuring credibility is having an adequate sample size with the power to detect statistically significant differences that result from CM. The sample size needed in an evaluation is directly related to a number of the design factors that have been discussed earlier.

The outcome measures chosen and the expected differences both within and between the CM and reference groups will in large part determine the sample size needed for the evaluation; a power analysis will provide estimates of the sample sizes needed to obtain results.b

The choice of research design and statistical methods used will also determine sample size. Longitudinal designs will frequently require larger sample sizes, since some of the sample will be lost to attrition over time. Complex statistical methods, such as risk stratification, will also require larger sample sizes.
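
A power analysis for a simple comparison of average costs between two groups can be approximated with a standard normal-approximation formula, n per group = 2 × (z for alpha + z for power)² × (standard deviation / difference)². The sketch below applies it and then inflates the result for expected attrition. The expected difference, standard deviation, and attrition rate are invented inputs; actual values would come from prior data, and more complex designs call for a statistician's help.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(difference, std_dev, alpha=0.05, power=0.80,
                          attrition=0.0):
    """Approximate number of members needed in EACH group to detect a given
    difference in means with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n_analyzed = 2 * ((z_alpha + z_power) * std_dev / difference) ** 2
    # Enroll extra members so enough remain after expected losses.
    return ceil(n_analyzed / (1 - attrition))

# Hypothetical inputs: detect a 50-dollar PMPM difference, standard deviation 400.
n_basic = sample_size_per_group(difference=50, std_dev=400)
n_with_loss = sample_size_per_group(difference=50, std_dev=400, attrition=0.20)
n_small_effect = sample_size_per_group(difference=25, std_dev=400)

print(n_basic, n_with_loss, n_small_effect)
```

With these assumed inputs the requirement is on the order of a thousand members per group; halving the difference to be detected roughly quadruples it, and expected attrition raises it further.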

Drawing adequate samples of Medicaid enrollees can be challenging. You may find that the pool from which you want to draw your sample is too small as a result of a variety of factors, such as attrition, loss to follow-up, and difficulty contacting enrollees.


b. A power analysis is used to identify the necessary sample size to see a statistically significant result, given estimates about differences between the experimental and control groups and the variance in responses.


Current as of November 2007
Internet Citation: Monitoring and Evaluating Medicaid Fee-for-Service Care Management Programs. November 2007. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/research/findings/final-reports/medicaid-ffs/medicaidffs3.html