February 24, 2010: Morning Session (continued)
Ms. Wachino: The only thing I'd add to that, and it's a smaller subpoint, and I probably should have mentioned it earlier, is that the State reporting on the initial core measure set is voluntary for the next several years, which I think is really important—because I think it's important for us to learn where States are and what they're able to do easily and what's hard, moving gradually over time to consistent national reporting. And I am glad that CHIPRA gives us the early experience of working with States who are volunteers and the time to really learn what works and what doesn't as we move forward towards consistency because as I said earlier, it's very, very tough.
Dr. Dougherty: Hello, who's there?
Dr. Miller: Oh, hi, I'm sorry. This is Marlene Miller.
Dr. Dougherty: Oh, hi Marlene. Thanks.
Dr. Miller: I was a little bit late starting out, but I also wanted—I signed on a few minutes ago, but the last speaker I could barely hear.
Dr. Dougherty: Okay. You mean during the comment, during the Q&A period, or while she was speaking at the mic? Which you don't know where she is, so that's hard for you to answer. Okay. Can you make sure your mic is on, Vicki? You're not coming through as clearly as Barbara. Okay. Well thank you for joining us, Marlene. So, I think one message for this group today, even though these are voluntary and may eventually get to State comparability, I think States will compare with each other if they want to, and I think it behooves us as a quality measurement scientific group to provide as much consistency in the measures as we can so that the States just don't willy-nilly have to develop their own measures, and then they will never be comparable. So I think that's one of the charges for us. I don't want us to think oh, this is all voluntary, so we don't need any consistency. I think we're on the path toward more transparency and consistency and rigor in measurement, and this meeting is part of that.
Ms. Dailey: And that was our experience in working with States for the CHIP program, since I think it was 2003 or 2004. We had four measures with clear specifications that we asked States to report on, and again, they basically had to change the specifications to fit their eligibility systems and, you know, the whole enrollment period issues that developed. They couldn't meet a full year of enrollment criteria for some of their kids who were churning in and out, and over 5 years, we still weren't necessarily able to compare across those programs. And so I think that's why this is so critical, to give them something in terms of expectations and the scientific rationale behind it, you know, educating them about why it's this way.
Ms. Wachino: Yes, I think one of the challenges we're going to have, having worked with Medicaid agencies for 13 years on measurement for accountability and for improvement, is that trying to have measures that all States can collect might compete with trying to have measures that can actually drive improvement. What is measurable and can feasibly and sustainably be collected might not be what drives improvement, and the measures that could actually drive improvement might be on a longer track. So we'll have that balancing act. And thankfully, there's a Federal commitment to support the work that needs to be done.
Ms. Dailey: And through our reports to Congress, we can make recommendations going forward if there's additional need.
Dr. Dougherty: Yes. Thank you, Vicki. And I think we'll move on to the next set of speakers. As I said, Rita Mangione-Smith is going to provide us with an overview of the challenges we faced during the identification of the initial core measurement set. You will also learn what the initial core measurement set and the SNAC [the Subcommittee of the AHRQ National Advisory Council] are, because I've been throwing these little acronyms around. So thank you, Rita.
Dr. Mangione-Smith: Good morning, everybody, and thank you, Denise, for the opportunity to talk to this group about what was really an incredible experience last summer and fall. I think it's fitting that I have 10 minutes to tell you about it because we felt like we had about 10 minutes to do it. So I'm going to be talking to you about lessons learned in the process that the SNAC went through. We call it the SNAC; actually, my co-chair came up with that name, I think at 2:00 in the morning, as we were producing slides for the NAC. He said, "SNAC CHIP, don't you think that's a good name for it?" I e-mailed back, "Yeah." So that's how the birth of the SNAC happened. I want you to understand what a multidisciplinary process this was. I definitely want to take a minute to recognize my co-chair for the SNAC, Jeff Schiff, who was amazing to work with. Denise and her staff were incredible during the whole process, and also Barbara and her staff. We had two members of the NAC, which is the AHRQ National Advisory Council, for those of you who may not be familiar with health care quality at AHRQ: Tim Brei and Kathy Lohr. And then we had another 20 individuals who took part, some of whom are sitting in the room, and it's nice to get back together with them today. So lots of different areas of experience around the table.
So what was our charge? Our charge was first to do some of what we're talking about today: provide guidance on what criteria we should be using to identify the measures, or to assess the measures that we were going to recommend for this core set. So that was our first charge. Our second charge was to provide a strategy we would use to try to find all the measures that we should be looking at, and then finally, to come up with a strategy for applying the criteria we agreed on to those measures to arrive at our recommendations for the core set. Our activities ran from July until, in all honesty, the end of September. So it was a really, really short timeframe to get a lot of work done.
So this was our process, and I'm going to go over it in broad brush strokes and focus more on what we learned from the process, what the problems were, and what we might try to do differently next time. It started out with a big effort by AHRQ and the Centers for Medicare & Medicaid Services (CMS) to identify existing measures in use by Medicaid and CHIP. We then got together, the co-chairs and the Federal Quality Workgroup, and decided that for our first meeting in July, it would be most useful if we could present the Subcommittee with those measures, along with some definitions of criteria for evaluating them, so that when we got to the meeting we would actually have some evaluated measures to talk about. So that's what we did.
We decided we would do a University of California, Los Angeles (UCLA)/RAND modified Delphi process, and if anybody knows my history, why we did that is probably obvious. Beth McGlynn's in the room; she taught me everything I know, and that's the way I've been taught to evaluate measures. So that's the process we decided to embark on. We took the validity and feasibility definitions that are traditionally used at RAND, but then we also worked together as a group to modify those definitions a bit for the process we were about to go through, which was a little bit different from doing the kinds of panels that we do at RAND.
First, we sent all the measures to the Subcommittee, we sent them the criteria, and we said, in a week, can you please get us all of your evaluations back? And they did, which was amazing. So we generated scores from that process and had our meeting. At that meeting, one of the very first things we did, as our first charge, was look at the criteria we had just used for validity and feasibility, and we did a lot of tweaking. It was a great group, and they had a lot of great ideas. We did decide that the criteria we were using needed some changes. We also added an additional criterion we felt was needed, importance of the measure, and worked to gain consensus on how that criterion should be defined.
The other big decision we made at that meeting was that we needed to go beyond CHIP and Medicaid measures. Everybody felt very strongly about that. We would stick to measures in use, but we wanted to look at measures outside of what was being used in CHIP and Medicaid. Then, between the two meetings, we went through the process of trying to find measures and came up with a process that would allow people to nominate measures, and I'll go into that in a little bit more detail. So we got a whole new group of measures, we applied our criteria again at our September meeting, went through some more ranking and voting, and came up with the final 25 measures that were put up for public comment.
Okay, so what was our problem? I think I've already made it clear it was the short timeline. Already you can probably tell there were some problems with how we went through our process. Our process required that we establish draft measure evaluation criteria for validity and feasibility before we ever even met as a group. So what did that result in? We ended up doing some rework. Some inefficiency was introduced because of that, because once we came to our own consensus criteria, we all felt like we needed to look again at the measures we had just scored because, you know, we wanted all of the measures to be graded using the same criteria. So Delphi Round 1 was not a total waste of time. I think it gave everybody a chance to get their feet wet with using the Delphi process and applying criteria to measures, but it really was a practice round for the group.
So I'm not going to spend a lot of time talking to you about what we agreed on for definitions. Denise has given everybody a handout that basically outlines our definitions for validity, feasibility, and importance, but for the sake of time, I'm not going to go over them in detail right now.
This was a conceptual model that was suggested by one of the people who is on the Subcommittee and sitting here in the room, Marina Weiss. Her idea, from all of our conversations the first day, was that what we were really talking about was this: there were clearly some grounded measures, there were some intermediate measures that had been developed but had not been used extensively, and then there were the measures we all wish we had, the aspirational measures. So as a group, the Subcommittee decided that what we really wanted to make the scope for the core set were the grounded measures. We wanted to try to come up with a group of 10 to 25 measures that were currently feasible according to our definition. The one thing I will tell you about that feasibility definition, the part I'll emphasize, is that the measure had to be in use, and there had to be existing detailed specifications for it. The intermediate group, again, we don't know how many measures are out there like that. Some of them have good specifications, but they aren't being used as widely. And then the aspirational measures are obviously the measures that need to be developed through this upcoming program.
Other important decisions about the scope of what we ended up with were that we really needed to be realistic about staffing, funding, and all the needs for collecting, analyzing, and reporting data at the State level. Given the economic crisis that we're in, this was just something we had to keep at the forefront of our thinking. And I already mentioned that we were going to expand beyond the CHIP and Medicaid measures. So those were the main decisions that happened at our first meeting.
Between the meetings, AHRQ developed an online nomination template for measures. So people could go in and they could nominate measures. Many representatives from the Federal Workgroup did that. There were some that were actually entered by AHRQ from people in the public who wanted to enter measures. So basically when you went into that template, you had to enter information about feasibility, validity, and importance. It kind of guided you through the template asking you key questions to help us get at how well that measure met the criteria.
So the problem was that the template became available online at the beginning of August, and we stopped accepting measures about 3 weeks into August. We had to do that because we had to have some time to synthesize that information for the Subcommittee meeting in September. So we got many incomplete submissions, unfortunately. Some of them had no specifications attached to the submission, didn't give us any evidence related to the scientific soundness of the measure, and presented incomplete information about some of our importance criteria.
So what did we do? We said okay, well, as much as we can, we're going to try to fill in the gaps for the missing information. So Denise and her staff and my co-chair and I spent last summer trying to fill in the gaps. We looked for specifications on measures that had been nominated, and we attempted to obtain information related to some of our importance criteria, especially around disparities and variation in care according to the measure. We did evidence reviews of the literature, trying to find any evidence that supported a given measure, and then, after we identified evidence, we graded it using the Oxford Centre for Evidence-Based Medicine criteria, which I think Denise also gave you a copy of. And then we decided the Subcommittee would go crazy if we didn't distill this even further, so we created one-page summaries that basically told them the measure name, who owned the measure, the numerator, the denominator, the evidence that supported it, whether any validity testing had been done on the measure, any reliability testing, and whether it met some of the importance criteria.
So we got to look at 119 measures, 119 one-page summaries, and despite all of our efforts there was still a lot of missing information. For 22 percent of the measures, no specifications could be identified. There was no reliability data on about 50 percent of the measures, and 24 percent of them, despite what we said at the very beginning of the nomination template, were not currently in use by anybody that we could identify. The evidence grades were about what we would expect for pediatric quality measures. There's not a lot of randomized controlled trial data, which is Level A evidence; it's mostly outcome studies and cohort studies, which are Level B in the Oxford system. There was very little information about variation in performance on the measure or disparities. Less than half of the nominators provided that, and we couldn't find it. So we had our second Delphi process. We had more information on the proposed measures than we did the first time around, but it was still pretty incomplete.
We had 1 week to assess summaries on 119 nominated measures before the meeting, and we decided as a group to adopt a philosophy, which actually came from the first meeting, that we preferred to leave an empty chair rather than fill it with a measure no matter how weak it was or how little evidence there was for its validity. So, here are some of the empty chairs. These are things that are called for in the legislation, areas of measurement that they really wanted in the core measurement set, for which we honestly just could not find measures that we felt good about recommending. Those included things that have to do with measuring the medical home and most integrated health care systems, and you can see down the list: duration of care, inpatient care. Mental health care was a big one where we really just had a very hard time finding good measures.
So what lessons did we learn? If I were going to do this again, here's how I would do it. I would start with the nomination process before we ever met, and I would want more time, for a few reasons. First of all, more time to obtain nominations from people, because I think we missed out on some potentially good measures because of that truncated timeline. More time to evaluate and summarize the nominations, and having more than, like, five of us trying to fill in all the gaps would have been really helpful. And then more time for the Subcommittee to really take in the information and do their scoring. It just felt so rushed that it was really hard for people to make good, sound assessments about the measures.
I think it probably would have been smart to reach consensus on our evaluation criteria before we tried to grade measures; that would have prevented a lot of rework on the part of the Subcommittee. And I've come to the conclusion that even if we had had years, the process was never going to be perfect. So I'm going to stop there because we have other people who need to talk.
Dr. Dougherty: And we'll hear these three presentations and then have some Q&A. Dr. Ernest Moy has been with AHRQ for about 10 years now. He came to us from the AAMC, the Association of American Medical Colleges, where he was working on disparities. And since he's come to AHRQ, he's been working on the congressionally mandated National Healthcare Quality and Disparities Reports that AHRQ produces for the U.S. Department of Health and Human Services. So he's going to share his wisdom in trying to sort out which measures are good enough for those reports to Congress. Thank you, Ernest.
Dr. Moy: Thank you, Denise. I want to start off with that last comment about having forever and the process and the outcome still being imperfect. I can certainly reinforce that, though our activity is a little bit different. Ours relates to selecting measures and a measure set for a national healthcare quality report and a disparities report, and this involved picking measures and developing measure sets, so there's that commonality, but there are some differences as well. And so the message I'm trying to relate to you is certainly not to do it our way, because we've been working on our measure set now for over 10 years, and it is still imperfect, and we still add to it every year, and we still refine it every year. So probably my message would be not to do it our way, unless I wanted Denise to kill me. But some of the things that we explored are probably of interest to this group, and so I just wanted to tell you our story.
This is kind of what we went through and some of the lessons that we learned. Two seconds on what the reports actually are. The long and short of it is that we were asked by Congress, through authorization in 1999, to produce these two reports: the National Healthcare Quality Report, which we view as a summary of trends in quality of care in the Nation, and the National Healthcare Disparities Report, which focuses on disparities related to race, ethnicity, and socioeconomic status. Operationally, what does that mean? Well, it means the first five Institute of Medicine (IOM) categories (effectiveness, safety, timeliness, patient-centeredness, and efficiency) are in the quality report, and all of equity is in the disparities report. Operationally, this also means the disparities report is several times larger than the quality report.
I wanted to talk about some of the similarities and differences perhaps between our endeavor and this endeavor. We had a number of advantages, I think, first of all, because this was given to us by law. So we kind of knew or we had some insight into what folks wanted from us, whereas this group might have a little bit less insight I think. And so these were some of our assumptions. First of all, the law did specify some of the things we were supposed to do. So for instance, the disparities report said look at racial, ethnic, and socioeconomic disparities, not other disparities.
Then we could make some intelligent assumptions about other things, things that were important for us to specify in criteria for measures and for the measure set. First and foremost was our primary audience. We knew our primary audience was Congress, and so we knew that we were supposed to look at the national level and provide this big-picture kind of reporting, as opposed to other ways of looking at this information.
Secondarily, we knew that our analytic unit was the Nation or other geographic units, not the provider, and that's a big advantage: we did not have to incorporate the aspect of accountability. I think someone raised that issue; if you want something accountable, you're going to have different criteria. We didn't have to deal with that. We did not have accountability criteria because we were not going to be looking at providers. Third, our primary purpose, our primary use, was viewed a priori as national tracking: how are we doing, what direction are we going, maybe providing some geographic benchmarks, not quality improvement, not pay-for-performance, not public reporting. So again, we didn't have to deal with those aspects of developing a measure or measure set that relate specifically to that, and we don't necessarily have criteria that involve those topics. We also had a number of constraints: annual reporting and using extant data only, and these obviously entered very strongly into our assessment of feasibility.
Our process was perhaps similar to the CHIPRA process but probably also a little bit different. We also had a call for measures. The time period for this was 1999 until 2002, when we really started working on the first set of reports. We had a call for measures and got over 600 nominations from different kinds of organizations, and at that point we had a Federal interagency workgroup work with these measures. This was an internal process, not an external process. Again, a little bit different, maybe a little bit more streamlined as a consequence. This group whittled those 600 measures down to a smaller set. That set was then published in a Federal Register notice, and we solicited input on this initial measure set. It was further tweaked. We also had the National Committee on Vital and Health Statistics hold public hearings about the draft measure set to solicit input, and from this we got what we finally came up with.
Another building block that we had available to us (I don't know if you have something like this available to you) was the IOM framework. And so, a priori, our organization said we're going to use the IOM framework for quality of care, and it pretty much arrays the major domains of quality of care (effectiveness, safety, timeliness, and patient-centeredness) against patient perceptions of care: staying healthy, getting better, living with illness or disability, and coping with the end of life. So that was the initial framework for the quality report.
The IOM also told us to then take that and look at it for disparities. They never graphed it out to see what that actually looked like, and when we graphed it out, this is kind of what we made of it, and it also made us appreciate how big an undertaking the disparities aspect really is. You take this matrix of all these measures, this square as it were, and now we're going to array it by race. We have more than two racial groups, so it's multiple racial comparisons, multiple ethnic comparisons, multiple socioeconomic comparisons. You get this cube. And so that was our framework. Not necessarily the most elegant thing, but it gives you an appreciation of the work involved in incorporating a disparities element, something that I think this group wants to do.
In addition, we had to add the dimension of access to care for the disparities report, because you can't really look at disparities without it. Someone made that comment earlier: you can't differentiate access and quality from a disparities perspective, and so we had to build in this access concept. And then, of course, all of this is supported by and driven by health care needs.
So those are the building blocks that we had, and these are the criteria that we applied. We started off with the IOM's recommendations, their criteria for measure selection: importance, scientific soundness, and feasibility, and I think at least the latter two are the foci of the breakouts for this group. But in addition, because we're the Federal Government, we wanted to be consistent with existing consensus-based measures where possible, and so we relied heavily on other things, both inside and outside the Federal sector: a strong preference for Healthy People 2010, National Quality Forum (NQF) measures, and other kinds of consensus-based measure sets. And that's pretty much our criteria for measure selection. Our outcome was basically about 150 measures of quality of care, and then, using a similar process, we got about 50 measures of access to care, and that's what we started off with.
Well, about a year or two after we came out with the first report, we said this is not the most optimal of outcomes. Picking a measure set based purely on the measure selection criteria had a couple of problems. First of all, we had this huge measure set, over 200 measures because people would say well, this measure is as good as that measure so you should include that one as well, and this is much too large really for reporting. It was much too large for communicating this information in any effective manner. Secondly, and I heard the balance term used a lot, it was unbalanced. Picking the measures purely on their individual measure criteria resulted in a very unbalanced measure set, often difficult to interpret. So we would have multiple different measures for a particular topical area, and it's kind of hard to explain to a policymaker what to make of it—there are three up and four down, and so on—they don't want to hear that. They want to hear a single message, so it was very difficult to interpret.
Lastly, this measure set had a large number of measures that weren't applicable to disparities. So we went through this process of looking at the quality of the measure and we wound up with a number of measures for which there was no disparities information, not usable therefore in the disparities report.
So we wanted to rework it, and in reworking these were our goals. One was to get a core measure set that was smaller, and so this was what we ultimately operationally figured out was feasible. We can report on about 50 measures every year and actually, we think, say something reasonable about them, track them every year, and people can know that they're coming up and know what to expect and have some knowledge about that number of measures, not more. We wanted to balance out the measure set, and I'll talk about some of our balancing criteria. We wanted to emphasize understandability and for policymakers, understandability meant summarization and composites. We had a number of expert panels helping us with that process. But that's the direction we went in intentionally. And we wanted to make everything usable in the disparities report, so everything ought to have a disparities analogue.
So we went to Phase II measure set selection criteria, and again we got our whole Federal interagency workgroup to review our measure set. We've done this pretty much every year, continually adding to it and refining it. And the criteria we use have naturally grown as our measure set has. So we still use the IOM criteria of importance, scientific soundness, and feasibility, and we still try to maximize consistency with what others put out there so that people don't get conflicting information, but these are new kinds of criteria that we considered in looking at the measure set as well as individual measures. We focused on issues that were of high utility for directing public policy, and for us that means things where we think there actually is a driver there. We looked at things that potentially were sensitive to change, and that was a bias: it favors processes, which are changeable, over outcomes, which are more difficult to change. Ease of interpretation, applicability to the overall population, data collected regularly and recently, the ability to look at multiple disparities as opposed to only one or two, and the ability to support multivariate modeling. One of the things that came out of our disparities report was that certain constituents really wanted to be able to do multivariate models to isolate the specific racial or socioeconomic disparity effect. So these are some of the new measure criteria we considered in improving our measure set.
In addition, we had a whole series of balance criteria. These are not about looking at measures one by one but instead stepping back and looking at our whole panel, our entire measure set, and these are some of the things that we've tried to balance across. One is balance across quality domains. When we started out, we were very, very highly focused on effectiveness, but we think we've been able to enrich our safety, timeliness, patient-centeredness, and equity measures over time. Another was that even though process measures are more actionable, we intentionally wanted to balance process and outcome measures, because many policymakers are interested in the outcomes, and preferably, if the process and outcomes could be linked, as in the CHIP criteria, that was the best story that could be delivered. We wanted to balance across different kinds of health conditions. We wanted to balance across different sites of care. We intentionally wanted different types of data. The notion here was that every kind of data has its own limitations, its own potential biases, and so we felt the most comfortable operationally when we had different kinds of data, administrative data, surveys, and clinical data, telling the same story, and that's when we felt the most confident about the assessment of quality.
We wanted to include at least some measures that had State data and some measures that allowed for multivariate models. Even after 10 years, there are a couple of different criteria that are still unresolved for us. One, we still took the overall model of taking a quality measure set and then looking at it across different populations to assess disparities. We did not include specific measures for specific populations. So for instance, if you have one population that has a very important kind of topical area that isn't applicable to the general population, we don't have measures like that. We also don't have explicit measures of disparity itself. We're always taking quality measures and then looking at differences as our measure of disparity. We know this is suboptimal, but these are just simply unresolved issues, even after 10 years. So that's our process, and hopefully it might be helpful to you.
Dr. Dougherty: Thank you. So you had a lot of the same issues we dealt with last summer. And if people want to see the good—actually, it's a very good measure set I think, not to bias your public comments that you're going to send in, but in the back of your briefing book you can see the 24 measures that are actually out for public comment that came out of all that work that Rita described. So now I'm going to ask Helen, are you ready?
Dr. Burstin: I'm on the line.
Dr. Dougherty: Okay, great. I'm going to try to find—this is Helen Burstin who is the Vice President for Performance Measurement, is that still your title, Helen?
Dr. Burstin: Vice President of Performance Measures at NQF, yes.
Dr. Dougherty: At NQF, okay. And she is going to speak through the phone.