
Chapter 2

Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries

2. Methods and Data1

2.1 Improving the Race/Ethnicity Coding of Medicare Beneficiaries

History of EDB Race/Ethnicity Coding. The race/ethnicity code on the Medicare EDB is obtained from the Social Security Administration's (SSA's) master beneficiary record (MBR). From 1935 to 1980, the Social Security application form (SS-5) allowed a person's race to be classified only as "White," "Black," or "Other." In addition, "Unknown" was used to classify persons who did not report any race. The codes from the SS-5 were incorporated into the MBR. In 1980, the number of race/ethnicity categories on the SS-5 form was expanded to six: "White (non-Hispanic)," "Black (non-Hispanic)," "Hispanic," "Asian, Asian-American, or Pacific Islander," "American Indian or Alaska Native," and "Unknown." In 1989, the SSA began to enroll new participants at birth, extracting data from birth certificates rather than requiring applicants to file form SS-5; however, the race/ethnicity information on the birth certificate was not included in the data extraction because it was considered unnecessary for the administration of the SSA program. Since 1989, the only persons filing an SS-5 form have been those requesting a new number or a name change (Scott, 1999).

In 1994, race data from the SS-5 forms with the expanded race/ethnicity codes were integrated into the Medicare EDB in an effort to correct erroneous codes and fill in missing ones. This action changed the race/ethnicity coding for more than 2.5 million beneficiaries (Lauderdale and Goldberg, 1996). This update using the SS-5 form with the expanded race/ethnicity codes was conducted again in 1997 and 2000, and has been conducted on an annual basis since then. The Medicare program has also been working with the Indian Health Service to improve the coding of American Indians and Alaska Natives.

To correct miscoded data and further reduce the amount of missing race/ethnicity information, in 1997 the Health Care Financing Administration (now the Centers for Medicare & Medicaid Services, or CMS) conducted a postcard survey of nearly 2.2 million beneficiaries. The survey included beneficiaries with Hispanic surnames, beneficiaries born in Hispanic countries, and beneficiaries coded "Other" or missing race/ethnicity data. The survey resulted in code changes for approximately 858,000 beneficiaries (Arday et al., 2000). These efforts clearly improved the EDB's race/ethnicity data. Nonetheless, comparisons of the EDB race/ethnicity codes with the self-reported race/ethnicity from the Medicare Current Beneficiary Survey (MCBS) indicated that identification of Hispanics, Asians/Pacific Islanders, and American Indians/Alaska Natives was still incomplete and might result in biased analyses involving these groups (Arday et al., 2000; Eggers and Greenberg, 2000; Waldo, 2005).

Assessment of the Accuracy of Race/Ethnicity Coding on the EDB. The accuracy of the Medicare EDB race/ethnicity code was further assessed by the RTI researchers working on the recently completed Health Disparities project referred to earlier. This assessment consisted of a comparative analysis of the EDB race/ethnicity code with self-reported race/ethnicity data obtained from 830,728 Medicare beneficiary respondents to the 2000-2002 Medicare CAHPS surveys. The analysis investigated the accuracy of the six race/ethnicity classifications used in the EDB race/ethnicity code (non-Hispanic White, non-Hispanic Black, Hispanic, Asian/Pacific Islander, American Indian/Alaska Native, and Unknown/Other). The measures calculated and presented in Table 2.1 to assess the accuracy of the EDB codes include: sensitivity,2 specificity,3 positive predictive value4 (PPV), negative predictive value5 (NPV), and the Kappa6 coefficient of inter-rater reliability.

Relative to self-reported data, the accuracy of the EDB was greatest for non-Hispanic Black Medicare beneficiaries: sensitivity was 97.4 percent, specificity was 98.8 percent, PPV was 86.3 percent, NPV was 99.8 percent, and a Kappa coefficient of 0.91 was observed. Non-Hispanic White beneficiaries were the next most accurately identified group on the EDB. Sensitivity was high (99.3 percent), but specificity was just 61.7 percent, suggesting that a sizeable proportion of beneficiaries who were not White were incorrectly coded as White. The PPV and NPV were 91.7 and 95.7 percent, respectively, but the Kappa coefficient was only moderately high at 0.71, reflecting the lower level of specificity. Sensitivity for American Indian/Alaska Native beneficiaries was very low at 35.7 percent, and the PPV was low at 59.9 percent. Specificity and NPV for this group, however, were exceptionally high at 99.9 and 99.7 percent, respectively. The low Kappa coefficient of 0.45 reflects the low sensitivity of the EDB for this group.

Table 2.1. Accuracy and agreement between EDBRACE and SELFRACE

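| Race/ethnicity | SELFRACE | EDBRACE: Yes | EDBRACE: No | Sensitivity (%) | Specificity (%) | Positive predictive value (%) | Negative predictive value (%) | Kappa |
|---|---|---|---|---|---|---|---|---|
| White | Yes | 667,573 | 4,420 | 99.3 | 61.7 | 91.7 | 95.7 | 0.71 |
| | No | 60,794 | 97,941 | | | | | |
| Black | Yes | 57,867 | 1,515 | 97.4 | 98.8 | 86.3 | 99.8 | 0.91 |
| | No | 9,209 | 762,137 | | | | | |
| Hispanic | Yes | 12,953 | 30,974 | 29.5 | 99.9 | 92.7 | 96.2 | 0.43 |
| | No | 1,025 | 785,776 | | | | | |
| A/PI | Yes | 8,008 | 6,626 | 54.7 | 99.8 | 84.5 | 99.2 | 0.66 |
| | No | 1,469 | 814,625 | | | | | |
| AI/AN | Yes | 1,194 | 2,150 | 35.7 | 99.9 | 59.9 | 99.7 | 0.45 |
| | No | 799 | 826,585 | | | | | |
| Other/Unknown | Yes | 478 | 27,158 | 1.7 | 98.8 | 4.9 | 96.7 | 0.01 |
| | No | 9,357 | 793,735 | | | | | |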

Source: EDBRACE is from the mid-2003 Medicare EDB, and SELFRACE is from the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002. Table reprinted from the final report for CMS Contract Number 500-00-0024, Task 8, Health Disparities: Measuring Health Care Use and Access for Racial/Ethnic Populations, dated April 2005.

The focus of the project, however, was on Hispanic and Asian/Pacific Islander beneficiaries because earlier research had shown that the sensitivity of the EDB was especially low for these groups. Indeed, sensitivity of the EDB for Hispanic beneficiaries was only 29.5 percent, but specificity (99.9 percent), PPV (92.7 percent), and NPV (96.2 percent) were very high. The Kappa coefficient of 0.43 reflected this low sensitivity, that is, the small share of Hispanic beneficiaries correctly identified on the EDB. The situation was somewhat better for Asian/Pacific Islander beneficiaries. Here, sensitivity was 54.7 percent, so the EDB correctly identified only slightly more than half of this group. Specificity and NPV were both very high at 99.8 and 99.2 percent, respectively. Even the PPV was respectable at 84.5 percent, and the Kappa coefficient of 0.66 was only slightly lower than that for White beneficiaries, likely reflecting the lower sensitivity.
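The measures reported in Table 2.1 follow directly from each group's two-by-two cross-tabulation of self-report against the EDB code. The Python sketch below recomputes them from the Hispanic counts in the table. The function name `accuracy_measures` and its argument names are illustrative and do not come from the original analysis, and Kappa is computed with the standard Cohen's formula, which is assumed here to be the one the report used.

```python
def accuracy_measures(tp, fn, fp, tn):
    """Accuracy and agreement measures for one race/ethnicity category.

    tp: self-report yes, EDB yes    fn: self-report yes, EDB no
    fp: self-report no,  EDB yes    tn: self-report no,  EDB no
    """
    n = tp + fn + fp + tn
    # Observed agreement: both sources say yes, or both say no.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Hispanic counts as printed in Table 2.1.
hispanic = accuracy_measures(tp=12_953, fn=30_974, fp=1_025, tn=785_776)
for name, value in hispanic.items():
    print(f"{name}: {value:.3f}")
```

Rounded, the results match the Hispanic row of Table 2.1: 29.5, 99.9, 92.7, and 96.2 percent, with a Kappa of 0.43.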


2.2 Development of the Algorithm

In light of the low sensitivity of the Hispanic and Asian/Pacific Islander race/ethnicity categories on the EDB, we employed a multi-stage process through which separate Hispanic and Asian/Pacific Islander imputation algorithms were developed. These algorithms used several pieces of information on the EDB including:

  1. A variable that identified the language a beneficiary preferred CMS use when sending the Medicare Handbook. English, Spanish, and blank (no preference specified) were the only allowed values. This variable is referred to as LANGPREF.
  2. A variable that identified the language a beneficiary requested the Social Security Administration (SSA) use when sending beneficiary notices. This variable was used by CMS for Medicare premium bills. English (for Puerto Rican zip codes only), Spanish, and blank (English assumed for non-Puerto Rican zip codes and Spanish assumed for Puerto Rican zip codes) were the only allowed values that HCFA supports. This code is referred to as LANGCD.
  3. A variable that identified the source of a beneficiary's race/ethnicity code on the EDB (EDBRACE). This variable is referred to as RACESRC. Three values are allowed:
    A = Response from a one-time survey that was mailed to certain beneficiaries in 1997
    B = Indian Health Service
    Blank = Social Security Administration—Master Beneficiary Record (SSA-MBR) or SS-5 (NUMIDENT) or Railroad Retirement Board (RRB)
  4. A variable that identified the state in which a beneficiary resides. We identified beneficiaries living in Hawaii and Puerto Rico.

The algorithms also used Hispanic (Word and Perkins, 1996) and Asian/Pacific Islander (Falkenstein and Word, 2002) surname lists developed by the U.S. Census Bureau. In the Hispanic surname list, Word and Perkins assign each name a percentage representing the proportion of households headed by an individual with that surname that were reported to the Census as Hispanic households. Falkenstein and Word provide similar percentages for the Asian/Pacific Islander surname list.

We incorporated these pieces of information into a SAS program that, through an iterative process that differed slightly for Hispanics and Asians/Pacific Islanders, created an improved race/ethnicity variable (NEWRACE). The logic of the algorithms is described below.

A beneficiary was considered Hispanic (or Asian):

  1. If the surname algorithm identified the beneficiary as Hispanic (Asian) at the stated inclusion level of 70 percent.
  2. Otherwise, if the EDB coded the beneficiary as Hispanic (Asian).
  3. Otherwise, if the person was a resident of Puerto Rico (Hawaii).
  4. Otherwise, if the variable LANGCD indicated Spanish.
  5. Otherwise, if the beneficiary's first name had Hispanic (Asian) origins, and the surname, at the 50 percent inclusion level, identified the beneficiary as Hispanic (Asian).

A beneficiary was considered not Hispanic (Asian):

  1. If not identified in the above steps.
  2. Otherwise, if the variable LANGPREF indicated English.
  3. Otherwise, if the variable RACESRC indicated the EDB's race code came from the 1997 survey and the EDB's race code was not "Hispanic" ("Asian").
  4. Otherwise, if the variable RACESRC indicated the beneficiary's EDB race code came from the Indian Health Service.
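
The two rule lists above can be sketched as a cascade for the Hispanic case. This is a minimal illustration, not the original program. The `Beneficiary` fields, the code values "S", "E", and "PR", and the pre-computed `surname_pct` and `first_name_match` inputs are all hypothetical. The text does not spell out how the exclusion rules interact with the inclusion rules, so the sketch assumes the literal reading, in which exclusion rules 2 to 4 override an earlier positive match; the `exclusions_override` flag makes that assumption easy to switch off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Beneficiary:
    surname_pct: float       # hypothetical: Census surname-list percentage (0-100)
    first_name_match: bool   # hypothetical: first name flagged as Hispanic in origin
    edb_race: str            # EDBRACE value, e.g. "Hispanic" or "White"
    state: str               # state of residence; "PR" assumed for Puerto Rico
    langcd: str = ""         # LANGCD; "S" assumed to mean Spanish
    langpref: str = ""       # LANGPREF; "E" assumed to mean English
    racesrc: str = ""        # RACESRC: "A" = 1997 survey, "B" = Indian Health Service


def inclusion_rule(b: Beneficiary) -> Optional[int]:
    """Return the number of the first inclusion rule that fires, or None."""
    if b.surname_pct >= 70:
        return 1
    if b.edb_race == "Hispanic":
        return 2
    if b.state == "PR":
        return 3
    if b.langcd == "S":
        return 4
    if b.first_name_match and b.surname_pct >= 50:
        return 5
    return None


def exclusion_rule(b: Beneficiary) -> Optional[int]:
    """Return the number of the first exclusion rule (2-4) that fires, or None."""
    if b.langpref == "E":
        return 2
    if b.racesrc == "A" and b.edb_race != "Hispanic":
        return 3
    if b.racesrc == "B":
        return 4
    return None


def is_hispanic(b: Beneficiary, exclusions_override: bool = True) -> bool:
    if inclusion_rule(b) is None:
        return False  # exclusion rule 1: not identified by any inclusion step
    if exclusions_override and exclusion_rule(b) is not None:
        return False  # assumed reading: exclusion rules 2-4 take precedence
    return True


example = Beneficiary(surname_pct=85.0, first_name_match=False,
                      edb_race="White", state="TX")
print(inclusion_rule(example), is_hispanic(example))
```

The Asian/Pacific Islander case would follow the same shape, with Hawaii in place of Puerto Rico and, per the list above, no counterpart stated for the LANGCD step.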


2.3 Assessment of the Algorithm

Using the self-reported race/ethnicity data from the 2000-2002 Medicare CAHPS survey respondents as the gold standard, we assessed the results of applying the algorithms to the CAHPS respondents (i.e., the NEWRACE variable). We found that the algorithms substantially improved the race/ethnicity categorization of Hispanic and Asian/Pacific Islander Medicare beneficiaries. As can be seen from Table 2.2, among Hispanic beneficiaries, sensitivity was 76.6 percent (improved from 29.5 percent), the Kappa coefficient was 0.79 (an increase from 0.43), and the other measures (specificity and predictive values) changed much less. The improvement for Asian/Pacific Islander beneficiaries was equally impressive: sensitivity rose to 79.2 percent (from 54.7 percent), Kappa increased to 0.80 (from 0.66), and the other measures were not materially changed. Analysis of the improvements indicated that in both groups somewhat more males than females were correctly identified (possibly because of intermarriage and surname changes among ethnic females), and more 65-74 year olds than those older than 74 (probably because there are more beneficiaries in the younger age group).
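
To give a sense of scale, the sensitivities quoted above can be converted into approximate counts of correctly identified CAHPS respondents. The sketch below is a back-of-the-envelope calculation, not a result from the report: it multiplies the NEWRACE sensitivities by the self-reported group sizes from Table 2.1 and assumes both tables cover the same respondents. Because the sensitivities are rounded to one decimal place, the counts are approximate, and all variable and function names are illustrative.

```python
# Self-reported group sizes among CAHPS respondents, from Table 2.1
# (self-report "Yes" row: coded yes plus coded no on the EDB).
self_reported = {
    "Hispanic": 12_953 + 30_974,
    "Asian/Pacific Islander": 8_008 + 6_626,
}
# Respondents already coded correctly on the EDB (Table 2.1).
edb_correct = {"Hispanic": 12_953, "Asian/Pacific Islander": 8_008}
# NEWRACE sensitivities quoted in the text, as proportions.
newrace_sensitivity = {"Hispanic": 0.766, "Asian/Pacific Islander": 0.792}


def implied_correct(group):
    """Approximate count correctly identified by NEWRACE (sensitivity x group size)."""
    return round(newrace_sensitivity[group] * self_reported[group])


for group in self_reported:
    gain = implied_correct(group) - edb_correct[group]
    print(f"{group}: about {implied_correct(group):,} correctly identified, "
          f"roughly {gain:,} more than with the EDB code")
```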

Table 2.2 Accuracy and agreement between NEWRACE and SELFRACE

| NEWRACE | Sensitivity (%) | Specificity (%) | Positive predictive value (%) | Negative predictive value (%) | Kappa |
|---|---|---|---|---|---|
| Hispanic | 76.6 | 99.2 | 84.5 | 98.7 | 0.79 |
| Asian/Pacific Islander | 79.2 | 99.7 | 81.5 | 99.6 | 0.80 |

Source: NEWRACE is from the algorithms developed by RTI, and SELFRACE is from the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002. Table taken from the final report for CMS Contract Number 500-00-0024, Task 8, Health Disparities: Measuring Health Care Use and Access for Racial/Ethnic Populations, dated April 2005, and reformatted for this report.

After demonstrating the clear advantage of using the Hispanic and Asian/Pacific Islander algorithms to improve the race/ethnicity categorization, we combined the algorithms and applied them to all of the active records in the mid-2003 unloaded EDB.


1 The assessment of the EDB race/ethnicity data, creation of the algorithm for imputing Hispanic and Asian/Pacific Islanders, selection of the sample of 1.96 million Medicare fee-for-service beneficiaries, and geocoding of beneficiary addresses discussed in this section was performed by RTI under CMS Contract Number 500-00-0024, Task 8 and has been adapted for this report from the project final report, Health Disparities: Measuring Health Care Use and Access for Racial/Ethnic Populations dated April 2005.

2 The percentage of persons who self-reported themselves to be of a particular race/ethnicity who are coded as being of that race on the EDB.

3 The percentage of persons who self-reported themselves not to be of a particular race/ethnicity who are coded as not being of that race on the EDB.

4 The percentage of persons coded in a particular race/ethnicity category on the EDB who really were of that race according to their self-report.

5 The percentage of persons not coded in a particular race/ethnicity category on the EDB who really were not of that race according to their self-report.

6 Kappa measures agreement between two independent race/ethnicity codes for the same person being coded, in this case between the self-reported and EDB race/ethnicity codes, where a coefficient of 1.00 represents perfect agreement and 0.00 is an absolute lack of agreement.



Page last reviewed January 2008
Internet Citation: Chapter 2: Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries. January 2008. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/research/findings/final-reports/medicareindicators/medicareindicators2.html