
Chapter 2a

Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries - Chapter 2a

2. Methods and Data (continued)

2.4 Using the Algorithm to Provide an Improved Race/Ethnicity Variable

Upon combining the Hispanic and Asian/Pacific Islander naming algorithms and verifying the combined algorithm's success on the CAHPS data, we created the NEWRACE variable for the entire Medicare population found in the EDB. The first step was to obtain from CMS all 41.7 million records of active beneficiaries in the 10 segments of the unloaded EDB from mid-2003. After we had uploaded these records, we ran the algorithm on them to create NEWRACE for each living beneficiary in the EDB.

Table 2.3 demonstrates the differences in the EDBRACE and NEWRACE variables for the entire population of active beneficiaries listed in the EDB. The number and percentage of Hispanic and A/PI beneficiaries increased, while they decreased for the White and Other race/ethnicity categories. The number and percent of Black beneficiaries also decreased slightly.

Table 2.3 Comparison of the distribution of race/ethnicity according to EDBRACE and NEWRACE for the entire EDB

Race/ethnicity                          Original EDB race variable (EDBRACE)   New EDB race variable (NEWRACE)
                                        Number       Percent                   Number       Percent
Asian/Pacific Islander (A/PI)           593,456      1.4                       854,182      2.0
American Indian/Alaska Native (AI/AN)   137,989      0.3                       136,498      0.3

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.

Table 2.4 shows that overall, 1,998,909 beneficiaries (footnote 7) listed in the EDB had their race/ethnicity recoded to Hispanic as a result of using the combined improved naming algorithm. Most of these beneficiaries were originally classified in the EDB as White (83.5 percent), followed by Other/Unknown (11.1 percent), and Black (3.8 percent). Very few were originally coded as Asian/Pacific Islander (1.5 percent) or American Indian/Alaska Native (less than 0.05 percent). Overall, more female beneficiaries (1,068,033) than males (930,875) were recoded to Hispanic. This pattern holds true for White, Black, and Asian/Pacific Islander beneficiaries. The largest number of "new" Hispanic beneficiaries was created in the 65-to-74-year-old age group. This is true regardless of the beneficiaries' original EDB race/ethnicity code and gender. Not surprisingly, the 85-year-old-and-older age group had the fewest beneficiaries with their race/ethnicity recoded. This undoubtedly reflects the overall age distribution of Medicare beneficiaries.

Table 2.4 Distribution of "new" Hispanic beneficiaries (NEWRACE) according to their EDBRACE, gender, and age group

EDBRACE, shown as number (percent)

Gender and
age group      White            Black          Asian/Pacific   American Indian/   Other or        Total
                                               Islander        Alaska Native      unknown
Under 65       170,155 (77.9)   10,650 (4.9)   1,789 (0.8)     287 (0.1)          35,501 (16.3)   218,382 (100.0)
85 and Older   48,690 (80.9)    2,506 (4.2)    859 (1.4)       9 (0.0)            8,106 (13.5)    60,170 (100.0)

Under 65       144,235 (80.4)   8,947 (5.0)    1,539 (0.9)     223 (0.1)          24,461 (13.6)   179,405 (100.0)
85 and Older   95,353 (82.3)    4,885 (4.2)    1,276 (1.1)     18 (0.0)           14,351 (12.4)   115,883 (100.0)

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.
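
The percentages in Table 2.4 are row percentages: each count divided by the row total. As a check on that arithmetic, the following Python sketch (an illustration only, not part of the project's SAS programs; the function name is hypothetical) recomputes them for the first "Under 65" row, with the column labels taken in the order White, Black, Asian/Pacific Islander, American Indian/Alaska Native, Other or unknown.

```python
def row_percentages(counts):
    """Return each count as a percentage of the row total, rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Counts from the first "Under 65" row of Table 2.4, by original EDBRACE code.
under_65 = {
    "White": 170_155,
    "Black": 10_650,
    "Asian/Pacific Islander": 1_789,
    "American Indian/Alaska Native": 287,
    "Other or unknown": 35_501,
}

row_total = sum(under_65.values())
percentages = row_percentages(under_65)
```

The counts sum to the row total of 218,382, and the rounded shares reproduce the published values of 77.9, 4.9, 0.8, 0.1, and 16.3 percent.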

As can be seen from Table 2.5, among Asian/Pacific Islander beneficiaries, 290,748 (footnote 8) were recoded as a result of using the combined improved naming algorithm. Unlike the Hispanic beneficiaries whose race/ethnicity was most often originally coded in the EDB as White, the majority of the new Asian/Pacific Islander beneficiaries were originally coded as Other/Unknown in the EDB. Exactly 82.0 percent of the newly coded Asian/Pacific Islander beneficiaries were originally coded as Other/Unknown. In addition, 16.4 percent were originally coded in the EDB as White, 1.5 percent as Black, and 0.2 percent as American Indian/Alaska Native. Note that we did not recode any beneficiaries to Asian/Pacific Islander who were originally coded as Hispanic in the EDB.

Table 2.5 Distribution of "new" Asian/Pacific Islander beneficiaries (NEWRACE) according to their EDBRACE, gender, and age group

EDBRACE, shown as number (percent)

Gender and
age group      White           Black       American Indian/   Other or        Total
                                           Alaska Native      unknown
Under 65       2,392 (11.6)    473 (1.1)   49 (0.2)           9,809 (87.2)    12,723 (100.0)
85 and Older

Under 65       4,263 (36.0)    596 (5.0)   40 (0.3)           6,947 (58.6)    11,846 (100.0)
85 and Older

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.

With respect to gender and age, the Asian/Pacific Islander recodes were very similar to the Hispanic recodes. Across original EDB race/ethnicity and age groups, with the exception of the Asian/Pacific Islander group under 65 years of age, more females than males were recoded to Asian/Pacific Islander. Overall, 155,744 females were recoded compared to 135,004 males. As with Hispanic beneficiaries, the group of Asian/Pacific Islander beneficiaries 65 to 74 years of age was recoded most, while the group 85 and older was recoded least.

Overall, the combined improved naming algorithm recoded the race/ethnicity of 2,290,027 Medicare beneficiaries. Females and those 65 to 74 years of age were most often recoded to a new race/ethnicity when we used the combined improved naming algorithm on the full 10 segments of the unloaded EDB. New Hispanic beneficiaries were most often originally coded as White, whereas new Asian/Pacific Islander beneficiaries were most often originally coded as Other/Unknown.


2.5 Geocoding Beneficiary Addresses to Link SES Data from the Census to the Beneficiaries in the EDB

Geocoding refers to the process of assigning a code number to each Medicare beneficiary's address that allows it to be linked to the U.S. Census data that describe characteristics of the beneficiary's place of residence. The primary reason to geocode the addresses of Medicare beneficiaries in the EDB is to enable the association of geographic-based U.S. Census measures of socioeconomic status (SES) with the beneficiaries, because the EDB currently contains no such measures. While U.S. Census SES measures are not individual-level measures, they can be aggregated to specified geographic units, such as the census block, block group, tract, county, or state, that are associated with every beneficiary. We wanted to geocode beneficiary addresses so we could use the socioeconomic characteristics of their neighborhood (block group) to impute their SES. Examples of the SES characteristics from the Census that we chose to associate with Medicare beneficiaries were the median household income, the percentage of the population unemployed, the median value of owner-occupied homes, and the percentage of the population below the federally defined poverty level. Such characteristics can be used individually to examine the effects of SES or be combined in some way to more fully represent the concept of SES. As was discussed earlier, one of the objectives of this project was to create a multi-component measure of SES. The details of Census geography and related data elements are described more fully in the U.S. Census Bureau's Geographic Area Reference Manual, which is available on-line.


2.5.1 Address Cleaning

In order to link the beneficiaries in the EDB to the Census information available for the beneficiaries' residential area, there must be something in common on both records. The U.S. Census data is identified by a Federal Information Processing Standard (FIPS) code that can identify values for areas as small as blocks and block groups for the SES data in which we were interested. The beneficiary's residential area, on the other hand, is identified by an address. We needed some mechanism for efficiently translating the addresses in the EDB to FIPS codes that corresponded to those in the Census. We obtained a computer database product from GeoLytics Incorporated of East Brunswick, New Jersey (GeoCode program 2003 Version 1.02), which the manufacturer promoted as able to correctly assign FIPS codes, down to the level of Census blocks, to the addresses read into it.

Address information on Medicare beneficiaries is stored in the EDB in six address fields, each with a length of 22 characters. These address fields are generic, and labeled ADDRESS1, ADDRESS2, etc., and thus there is the potential for great variation in the type and order of information contained within the address fields. Upon examination, it appeared that the six fields were simply filled from left to right with whatever information had been collected about the beneficiary's address. The one exception was the beneficiary's zip code, which was always stored in the RESZIP field. However, the GeoLytics GeoCode program requires that the beneficiaries' address input files be formatted in the following way:

STREET, CITY, STATE ZIP

The GeoCode program requires that STREET contain the street number and street name, separated by a space, with the street name followed by a comma; then the city followed by a comma; and then the two-letter state postal abbreviation, a space, and the five-digit zip code. Extracting, validating, and formatting these four pieces of information from the EDB address fields so they could be used as input for the GeoCode program was challenging and extremely time-consuming. To meet this challenge, we developed the following procedures to apply to the EDB records:

  1. Identify, for each beneficiary, what information is contained in each EDB address field.
  2. Extract the necessary information from the address fields, and create separate street, city, state, and zip code variables.
  3. Verify that street, city, and state variables contain the information they are supposed to, check that the information is in the correct format, and, if not, put it in the correct format.
  4. Output a text file (an ASCII text file, *.txt) in the proper format required as input for the GeoCode program.
  5. Run the GeoCode program.
    1. Input the address text file.
    2. Output.
      1. A text file summarizing the results of the address matching program.
      2. A database file (*.dbf) containing block IDs, error and accuracy codes, and other information related to the matched addresses.
  6. Import the database file (*.dbf) into SAS, which transforms the *.dbf file to a *.sas7bdat file.
  7. Merge the full transformed address file back onto the EDB records. This step adds a US Census-based geographic identifier (a string of FIPS codes) to each person-level beneficiary record.

This process was used to geocode the 10 separate segments of the unloaded EDB. The final step in the process allows the EDB to be linked to Census data files using the block group FIPS code that is common to both.
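
The final linking step can be illustrated with a short Python sketch. It is not the project's SAS code; the function names, record keys, and sample values are all hypothetical. It relies on the structure of a Census block code, which is 15 digits long (2 for state, 3 for county, 6 for tract, 4 for block), with the first digit of the block number identifying the block group, so the first 12 digits form the block group code.

```python
def block_group_fips(block_fips):
    """Derive the 12-digit block group code from a 15-digit census block code."""
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError(f"expected a 15-digit block code, got {block_fips!r}")
    return block_fips[:12]


def attach_ses(beneficiaries, geocoded, census_by_block_group):
    """Merge geocoder output onto beneficiary records, then look up census SES data.

    All three arguments are hypothetical in-memory stand-ins for the SAS datasets:
    beneficiaries is a list of dicts with an "id" key, geocoded maps id -> block
    code, and census_by_block_group maps block group code -> SES measures.
    """
    merged = []
    for person in beneficiaries:
        record = dict(person)
        block = geocoded.get(person["id"])
        if block is not None:  # unmatched addresses keep no geographic code
            bg = block_group_fips(block)
            record["block_group"] = bg
            record["ses"] = census_by_block_group.get(bg)
        merged.append(record)
    return merged


people = [{"id": "A1"}, {"id": "B2"}]
geocoded = {"A1": "340230001001005"}  # made-up block code
census = {"340230001001": {"median_household_income": 52000}}  # made-up values
result = attach_ses(people, geocoded, census)
```

Keeping the codes as strings rather than numbers preserves the leading zeros that appear in many state and county codes.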

Time and resources did not permit us to identify and perform all of the necessary address preparation and verification activities manually on all 41 million-plus beneficiaries in the EDB. Instead, we used a random sample of addresses to identify incorrect patterns present in the beneficiaries' addresses in the EDB. Specifically, we took a smaller batch of EDB records, those corresponding to the 830,728 beneficiaries who responded to the CAHPS surveys we had used earlier to develop the algorithm for improving the EDB race/ethnicity coding, and used it to identify the various patterns exhibited in the EDB address fields. We developed SAS programs to extract, reformat, and validate the address information we needed, and then tested the performance of the GeoCode CD program. The following are the steps we performed to get the addresses from the EDB into good enough shape to run through the GeoCode program.

Identify and extract the information in each address field. EDB address fields could potentially follow many different patterns, and some did contain a good deal of superfluous or invalid information. Fortunately, the majority of records did follow a standard pattern:

  1. ADDRESS1 contained the beneficiary's street address: both the street number and the street name. In some cases, this field also contained a direction (e.g., "East 1st Street," or "E 1st Street," or "1st Street E"), and/or an apartment number (footnote 9).
  2. ADDRESS2 contained either the beneficiary's city and state of residence or the beneficiary's apartment number.
  3. ADDRESS3, in cases where the ADDRESS2 field contained the apartment number or the like, contained the beneficiary's city and state of residence.
  4. The last field with non-missing data typically contained the city and state of residence. So, in most cases, address fields 4, 5, and 6 were blank; a lesser number of cases had a blank for address field 3 as well.

The SAS program we wrote set the variable STREET equal to the EDB address field that should contain the street address (typically ADDRESS1). It also extracted separate CITY and STATE variables from the EDB address field that contained the city and state.
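
A minimal Python sketch of this extraction step follows. It is an illustration under stated assumptions, not the SAS program itself: the function name is hypothetical, and it assumes the last filled field holds the city followed by the state as the final whitespace-separated token, a layout the text does not spell out.

```python
def split_address_fields(fields):
    """Split generic EDB-style address fields into STREET, CITY, and STATE.

    Follows the standard pattern: the first field holds the street address and
    the last non-blank field holds "CITY STATE". Returns None when fewer than
    two fields are filled or the last field has no separate state token.
    """
    filled = [f.strip() for f in fields if f and f.strip()]
    if len(filled) < 2:
        return None
    street = filled[0]
    # Assumption: the state is the final token of the last filled field.
    city, _, state = filled[-1].rpartition(" ")
    if not city:
        return None
    return {"STREET": street, "CITY": city.rstrip(", "), "STATE": state.upper()}


parsed = split_address_fields(["123 MAIN ST", "APT 4", "EAST BRUNSWICK NJ", "", "", ""])
```

Records that do not fit the standard pattern return None here, so they can be set aside for separate handling instead of being assigned a wrong city or state.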

The RESZIP field in the EDB data contains the 9-digit Zip code. The SAS program dropped the last four digits of the EDB RESZIP variable, and created a new variable with the 5-digit Zip code (ZIP).
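
The truncation can be sketched in Python as follows (illustrative only; the function name is hypothetical, and the handling of malformed values is an assumption, since the text does not say how the SAS program treated them).

```python
def five_digit_zip(reszip):
    """Return the first five digits of a ZIP or ZIP+4 value, or None if malformed."""
    digits = "".join(ch for ch in str(reszip) if ch.isdigit())
    if len(digits) not in (5, 9):
        return None  # assumption: anything other than 5 or 9 digits is unusable
    return digits[:5]


zip5 = five_digit_zip("088161234")
```

The value should be stored as a character string, because a numeric field would lose the leading zero found in many northeastern zip codes.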

Verify the values and formats of STREET, CITY, and STATE. The first part of this step was completed prior to running addresses through the GeoCode program search engine. To verify that STREET and STATE contain the correct data, the SAS program checked for two things:

  1. That the string of characters contained in the new variable, STREET, actually started with a number. This does not provide 100 percent verification, as it is possible for the string of characters contained in the variable STREET to start with a number, but not be an actual street address. However, this step does help ensure that STREET contains a street address.
  2. That the string we identified as the state of residence (the new variable, STATE) was a valid two letter state postal abbreviation.
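
The two checks above can be sketched in Python as follows. This is an illustration, not the project's SAS code: the names are hypothetical, and the list of accepted abbreviations (the 50 states plus the District of Columbia) is an assumption, because the text does not say which abbreviations the SAS check accepted.

```python
import re

# Assumed list: 50 states plus DC. The source does not state the exact list used.
VALID_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA "
    "WV WI WY".split()
)


def street_starts_with_number(street):
    """True when the first non-blank character of the street string is a digit."""
    return bool(re.match(r"\s*\d", street or ""))


def is_valid_state(state):
    """True when the value is one of the accepted two-letter abbreviations."""
    return (state or "").strip().upper() in VALID_STATES
```

As the text notes, the first check is necessary but not sufficient: a value such as "4TH FLOOR" passes it without being a street address.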

At this point, the STATE and ZIP variables were considered finalized. The remainder of the SAS algorithm focused on cleaning the STREET variable and ensuring that it was in the proper format. Before cleaning STREET, we dropped any cases where the GeoCode program would be unable to make a match, and for which we could not obtain a match simply by reformatting the data. Dropped were addresses where:

  1. The street address was missing.
  2. The beneficiary's state was invalid (as indicated by an invalid two-letter state postal abbreviation, which was often a foreign country), or the beneficiary lived in Puerto Rico (footnote 10).
  3. The beneficiary's address was a rural route, an RFD, a P.O. Box, or a Box number.

For the remaining cases, CITY appeared to be relatively clean, and we did not attempt to reformat or validate that particular variable subsequent to dropping the cases listed above. Approximately 12.5 percent of the EDB records were dropped by this point, leaving us with about 87.5 percent of the records to which we applied further cleaning algorithms.
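
The three drop rules can be sketched in Python as below. This is a hedged illustration, not the SAS logic: the function name and reason labels are hypothetical, the caller supplies the set of valid state abbreviations, and the spellings of rural route and box addresses that the pattern catches are assumed, since the source does not list them.

```python
import re

# Assumed spellings of non-street addresses, matched after periods are removed.
NON_STREET = re.compile(r"^(P ?O ?BOX|BOX|RURAL ROUTE|RR|RFD|R F D)\b")


def drop_reason(street, state, valid_states):
    """Return why an address would be dropped before cleaning, or None to keep it."""
    if not street or not street.strip():
        return "missing street"
    state = (state or "").strip().upper()
    if state == "PR":
        return "Puerto Rico"
    if state not in valid_states:
        return "invalid state"
    # Normalize: uppercase, periods to spaces, runs of blanks collapsed.
    normalized = " ".join(street.upper().replace(".", " ").split())
    if NON_STREET.match(normalized):
        return "rural route or box"
    return None


valid = {"NJ", "NY"}  # shortened list for the example only
reasons = [
    drop_reason("", "NJ", valid),
    drop_reason("12 CALLE SOL", "PR", valid),
    drop_reason("P.O. Box 12", "NJ", valid),
    drop_reason("123 BOXWOOD LN", "NJ", valid),
]
```

Returning a reason rather than a simple yes or no makes it possible to tabulate how many records each rule removes.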

At this point, we began an iterative process of running small samples of the Medicare CAHPS survey addresses through the GeoCode address-matching process, identifying format-related problems in the street address field, and developing SAS code to repair the problems. Based on this testing process, we developed a series of six "fixes" (footnote 11), all of which were targeted to reformat specific anomalies that occurred regularly in the street address field. These fixes made repairs related to three basic elements of a street address that caused the address-matching program to fail to find a match for what is a valid address:

  1. Street address fields sometimes contained apartment, suite, lot, or unit numbers. While these are valid for mailing, the GeoCode program will return an error (i.e., "street not found") on an address containing one of these numbers. The first "fix" applied to the EDB address removed the apartment number (or analogue) out of the STREET field. This fix cleared the path for the subsequent five fixes that were applied to the STREET field.
  2. In cases where the street NAME was actually a number (e.g., 25th Street, 1st Avenue, etc.), the GeoCode program failed to find a valid match for the street if the suffix was missing from the numbered street. The suffix was almost always missing in the EDB address fields. We tested the suffix problem manually, and found that the simple addition of a suffix could, in many cases, turn a null match into an exact match. Numerical street names appear in a variety of patterns in the STREET variable, and four out of the five remaining fixes were designed to detect these patterns and make the appropriate changes.
  3. In some records, the street address contained what appeared to be a double street number - one 2- or 3-digit number, followed by a space, then another 2- or 3-digit number. We discovered that in some places, particularly Queens, NY, the space needs to be replaced by a dash. In other places, however, it is unclear if the double number with a space is valid, or if the space should be deleted. In those cases, the double number was left as is.
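
The first two kinds of repair can be illustrated with a Python sketch. It is not a reconstruction of the six SAS fixes, whose exact patterns are not documented here. All names are hypothetical, only a trailing unit designator is removed, and only one numbered-street pattern (house number, optional direction, bare number, street type) is handled. The dash repair for double street numbers is omitted because it depends on knowing the locality.

```python
import re

# Trailing unit designators such as "APT 4B", "SUITE 200", "LOT 12", or "#5".
UNIT = re.compile(
    r"\s*(?:\b(?:APT|APARTMENT|STE|SUITE|LOT|UNIT)\b\.?|#)\s*(?:\d[\w-]*|[A-Z])\s*$",
    re.IGNORECASE,
)

# House number, optional direction, bare numeric street name, then a street type.
NUMBERED_STREET = re.compile(
    r"^(\d+\s+(?:(?:N|S|E|W|NORTH|SOUTH|EAST|WEST)\s+)?)"
    r"(\d+)"
    r"(\s+(?:ST|STREET|AVE|AVENUE|RD|ROAD|BLVD|DR|DRIVE|PL|PLACE)\b.*)$",
    re.IGNORECASE,
)


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1ST, 22ND, 13TH."""
    if 10 <= n % 100 <= 20:
        suffix = "TH"
    else:
        suffix = {1: "ST", 2: "ND", 3: "RD"}.get(n % 10, "TH")
    return f"{n}{suffix}"


def strip_unit(street):
    """Remove a trailing apartment, suite, lot, or unit designator."""
    return UNIT.sub("", street).strip()


def add_ordinal_suffix(street):
    """Add the missing suffix to a bare numeric street name, if one is found."""
    m = NUMBERED_STREET.match(street)
    if not m:
        return street
    return f"{m.group(1)}{ordinal(int(m.group(2)))}{m.group(3)}"


def clean_street(street):
    """Apply the unit removal first, then the suffix repair."""
    return add_ordinal_suffix(strip_unit(street))


cleaned = clean_street("123 E 25 ST APT 4B")
```

Running the unit removal first mirrors the ordering described above, where the first fix clears the way for the numbered-street fixes.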

For each fix, the SAS program outputs a text file listing, for each "fixed" record, the Medicare beneficiary's HIC number, the observation number, the address in its original, "pre-fixed" format, the pattern of the new format, and the actual "fixed" address. This allowed us to check that the fix actually did what we expected it to, and it provides a record of the difference between the old addresses and the new addresses.
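
A before-and-after log of this kind might be produced as in the following Python sketch. It is illustrative only: the function name, record keys, pipe delimiter, and sample identifiers are hypothetical, and a simple whitespace trim stands in for a real fix.

```python
def apply_with_audit(records, fix, pattern_label):
    """Apply one fix to each record's STREET and collect an audit line per change.

    records is a list of dicts with hypothetical keys "id", "obs", and "STREET";
    fix is any function mapping an address string to a corrected string.
    """
    audit = []
    for rec in records:
        original = rec["STREET"]
        fixed = fix(original)
        if fixed != original:
            # Field order follows the text: identifier, observation number,
            # original address, pattern of the new format, fixed address.
            audit.append("|".join([rec["id"], str(rec["obs"]), original, pattern_label, fixed]))
            rec["STREET"] = fixed
    return audit


records = [
    {"id": "X001", "obs": 1, "STREET": "10 MAIN ST  "},
    {"id": "X002", "obs": 2, "STREET": "22 OAK AVE"},
]
audit = apply_with_audit(records, str.strip, "trim trailing blanks")
```

Only records that actually change are logged, which keeps the review file limited to the cases a fix touched.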

Output corrected addresses. The SAS program uses the PUT statement in conjunction with the FILE statement to output a single ASCII text file (*.txt) of addresses in the STREET, CITY, STATE ZIP format. This file contains all of the addresses that have been cleaned (100 percent of the records that were run through the fixes, or about 87.5 percent of the total number of beneficiary records). During testing we started with a CAHPS-matched EDB file with 830,728 records, which was reduced to 760,961 after the SAS program was run.
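
The output step can be sketched in Python as follows. The project used the SAS PUT and FILE statements; this is an analogous illustration, and the function names and record keys are hypothetical. The last line works out the share of the CAHPS-matched test file that remained after cleaning, using the two record counts given above.

```python
def geocode_input_line(street, city, state, zip5):
    """Build one input line in the STREET, CITY, STATE ZIP layout."""
    return f"{street.strip()}, {city.strip()}, {state.strip().upper()} {zip5}"


def write_address_file(path, addresses):
    """Write one formatted address per line to an ASCII text file; return the line count."""
    count = 0
    with open(path, "w", encoding="ascii", errors="replace") as out:
        for a in addresses:
            out.write(geocode_input_line(a["STREET"], a["CITY"], a["STATE"], a["ZIP"]) + "\n")
            count += 1
    return count


line = geocode_input_line("123 E 25TH ST", "EAST BRUNSWICK", "nj", "08816")

# Share of the CAHPS-matched test file retained after the SAS program was run.
retained_share = round(100 * 760_961 / 830_728, 1)
```

For the test file, 760,961 of 830,728 records is about 91.6 percent retained.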

7. This excludes 266 beneficiaries who were originally coded as missing in the EDB but are now coded as Hispanic. Beneficiaries who were already coded as Hispanic in the EDB are also not included in this total.

8. This excludes 68 beneficiaries who were originally coded as missing in the EDB but are now coded as A/PI. Beneficiaries who were already coded as A/PI in the EDB are also not included in this total.

9. There are also several analogues to apartment number that appear in address fields, including suite number, lot number (in the case of mobile home parks), unit number, etc.

10. The GeoCode program does not match addresses in Puerto Rico.

11. The "fixes" were numbered according to the order in which they were developed. However, the order in which they were applied in the SAS programs does not follow this numbering. Some fixes developed later (Fix 5, for example) had to be applied before earlier fixes.


Current as of January 2008
Internet Citation: Chapter 2a: Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries - Chapter 2a. January 2008. Agency for Healthcare Research and Quality, Rockville, MD.