Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement

1. Introduction

Ensuring the delivery of high-quality, patient-centered care requires understanding the needs of the populations served. The nation's health care data infrastructure does not provide the necessary level of detail to understand which groups are experiencing health care disparities or would benefit from targeted quality improvement efforts. Categories for collection and methods of aggregation for reporting race, ethnicity, and language data vary. Challenges to improving data quality include nonstandardized categories, a lack of understanding of why data are collected, health information technology (Health IT) limitations, and a lack of sufficiently descriptive response categories, among others. Throughout this report, the subcommittee addresses these challenges as it recommends a standardized approach to eliciting race, ethnicity, and language data and defines a standard set of categories for these data.

Hennepin County Medical Center in Minneapolis, Minnesota, may very well be one of the Midwest's most diverse hospitals. Its patient population includes persons of Somali, Mexican, Ecuadorian, Russian, Vietnamese, and Bosnian heritage, born in this country or elsewhere, to name but a few of the populations in a state that has historically been populated by persons identifying themselves as White and of German and Scandinavian origin. As a March 2009 New York Times profile of the hospital emphasized, each of these ethnic groups brings "distinctive patterns" of illness, injury, language, and health beliefs (Grady, 2009), all of which affect how health professionals can best provide safe, timely, effective, patient-centered, efficient, and equitable care, as delineated in the Institute of Medicine's 2001 report, Crossing the Quality Chasm: A New Health System for the 21st Century (IOM, 2001).

Cultural lifestyle patterns (e.g., food choices and smoking habits) and beliefs about the use of health care influence the quality of care received regardless of the person's country of origin, language, immigration status, or socioeconomic status (SES). The importance of knowing a patient's race, ethnicity, and language need is not limited to understanding the issues facing recent immigrants' health access or outcomes; race, ethnicity, and language data can reveal risks for health care disparities in native-born as well as foreign-born populations. Such data ideally allow:

  • Targeted interventions by health plans and health system providers when certain populations have higher than average or potentially avoidable hospitalizations.
  • Identification of differentials in health status, quality of care, and outcomes among populations (even when insurance status is the same) by agencies such as the Centers for Medicare and Medicaid Services (CMS).
  • Planning of language assistance services to support physicians and other staff that interact directly with diverse patient populations.
  • Development of health promotion outreach strategies to specific groups (e.g., outreach efforts to Somali women who are susceptible to vitamin D deficiency to prevent later, more costly emergency department visits for diagnosis and pain treatment) by public health departments and health care providers working in collaboration.

One of the biggest barriers most health systems face in improving quality and reducing disparities within their own walls is systematically identifying the populations they serve, addressing the needs of these populations, and monitoring improvements over time. This systematic analysis may reveal no disparities in the delivery of health care but instead show that different groups have different health care needs (e.g., educating Somali women on the need for vitamin D, earlier cancer screening for racial and ethnic groups at increased risk, addressing ethnocultural beliefs regarding temperature and onset of childhood asthma among Puerto Ricans, therapeutic strategies to reduce risk of diabetic kidney disease among Pima Indians) (American Cancer Society, 2009; Grady, 2009; Pachter et al., 2002; Pavkov et al., 2008). The ultimate goal of identifying differences is to improve the quality of care for each person and thereby enhance his or her health.


Strong evidence exists that there are disparities in health and the quality of health care received by different populations (AHRQ, 2008; IOM, 2003; Kaiser Family Foundation, 2009). In conceptualizing an approach to addressing disparities in health care systems, Kilbourne and colleagues describe three critical phases: detection of disparities, understanding of factors, and development and implementation of interventions (Figure 1-1) (Kilbourne et al., 2006). The detection phase includes three key components: defining health care disparities, identifying vulnerable populations, and developing valid measures. The detection phase requires organizations to systematically collect relevant demographic data and to link these data to measures of quality. This phase brings health systems one step closer to understanding where the disparities (or differential health care needs) exist, which can lead to understanding why they exist and identifying some of the causal factors. Once systems have detected and understood disparities, they are better positioned to develop and implement targeted interventions to reduce those disparities (Kilbourne et al., 2006). The fundamental step is collecting data that adequately describe populations, allowing for the stratification of quality measures at a level of detail that can identify variation in health and health care among at-risk groups (Hasnain-Wynia and Rittner, 2008).
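The stratification step described above can be illustrated with a minimal Python sketch. It assumes a hypothetical record layout (an "ethnicity" field and a yes/no "screened" measure) and uses made-up records; neither the field names nor the data come from this report.

```python
from collections import defaultdict

def stratified_rates(records, group_key, measure_key):
    """Return {group: (met, total, rate)} for a yes/no quality measure.

    Records with a missing group value are counted under "Unknown"
    rather than dropped, so gaps in data collection stay visible.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [met, total]
    for rec in records:
        group = rec.get(group_key) or "Unknown"
        counts[group][1] += 1
        if rec[measure_key]:
            counts[group][0] += 1
    return {g: (met, total, met / total) for g, (met, total) in counts.items()}

# Hypothetical, made-up records for illustration only.
patients = [
    {"ethnicity": "Somali", "screened": True},
    {"ethnicity": "Somali", "screened": False},
    {"ethnicity": "Mexican", "screened": True},
    {"ethnicity": None, "screened": True},
]
rates = stratified_rates(patients, "ethnicity", "screened")
for group, (met, total, rate) in sorted(rates.items()):
    print(f"{group}: {met}/{total} ({rate:.0%})")
```

Keeping an explicit "Unknown" stratum is a design choice in this sketch: it shows how much of the population cannot yet be described, which is itself a data quality signal.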

The subcommittee's task is to develop recommendations on standardized categories of race, ethnicity, and language data to support the processes of recognizing differential needs in health care, and identifying and reducing or eliminating disparities. Race, ethnicity, and language information can inform point of care needs, application of resources, and decisions in patient-provider interactions in ways that improve absolute levels of health care quality for all. At the microsystem level, physician practices and individual hospitals can use data to understand the population being served, address disparities in care that exist, and monitor improvements over time. At an intermediate level, data can be used, for example by health plans or states, to make cross-institutional comparisons to detect variations in quality of care between entities serving similar populations. And at the macro level, through national reporting and aggregation, population data can indicate where consistent disparities in care exist nationally (Thomas, 2001).

This chapter provides background on key issues and challenges surrounding the categorization and collection of race, ethnicity, and language data for health care quality improvement. First, the complexity of defining the concepts of race and ethnicity is explored. Next, the chapter examines challenges to the collection of these demographic data, the impetus for standardization, the utility of the current Office of Management and Budget (OMB) race and Hispanic ethnicity categories, and the need for more detailed data on race, ethnicity, and language need. The chapter concludes by reviewing the subcommittee's study charge and providing an overview of the remainder of this report.

Defining Race and Ethnicity

The concepts of race and ethnicity are defined socially and culturally and, in the case of federal data collection, by legislative and political necessity (Hayes-Bautista and Chapa, 1987). OMB, for example, states that race and ethnicity categories "are social-political constructs and should not be interpreted as being scientific or anthropological in nature" (OMB, 1997a). Scientific findings provide empirical evidence that there is more genetic variation within than among racial groups; thus, racial categories do not represent major biological distinctions (Cooper and David, 1986; Williams, 1994; Williams et al., 1994) and instead capture socially constructed intersections of political, historical, legal, and cultural factors.

People have been racially categorized by the federal government since the first U.S. Census was conducted in 1790 (Bennett, 2000). Since then, the national statistical system has employed a variety of racial categories, most of which stem from racial classifications that originated in the mid-eighteenth century (Witzig, 1996). Commentators have noted that it is remarkable how little the categories have changed, despite what is now known about the lack of correlation between racial phenotypes and genetic differences (Cavalli-Sforza et al., 1994; Diamond, 1994; Witzig, 1996).

The complex history of racial identification in the United States (Byrd and Clayton, 2000; Smedley, 1999) results in concepts of race and ethnicity that not only have changed over time, but also are subject to self-perceptions, which may also change (Ford and Kelly, 2005; Hahn, 1992); technical decisions defining who belongs in which category; and the perceptions of a person recording another individual's race. In the latter instance, for example, individuals who self-identify as American Indians are frequently classified as White by health care workers when a determination is made by observation alone, without self-report (Izquierdo and Schoenman, 2000). For historical background, The 2000 Census: Counting Under Adversity provides an extensive review of the historical development of the racial and ethnic classifications used by the Bureau of the Census, and Chapter 3 in Multiple Origins, Uncertain Destinies: Hispanics and the American Future reviews the origins of Hispanic ethnicity and its relationship to race.


Imprecision in defining and using the terms race and ethnicity is apparent in the conflicting and overlapping terminologies used even by the government bodies responsible for statistical data collection and classification. In some instruments, the federal government considers race and ethnicity to be distinct concepts (Grieco and Cassidy, 2001); in other instruments, questions on race include racial, national origin, and ethnicity response options. The term race is often used synonymously with ethnicity, ancestry, nationality, and culture (Williams, 1994; Yankauer, 1987). For example, Census 2000 and 2010 forms ask, "What is this person's race?" (U.S. Census Bureau, 2009) and provide response categories that blur definitions of race, national origin, and ethnicity. Such practices both reflect and reinforce the lack of uniformity in how the term ethnicity is perceived (Macdonald et al., 2005; Thernstrom et al., 1980). The term Hispanic is often listed alongside terms that define racial groups (e.g., Asian and White), resulting in many Hispanics beginning to view themselves as a separate race. Thus, when Hispanics are required to choose a race in addition to their Hispanic ethnicity, many self-identify as "Some other race" (NRC, 2006). The Census Bureau's definition of "Some other race" is included in Table 1-1.


Race and ethnicity can be important statistical predictors of an individual's risk for good or poor health outcomes and access to care (NRC, 2004b; Wallman et al., 2000; Williams, 1994). However, a multitude of factors that are both correlated with and independent of race and ethnicity may affect group differences in health and health care. The model presented in Figure 1-2 indicates the complex relationships between environmental conditions, socioeconomic status, discrimination, racism, and health care. In this model, health care (called medical care in the figure), or lack thereof, is viewed as both a risk factor and resource that impacts an individual's health status. Because of the complex relationships depicted in this model, the concepts of race and ethnicity should be dealt with deliberatively, purposefully, and thoughtfully (Williams et al., 1994).

A 2004 National Research Council committee charged with defining the measurement of racial discrimination concluded that "race is a salient aspect of social, political, and economic life" and that collecting data on race and ethnicity is therefore necessary to "monitor and understand differences in opportunities and outcomes for population groups" (NRC, 2004c, p. 33). Thus, while there have been flaws in applying the terms race and ethnicity, the terms remain important to use in distinguishing the diversity of the U.S. population.

While recognizing a certain lack of precision and consistency in the terms race and ethnicity for defining population groups that would be unacceptable with any other variable used in scientific inquiry (Kagawa-Singer, 2009), the subcommittee chose to adopt the definitions put forth in the 2003 IOM report Unequal Treatment: Confronting Racial and Ethnic Disparities in Healthcare. Race is considered a "socioeconomic concept wherein groups of people sharing certain physical characteristics are treated differently based on stereotypical thinking, discriminatory institutions and social structures, a shared worldview, and social myths" (IOM, 2003, p. 525). Other definitions of race abound. For example, OMB states that race and ethnicity should not be interpreted as being primarily biological or genetic in reference, but rather, thought of in terms of social and cultural characteristics as well as ancestry (OMB, 1997b). The Census Bureau complies with the OMB standards, noting that the standards "generally reflect a social definition of race recognized in this country. They do not conform to any biological, anthropological or genetic criteria" (U.S. Census Bureau, 2001). For the purposes of this report, the subcommittee considers ethnicity to be a concept referring to a shared culture and way of life, especially reflected in language, religion, and material culture products (IOM, 2003). The subcommittee makes a distinction between the limited OMB and Census Bureau use of the term ethnicity to connote solely Hispanic ethnicity and the concept of granular ethnicity advanced in this report and further defined in Chapters 2 and 3. Additionally, the subcommittee recognizes that linguistic barriers can present significant challenges to both patients and providers and thus has adopted a definition of language that is inclusive of communication needs. This report develops an approach to the collection of data on these key variables and offers a framework of race, ethnicity, and language categories and questions for the collection and use of these data in health care quality improvement efforts.

Challenges to Collecting Race, Ethnicity, and Language Data

A variety of entities, such as states, health plans, health professionals, hospitals, community health centers, nursing homes, and public health departments, as well as the public, play roles in obtaining, sharing, and using race, ethnicity, and language data. All of these entities, though, have different reasons for and ways of categorizing, collecting, and aggregating these data. In interviews and testimony before the subcommittee, representatives of hospitals, health plans, physicians, and custodians of federal health care databases consistently identified several challenges to improving the quality and availability of race, ethnicity, and language data in patient-provider encounters and at various levels of the health care system (Box 1-1). The principal challenges in obtaining these data for use in quality improvement assessments include a lack of standardization of categories to foster data sharing and aggregation (Lurie et al., 2005; Siegel et al., 2007), a lack of understanding of why the data are being collected (Hasnain-Wynia et al., 2007; Regenstein and Sickler, 2006), a lack of space on collection forms and in collection systems (Coltin, 2009; Hasnain-Wynia et al., 2007; Ting, 2009), health information technology (Health IT) limitations (e.g., field capacity and linkages among systems) (Coltin, 2009), and the fact that the current OMB categories are not sufficiently descriptive of locally relevant population groups (Friedman et al., 2000; NRC, 2004b). These issues, though challenging, are not insurmountable; thus, the subcommittee seeks to identify options for moving forward and improving the categorization, collection, and aggregation of race, ethnicity, and language data.


Box 1-1. Barriers to Collection of Race, Ethnicity, and Language Data

System Level

  • Lack of standardization of categories.
  • Lack of understanding why data are collected.
  • Provided response categories not sufficiently descriptive for local populations.
  • Health IT limitations (number of fields, comparability of categories among systems).
  • Space on collection forms (paper or electronic).
  • Discomfort on part of person collecting.

Patient-Provider Encounter

  • Lack of standardization of categories.
  • Lack of understanding why data are collected.
  • Provided response categories not sufficiently descriptive for local populations to self-identify with.
  • Privacy concerns.

Standardizing Categories

The reasons for standardizing race, ethnicity, and language categories for data collection for health care quality improvement are four-fold:

  1. Ensuring that equivalent categories are being collected and compared across settings.
  2. Minimizing the reporting burden that arises when multiple entities require different sets of incompatible categories.
  3. Optimizing the ability to share data across systems of payers, health care settings, government agencies, and political jurisdictions.
  4. Going beyond the OMB categories to develop response options that are more relevant for the identification of needs for quality improvement.

Sharing and comparing data across systems calls for a common vocabulary to avoid omission of categories that might be critical to monitoring disparities and to allow mapping of categories from one system to another.

The expansion of electronic health records (EHRs) and integration of data systems creates an opportunity to establish uniform categories and coding practices. EHRs are further defined in Chapter 6 of this report. Developing linkages among health data systems would provide a more comprehensive picture of health care quality. Doing so would be greatly facilitated by having the ability to "read" comparable data from disparate sources, a proposition that requires standardized categories, coding, and procedures for aggregating granular data to broader categories whenever necessary.

Current Status of National Standards for Categorizing and Collecting Race, Ethnicity, and Language Data

In specifying a system that can provide uniformity and comparability in the collection and use of data by federal agencies, OMB provides a minimum standard for collecting and presenting data on Hispanic ethnicity and race (Box 1-2) (OMB, 1997b). The driving force for the development of this standard in the 1970s was the need for comparable data for civil rights monitoring; thus the categories reflect legislatively based priorities for data on particular population groups, including congressionally mandated separate counts of the Hispanic population (Wallman et al., 2000). Because the standard was not designed with regard to health or health care specifically, the groups identified by the OMB categories may not be the only analytic groups useful for advancing health care quality improvement.

The OMB standard was envisioned as a minimum reporting requirement, and more discrete categorization is encouraged as long as these categories can be rolled up to the six OMB race and Hispanic ethnicity categories (OMB, 1997a). For example, the Census Bureau and some Department of Health and Human Services (HHS)-sponsored national surveys use the OMB minimum categories plus other categories that can be aggregated into the minimum categories for analysis and reporting.
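The roll-up requirement can be sketched in a few lines of Python. The five OMB race category names below are taken from Box 1-2; the granular-to-OMB lookup table is a small, hypothetical example for illustration, not a mapping defined by this report, and a real system would rely on a maintained code set.

```python
from collections import Counter

# The five OMB minimum race categories listed in Box 1-2.
OMB_RACE_CATEGORIES = {
    "American Indian or Alaska Native",
    "Asian",
    "Native Hawaiian or Other Pacific Islander",
    "Black or African American",
    "White",
}

# Hypothetical roll-up table: each locally collected granular
# category points to exactly one OMB category.
ROLLUP = {
    "Vietnamese": "Asian",
    "Hmong": "Asian",
    "Samoan": "Native Hawaiian or Other Pacific Islander",
    "Navajo": "American Indian or Alaska Native",
}

def roll_up(granular):
    """Map a granular category to its OMB category, or flag it as unmapped."""
    return ROLLUP.get(granular, "Unmapped")

def aggregate(responses):
    """Count responses per OMB category for reporting."""
    return Counter(roll_up(r) for r in responses)

counts = aggregate(
    ["Vietnamese", "Hmong", "Samoan", "Navajo", "Vietnamese", "Not listed"]
)
print(counts)
```

Responses with no entry in the table are flagged as "Unmapped" instead of being guessed, so that gaps in the roll-up table surface during reporting.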

No nationally standardized minimum set of languages comparable to the OMB race and Hispanic ethnicity categories exists. HHS, in conformance with Department of Justice principles to prevent discrimination and to ensure access to federally funded programs, has provided guidance on the importance of collecting language data (HHS, 2003) in its Culturally and Linguistically Appropriate Services (CLAS) standards. Four of the 14 standards are federally mandated for all health care organizations that receive federal funds. These organizations must offer and provide competent language assistance services and must make documents available in "the languages of the commonly encountered groups and/or groups represented in the service area." The CLAS standards do not list language categories to be used for data collection and analysis but seek to ensure the provision of language assistance services and culturally competent care in all health care settings (Office of Minority Health, 2001).

In agencies that are not federal or organizations that do not receive federal funds or federal contracts, race, ethnicity, and language data may not be collected because state, local, and private sector data collection is not universally mandated. Furthermore, those data that are collected do not necessarily adhere to a uniform set of categories; hospitals, health plans, community health centers, employers, and providers collect data in disparate ways.


Box 1-2. The 1997 OMB Revisions to the Standards for the Classification of Federal Data on Race and Hispanic Ethnicity

Hispanic Ethnicity

  • Hispanic or Latino origin.
  • Not of Hispanic or Latino origin.

Race

  • American Indian or Alaska Native.
  • Asian.
  • Native Hawaiian or Other Pacific Islander.
  • Black or African American.
  • White.


  • Designed to be minimum categories. Additional categories can be used provided they can be aggregated into the standard categories.
  • Requires separate collection of Hispanic ethnicity and race data.
  • Requires Hispanic ethnicity question before race question, when the two-question format is used.
  • Requires allowance for selection of more than one race category (e.g., "Select one or more").
  • Preference for self-reported race and Hispanic ethnicity.

Use of the Standards

  • Used at a minimum for all federally sponsored statistical data collections that include data on race and ethnicity.

An Approach to Improving the Categorization and Aggregation of Data

The OMB categories are not sufficiently descriptive to distinguish among locally relevant ethnic populations that face unique health problems and may have dissimilar patterns of care and outcomes (Hasnain-Wynia and Baker, 2006). When more detailed data are collected and used locally, aggregation to the OMB categories loses detailed quality-related information for specific populations. As linkages among quality reporting systems become more common and allow aggregation of data from multiple sources, consistent methods of identifying subgroups will facilitate more robust analyses of detailed population data at the local, regional, state, and national levels. Any national standard list of categories for those subgroups must capture the full diversity of the U.S. population.

The keys to the usefulness of such a list across the country are balancing that comprehensiveness with the desired level of granularity to describe locally pertinent groups, and resolving any administrative and logistical barriers to collecting a sufficient number of informative categories to help guide quality improvement.

The three principal means of obtaining race, ethnicity, and language data are self-report, observation, and indirect estimation. Self-report, which reflects how individuals view themselves, is the widely preferred approach as it has been adopted by OMB (OMB, 1997b) and is considered by researchers to be the "gold standard" (Higgins and Taylor, 2009; Wei et al., 2006). The Interagency Committee for the Review of the Racial and Ethnic Standards reviewed the OMB standards prior to the 1997 revisions and determined that self-report respects "individual dignity" by allowing an individual to determine how he or she classifies himself or herself as opposed to classification being assigned by another person (OMB, 1997a).

The Health Research and Educational Trust (HRET) Toolkit and the National Health Plan Collaborative provided guidance on collecting data on race, Hispanic ethnicity, more detailed ethnicity, and language need (Hasnain-Wynia et al., 2007; NHPC, 2008). The HRET Toolkit was recently endorsed by the National Quality Forum (NQF, 2008); however, the languages are limited to those most common at the national level, it includes a single "multiracial" category instead of an instruction to allow persons to "Select one or more," and there is no "Other, please specify:__" option to capture additional categories with which individuals identify. Therefore, the framework for categorization and collection spelled out by this report provides a national standard for more thorough categorization and collection than has previously been put forth.

Addressing the Legality and Understanding the Purposes of Data Collection

The collection of data is impaired when its need is not well understood by health professionals and intake workers, and especially by patients themselves. Clinicians and administrators too often misperceive legal barriers and furthermore do not expect to see any disparities in their practice. Despite evidence of disparities at all levels of health and health care systems, hospital executives, physicians, and staff, for example, may believe that disparities are not a problem in their respective institutions (Weinick et al., 2008). Some worry that soliciting the information may put them at risk for offending patients, or if disparities are found, for accusations of discrimination (Hasnain-Wynia et al., 2004). Similarly, health plans have been concerned that they could be viewed as subjecting certain populations to discriminatory treatment by asking for such data in advance of enrollment. In fact, a few states prohibit the acquisition of race and ethnicity data at enrollment, but not thereafter. California, Maryland, New Hampshire, New Jersey, New York, and Pennsylvania prohibit insurers from requesting an applicant's race, ethnicity, religion, ancestry, or national origin in applications, but these states allow insurers to request such information from individuals after enrollment. A 2009 analysis of federal and state laws found no federal laws or regulations prohibiting health plans from collecting race and ethnicity data (AHIP, 2009).

The HRET Toolkit, the National Health Law Program (NHeLP), and the HHS Office of Minority Health (OMH) all emphasize that the collection of race, ethnicity, and language data is permitted under Title VI of the Civil Rights Act of 1964 and is, in fact, necessary to ensure compliance with the statute (Berry et al., 2001; Hasnain-Wynia et al., 2007; Perot and Youdelman, 2001). The Civil Rights Act requires recipients of federal financial assistance to collect information that demonstrates compliance, including "racial and ethnic data showing the extent to which members of minority groups are beneficiaries of and participants in federally-assisted programs." Furthermore, a July 2008 law required the Secretary of HHS to implement the collection of race, ethnicity, and gender data in the Medicare program in fee-for-service plans, Medicare Advantage private plans, and Part D prescription drug plans. The American Recovery and Reinvestment Act of 2009 (ARRA) also lays out expectations for the collection of race, ethnicity, and language data by specifying the inclusion of these variables in EHRs.

Although the legal basis for the collection of race and ethnicity data is well documented (AHIP, 2009; Perot and Youdelman, 2001; Rosenbaum et al., 2007; Youdelman and Hitov, 2001) and at least 80 program-specific statutes require the reporting and collection of race, ethnicity, and language data (Youdelman and Hitov, 2001), health care organizations may still perceive legal barriers, including concerns about the applicability of Health Insurance Portability and Accountability Act of 1996 (HIPAA) regulations, to collecting, sharing, and reporting these data. HIPAA restricts the use and disclosure of identifiable health information, but does not limit the collection of demographic data for quality improvement purposes (Kornblet et al., 2008).

A 2007 National Committee on Vital and Health Statistics (NCVHS) report addresses concern about the potential for harm arising from uses of data that are enabled by their collection and exchange through Health IT. The report acknowledges the potential for "discrimination, personal embarrassment, and group-based harm" when the data are compiled and exchanged (NCVHS, 2007, p. 5). The report recommends the protection of all uses of health data by all users under a framework of data stewardship, a concept that encompasses "the responsibilities and accountabilities associated with managing, collecting, viewing, storing, sharing, disclosing, or otherwise making use of personal health information" (AMIA, 2007), and the subcommittee agrees.

Efforts to collect these data may also be hampered by intake workers and patient registration staff who feel uncomfortable soliciting them from patients, and who feel burdened by collecting data whose importance they do not understand and cannot adequately explain if patients challenge the need for these data. Patients, meanwhile, may be hesitant to provide race, ethnicity, and language data because of concerns about privacy and their own uncertainty as to why these data are needed. Perceived experiences of discrimination in medical care have been found to be associated with greater apprehension about providing race and ethnicity information among, for example, Blacks, Hispanics, and Mandarin/Cantonese-speaking Asians (Kandula et al., 2009). Potential health plan enrollees, for instance, may fear discriminatory access to coverage, while hospital patients may worry that language questions serve as a proxy for questions about immigration status.

Addressing Health Information Technology (Health IT) Issues

Advances in Health IT, including recent federal government financing and support, may open doors to advance data collection. Currently, however, collecting and utilizing race, ethnicity, and language data in health care settings may be complicated by challenges in capturing sufficient data and in linking available data from disparate sources (Schoenman et al., 2007). For example, many hospitals and physician offices that collect these data enter them with other demographic characteristics at intake. These demographic data, then, are typically included in practice management systems, which are separate from the Health IT systems that capture clinical information used in quality measurement.

In many health care settings, space on data collection forms and space constraints in Health IT systems can be barriers to including detailed demographic data (Hasnain-Wynia et al., 2007). For example, while OMB stipulates the separate collection of race and Hispanic ethnicity data, some legacy Health IT systems allow only one field for capturing both elements. Similarly, some Health IT systems are unable to collect the multiple responses that result from the "Select one or more" approach required by OMB (Coltin, 2009).
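The difference between the two-question, "Select one or more" format and a single combined field can be made concrete with a short Python sketch. The record layout and the collapsing rule in legacy_single_field are hypothetical illustrations, not a description of any particular Health IT product.

```python
from dataclasses import dataclass
from typing import Optional

# The five OMB minimum race categories listed in Box 1-2.
OMB_RACES = frozenset({
    "American Indian or Alaska Native",
    "Asian",
    "Native Hawaiian or Other Pacific Islander",
    "Black or African American",
    "White",
})

@dataclass(frozen=True)
class Demographics:
    """Hispanic ethnicity and race stored separately; race is multi-valued."""
    hispanic_or_latino: Optional[bool]  # None means not reported
    races: frozenset                    # zero or more OMB race categories

    def __post_init__(self):
        unknown = self.races - OMB_RACES
        if unknown:
            raise ValueError(f"Not an OMB race category: {sorted(unknown)}")

def legacy_single_field(d):
    """Hypothetical one-field system: only a single label survives."""
    if d.hispanic_or_latino:
        return "Hispanic or Latino"
    if len(d.races) > 1:
        return "Multiracial"
    return next(iter(d.races), "Unknown")

person = Demographics(True, frozenset({"White", "Asian"}))
print(sorted(person.races))         # both selected races are retained
print(legacy_single_field(person))  # the races are lost in the single field
```

In this sketch the structured record keeps every reported element, while the single-field version discards the race selections as soon as Hispanic ethnicity is recorded, which is the kind of loss the constraints described above can cause.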

Some Health IT collection systems utilize drop-down screens and keystroke pattern matching to increase the number of category choices they can offer. Other paper and electronic systems default to lengthy lists that are time-consuming for both staff and patients to comb through, or use shorter lists and classify many persons under an indiscriminate "other" category. Open-ended questions (e.g., "Other, please specify:__"), which allow write-in responses, may improve self-identification but can impose additional administrative burdens if labor-intensive manual coding must be undertaken in the absence of automated systems or optical scanning technology. However, the use of "Other, please specify:__" as an adjunct check-off box captures respondent answers and is thus useful for more accurately describing all members in a service population.
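A minimal Python sketch of keystroke pattern matching combined with an "Other, please specify" fallback follows. The category list, function names, and prefix-matching rule are assumptions chosen for illustration; actual systems may match differently.

```python
def suggest(typed, categories, limit=5):
    """Return up to `limit` categories that start with the typed text."""
    prefix = typed.strip().casefold()
    if not prefix:
        return []
    matches = sorted(c for c in categories if c.casefold().startswith(prefix))
    return matches[:limit]

def record_response(typed, categories):
    """Store a listed category, or keep the write-in text under "Other"."""
    text = typed.strip()
    for category in categories:
        if category.casefold() == text.casefold():
            return {"category": category, "other_text": None}
    # Write-in answers are preserved verbatim for later coding.
    return {"category": "Other", "other_text": text}

# Hypothetical local category list.
CATEGORIES = ["Somali", "Samoan", "Salvadoran", "Bosnian"]

print(suggest("sa", CATEGORIES))
print(record_response("somali", CATEGORIES))
print(record_response("Oromo", CATEGORIES))
```

Keeping the write-in text alongside the "Other" flag means responses can later be coded automatically or added to the local list, rather than being lost in an undifferentiated category.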

Page last reviewed April 2018
Page originally created September 2012
Internet Citation: 1. Introduction. Content last reviewed April 2018. Agency for Healthcare Research and Quality, Rockville, MD. https://www.ahrq.gov/research/findings/final-reports/iomracereport/reldata1.html