Linking Registry Data: Technical and Legal Considerations (Text Version)

Slide presentation from the AHRQ 2009 conference

On September 14, 2009, Sara Rosenbaum and Alan Karr made this presentation at the 2009 Annual Conference.


Slide 1

Linking Registry Data: Technical and Legal Considerations

Sara Rosenbaum, JD
George Washington University

Alan F. Karr, PhD
National Institute of Statistical Sciences

Slide 2

Authors and Reviewers

Authors

  • Stephen E. Fienberg (lead), Carnegie Mellon University
  • Sara Rosenbaum (lead), George Washington University
  • Susan Adams (lead), Dartmouth College
  • Alan F. Karr, National Institute of Statistical Sciences
  • Bradley Malin, Vanderbilt University
  • Deven McGraw, The Center for Democracy and Technology
  • Maya A. Bernstein, Office of the Assistant Secretary for Planning and Evaluation, DHHS
  • Melissa M. Goldstein, George Washington University
  • Joy Pritts, Georgetown University
  • Andy DeMayo, Outcome

Reviewers

  • Julia Lane, National Science Foundation
  • Eric Peterson, Duke University Medical Center
  • Victoria Prescott, McBroom Consulting, LLC
  • Gerald Riley, Centers for Medicare & Medicaid Services
  • Marcy Wilder, Hogan & Hartson

Slide 3

Purpose

Increasingly, statistical methods are used to link data from multiple de-identified sources.

  • What is the risk of identifying patients by combining data from multiple registries?
  • What are the legal and ethical requirements on researchers to ensure patient privacy and confidentiality?

Slide 4

Paper Overview

A. INTRODUCTION
B. TECHNICAL ASPECTS OF DATA LINKAGE PROJECTS

  • Linking records for research and improving public health
  • What do Privacy, Disclosure, and Confidentiality mean?
  • Linking records and probabilistic matching
  • Procedural issues in linking datasets

C. LEGAL ASPECTS OF DATA LINKAGE PROJECTS

  • Risks of identification
  • The HIPAA Privacy Rule

Slide 5

Paper Overview (cont.)

D. RISK MITIGATION FOR DATA LINKAGE PROJECTS

  • Methodology for mitigating the risk of re-identification
  • Security practices, standards, and technologies

E. SUMMARY
F. SCENARIOS

  • Linking Clinical Registry Data with Insurance Claims Files
  • Planning for Data Linkage Projects

Slide 6

High-Level View: Linking records for research and improving public health

  • The scientific value of a registry increases with the number of cases and the extent of the health information included.
  • There is an ethical obligation to protect patient interests when collecting, sharing, and studying person-specific biomedical information.
  • Thus, a tension exists between the broad goals of registries and regulations protecting individually identifiable information.
  • A large body of federal law applies to health information privacy.

Slide 7

Key Terms

  • Privacy: protection of individuals against impermissible uses of personally identifiable information (PII), specifically protected health information (PHI)
  • Disclosure: attribution of information to the source of the data
  • Confidentiality: protection accorded to statistical data

Slide 8

Technical Aspects: Privacy, Disclosure and Confidentiality

Privacy

  • As used in the HIPAA Privacy Rule, the term applies to protected health information (PHI).

Slide 9

Technical Aspects: Privacy, Disclosure and Confidentiality (cont.)

Disclosure

  • Technical: the attribution of information to the source of the data.
    • Identity disclosure occurs when the data source becomes known from the data release itself.
    • Attribute disclosure occurs when the released data make it possible to infer the characteristics of an individual data source more accurately than would have otherwise been possible.
    • Inferential disclosure relates to the probability of identifying a particular attribute of a data source.
  • HIPAA: the release, transfer, provision of, access to, or divulging in any other manner of information outside of the entity holding the information.
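
The technical disclosure types above can be illustrated with a simple screening check. The sketch below is an assumption-laden illustration and not a method taken from the paper: the quasi-identifier fields (`age_band`, `zip3`, `sex`), the sensitive field (`diagnosis`), and the sample records are all hypothetical. It flags records that are unique on the quasi-identifiers (candidates for identity disclosure) and groups whose members all share one sensitive value (candidates for attribute disclosure).

```python
from collections import defaultdict

# Hypothetical quasi-identifiers; a real project would choose its own.
QUASI_IDENTIFIERS = ("age_band", "zip3", "sex")


def group_by_quasi_identifiers(records):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in QUASI_IDENTIFIERS)].append(rec)
    return groups


def identity_risk(records):
    """Keys held by exactly one record: candidates for identity disclosure."""
    groups = group_by_quasi_identifiers(records)
    return sorted(k for k, g in groups.items() if len(g) == 1)


def attribute_risk(records, sensitive="diagnosis"):
    """Groups of two or more records that all share one sensitive value."""
    groups = group_by_quasi_identifiers(records)
    return sorted(
        k for k, g in groups.items()
        if len(g) > 1 and len({r[sensitive] for r in g}) == 1
    )


# Fabricated illustrative records, not real data.
records = [
    {"age_band": "60-69", "zip3": "277", "sex": "F", "diagnosis": "AMI"},
    {"age_band": "60-69", "zip3": "277", "sex": "F", "diagnosis": "AMI"},
    {"age_band": "40-49", "zip3": "152", "sex": "M", "diagnosis": "CHF"},
    {"age_band": "50-59", "zip3": "372", "sex": "M", "diagnosis": "AMI"},
    {"age_band": "50-59", "zip3": "372", "sex": "M", "diagnosis": "CHF"},
]

unique_keys = identity_risk(records)
uniform_keys = attribute_risk(records)
```

A check like this is only a first screen: it says nothing about what an outside party already knows, which is what inferential disclosure depends on.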

Slide 10

Technical Aspects: Privacy, Disclosure and Confidentiality (cont.)

Confidentiality

  • A quality or condition of protection accorded to statistical information as an obligation not to permit the transfer of that information to an unauthorized party.
  • A different notion of confidentiality relates to the ethical, legal, and professional obligation of those who receive information in the context of a clinical relationship to respect the privacy interests of their patients.

Slide 11

Technical Aspects: Linking records and probabilistic matching

  • Techniques for record linkage
    • Unique identifiers
    • AI-like rule-based approaches
    • Probabilistic approaches
  • The probabilistic approach is built on five key components:
    1. Define features that describe similarity between records.
    2. Place feature vectors into three classes: matches (M), non-matches (U), and possible matches (P).
    3. Perform record-pair classification by calculating the ratio P(Y | M) / P(Y | U) for each pair, where Y is the feature vector for the pair and P(Y | M) and P(Y | U) are the probabilities of observing that feature vector for a matched pair and a non-matched pair, respectively.
    4. Where no duplicate and/or non-duplicate record pairs are available, estimate conditional probabilities by using observed frequencies in the records to be linked.
    5. "Blocking," or partitioning the databases based on some variable in both databases, improves efficiency.
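
The components listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure from the paper: the field names, the per-field probabilities in `M_PROB` and `U_PROB`, the two thresholds, and the sample records are all hypothetical, and fields are assumed to be conditionally independent so that the ratio factors field by field. In practice the probabilities would be estimated, for example from observed frequencies as in component 4.

```python
import math
from collections import defaultdict

# Hypothetical probabilities that a field agrees for a matched pair (M_PROB)
# and for a non-matched pair (U_PROB). Illustrative values, not estimates.
M_PROB = {"surname": 0.95, "birth_year": 0.98, "zip": 0.90}
U_PROB = {"surname": 0.01, "birth_year": 0.05, "zip": 0.02}

UPPER = 8.0  # hypothetical: log2 ratio at or above this is a match (M)
LOWER = 0.0  # hypothetical: log2 ratio at or below this is a non-match (U)


def feature_vector(a, b):
    """Component 1: binary agreement on each compared field."""
    return {f: a[f] == b[f] for f in M_PROB}


def log_ratio(y):
    """Component 3: log2 of P(Y | M) / P(Y | U), assuming independent fields."""
    total = 0.0
    for f, agrees in y.items():
        m, u = M_PROB[f], U_PROB[f]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight):
    """Component 2: assign the pair to M, U, or P (possible match)."""
    if weight >= UPPER:
        return "M"
    if weight <= LOWER:
        return "U"
    return "P"


def link(registry, claims, block_key="birth_year"):
    """Component 5: compare only pairs that share the blocking key."""
    blocks = defaultdict(list)
    for rec in claims:
        blocks[rec[block_key]].append(rec)
    results = []
    for a in registry:
        for b in blocks.get(a[block_key], []):
            w = log_ratio(feature_vector(a, b))
            results.append((a["id"], b["id"], round(w, 2), classify(w)))
    return results


# Fabricated illustrative records, not real data.
registry = [
    {"id": "r1", "surname": "lee", "birth_year": 1950, "zip": "27701"},
    {"id": "r2", "surname": "cho", "birth_year": 1962, "zip": "15213"},
]
claims = [
    {"id": "c1", "surname": "lee", "birth_year": 1950, "zip": "27701"},
    {"id": "c2", "surname": "kim", "birth_year": 1950, "zip": "27705"},
    {"id": "c3", "surname": "cho", "birth_year": 1962, "zip": "15217"},
]

pairs = link(registry, claims)
```

Blocking on `birth_year` means `r2` is never compared with `c1` or `c2`, which saves work but also means a typo in the blocking field would hide a true match; that trade-off is why the choice of blocking variable matters.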

Slide 12

Technical Aspects: Procedural issues in linking data sets

  • Neither "data" nor "link" can be defined unambiguously, and the relationship between datasets can vary.
  • Linking horizontally partitioned datasets carries little risk of re-identification, because in most cases there is no more information about a record in the combined dataset than was present in the individual datasets.
  • For vertically partitioned datasets, it is necessary to link individual subjects' records that are contained in two or more datasets. This process is risky because the combined dataset contains more information about each subject than either of the components.
    • Preferred approach: methods based on cryptography (complex and may involve a third party)
    • More common approach: remove identifiers and carry out statistical disclosure limitation prior to linkage (may introduce errors into the linked dataset that alter results of statistical analyses)
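
As a rough illustration of linking vertically partitioned datasets without exchanging direct identifiers, the sketch below replaces a shared identifier with a keyed hash (HMAC-SHA256) before the join. This is a deliberately simplified stand-in for the cryptography-based methods mentioned above, not the approach the paper recommends: the key `SECRET_KEY`, the field name `mrn`, and the records are hypothetical, and the scheme assumes both holders share an exact identifier and a secret key (possibly through a third party).

```python
import hashlib
import hmac

# Hypothetical shared secret, assumed to be held by the data holders
# (or a trusted third party) and never released with the data.
SECRET_KEY = b"example-key-not-for-production"


def pseudonym(identifier: str) -> str:
    """Return a keyed hash (HMAC-SHA256) of a normalized identifier."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def prepare(records, id_field):
    """Replace the direct identifier with its pseudonym before sharing."""
    out = {}
    for rec in records:
        attrs = {k: v for k, v in rec.items() if k != id_field}
        out[pseudonym(rec[id_field])] = attrs
    return out


def link_vertical(left, right):
    """Join attribute sets for subjects whose pseudonyms appear in both."""
    return {p: {**left[p], **right[p]} for p in left.keys() & right.keys()}


# Fabricated illustrative records, not real data.
registry = [
    {"mrn": "A-100", "diagnosis": "AMI"},
    {"mrn": "A-200", "diagnosis": "CHF"},
]
claims = [
    {"mrn": " a-100 ", "paid": 1200},
    {"mrn": "A-300", "paid": 80},
]

linked = link_vertical(prepare(registry, "mrn"), prepare(claims, "mrn"))
```

Hashing hides the identifier but does not change the underlying risk: each linked record still carries more information about a subject than either source did, so disclosure limitation on the combined attributes is still needed.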

Slide 13

Technical Aspects: Procedural issues in linking data sets (cont.)

  • Many linkage techniques depend on the presence of attributes in both databases that are unique to individuals but do not lead to re-identification.
  • Linkage can reduce data quality.
  • No matter how linkage is performed, other issues should be addressed:
    • Comparable attributes should be expressed in the same units of measure
    • Conflicting values of attributes for each individual common to both databases should be reconciled
    • Records that appear in only one database should be managed (most commonly they are dropped)
    • The effect of linkage on data quality should be considered
  • Some risks from data linkage cannot be removed. Strong consideration should be given to forms of data protection such as licensing and restricted access.
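
The housekeeping issues listed above (common units, conflicting values, records found in only one database) can be sketched as follows. Everything specific here is an assumption made for illustration: the weight attribute, the 1 kg tolerance, the averaging rule in `reconcile`, and the sample values are hypothetical and are not prescribed by the paper.

```python
LB_PER_KG = 2.20462


def to_kg(value, unit):
    """Express weights in one unit of measure (kilograms)."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown unit: {unit}")


def reconcile(a, b, tolerance_kg=1.0):
    """Hypothetical rule: average values that agree within a tolerance,
    otherwise flag the conflict for manual review."""
    if abs(a - b) <= tolerance_kg:
        return round((a + b) / 2, 1), "ok"
    return None, "conflict"


def merge(registry, claims):
    """Link on a shared key; report records found in only one database."""
    linked, dropped = {}, []
    for key in registry.keys() | claims.keys():
        if key in registry and key in claims:
            wa = to_kg(*registry[key])
            wb = to_kg(*claims[key])
            linked[key] = reconcile(wa, wb)
        else:
            dropped.append(key)  # record appears in only one database
    return linked, sorted(dropped)


# Fabricated illustrative values: key -> (weight, unit).
registry = {"p1": (80.0, "kg"), "p2": (70.0, "kg"), "p3": (65.0, "kg")}
claims = {"p1": (176.0, "lb"), "p2": (190.0, "lb"), "p4": (150.0, "lb")}

linked, dropped = merge(registry, claims)
```

Returning the dropped keys rather than silently discarding them makes it possible to check how much of each source was lost, which is one concrete way to consider the effect of linkage on data quality.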

Slide 14

Risk Mitigation: Methodology for mitigating the risk of re-identification

  • Basic methodology for statistical disclosure limitation
    • "Disclosure limiting masks" are transformations of the data where there is a specific functional relationship between masked values and original data.
    • Can be categorized as suppressions (e.g., cell suppression), re-codings (e.g., collapsing rows or columns, or swapping), or samplings (e.g., releasing subsets).
  • The Risk-Utility tradeoff
    • Risk of disclosure is balanced with the utility of the released data.
  • Privacy-preserving data mining methodologies
  • Cryptographic approaches to privacy protection
    • Differential privacy focuses on algorithmic aspects of the problem with an emphasis on automation and scalability of a process for conferring anonymity
    • Limits the information a data user might learn beyond that known before exposure to the released statistics
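
The masks and the differential privacy idea above can be made concrete with a short sketch. The helpers below are illustrative assumptions, not the paper's procedures: the age-band width, the suppression threshold of 5, and the sampling fraction are hypothetical parameters, and the Laplace mechanism is shown as one standard way to achieve differential privacy for a counting query (the slide does not name a mechanism).

```python
import random


def recode_age(age, width=10):
    """Re-coding: collapse an exact age into a band such as '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_small_cells(table, threshold=5):
    """Cell suppression: hide counts below a hypothetical threshold."""
    return {k: (v if v >= threshold else None) for k, v in table.items()}


def sample_records(records, fraction, rng):
    """Sampling: release only a random subset of the records."""
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, whose sensitivity is 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2009)  # fixed seed so the sketch is repeatable
masked_table = suppress_small_cells({"60-69": 12, "90-99": 2})
released_subset = sample_records(list(range(10)), 0.3, rng)
noisy_count = dp_count(100, epsilon=1.0, rng=rng)
```

The risk-utility tradeoff shows up directly in the parameters: wider bands, higher suppression thresholds, smaller samples, and smaller `epsilon` all lower disclosure risk while making the released data less useful for analysis.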

Slide 15

Risk Mitigation: Security practices, standards and technologies

  • Philosophies regarding the preservation of confidentiality associated with individual-level data:
    • Restricted or limited information, with restrictions on the amount or format of the data released
    • Restricted or limited access, with restrictions on the access to the information itself.
  • Accountability
    • Ensure that researchers are accountable for the use of datasets (e.g., best practices, unique logins, user authentication, audit trails)
  • Registries as data enclaves
    • "Research data centers" where users can access and use data in a regulated environment
  • Layered restricted access to databases
    • A form of layered restrictions that combines two approaches with differing levels of access at different levels of detail in the data
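
Layered restricted access combined with accountability can be sketched as a lookup that returns less detail to lower tiers and logs every request. The tier names, field lists, user names, and in-memory `AUDIT_LOG` below are hypothetical choices for illustration; a real enclave would rely on authenticated users and durable, tamper-resistant logging.

```python
from datetime import datetime, timezone

# Hypothetical access tiers: each tier sees the data at a different level of detail.
TIER_FIELDS = {
    "public": {"age_band", "region"},
    "licensed": {"age_band", "region", "diagnosis"},
    "enclave": {"age_band", "region", "diagnosis", "admit_date"},
}

AUDIT_LOG = []  # in-memory stand-in for an audit trail


def fetch(user, tier, record):
    """Return only the fields allowed for the tier and log the access."""
    if tier not in TIER_FIELDS:
        raise PermissionError(f"unknown tier: {tier}")
    AUDIT_LOG.append({
        "user": user,
        "tier": tier,
        "record_id": record["id"],
        "time": datetime.now(timezone.utc).isoformat(),
    })
    return {k: v for k, v in record.items() if k in TIER_FIELDS[tier]}


# Fabricated illustrative record, not real data.
record = {
    "id": 7,
    "age_band": "60-69",
    "region": "South",
    "diagnosis": "AMI",
    "admit_date": "2009-03-02",
}

public_view = fetch("analyst01", "public", record)
```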

Slide 16

Legal Considerations

  • Critical starting point: nature of the research undertaking
    • Health care operations?
      • HIPAA Privacy and Security Rules
      • Health care quality related activities
    • Public health practice?
    • Research within meaning of Common Rule?
      • Creation of generalizable knowledge
    • Some combination of the three?

Slide 17

Do HIPAA Privacy and Security Rules Apply? And if so, What are the Issues?

  • Are the data PHI, and is the source a covered entity? If so, then HIPAA privacy and security standards apply.
    • Is the data source a covered entity?
      • (ARRA expands coverage to include business associates)
    • De-identification and re-identification of data
    • Data use agreements for limited data sets
    • Security obligations for ePHI

Slide 18

Do the Data and Data Source Raise Other Legal Obligations?

  • Do patients and the custodial institutions from whom the data are secured have other legal rights and interests that create legal obligations?
    • E.g., more stringent state privacy laws
    • Were confidentiality expectations created?
    • Institutional privacy expectations
    • Special federal or state standards applicable to substance abuse or mental illness information

Slide 19

Summary

  • This white paper describes technical and legal considerations for researchers interested in creating data linkage projects involving registry data, and presents typical linkage methods. It also discusses both the hazards of re-identification created by data linkage projects and the statistical methods used to minimize the risk of re-identification.
  • Some limitations of this discussion are the exclusion of:
    • Considerations about linking data from public and private sectors, where different ethical and legal restrictions may apply, and
    • Detailed information about the risks involved with identifying the health care providers that collect and provide data.
  • Dataset linkage entails two risks: loss of reliable confidential data management, and identification or re-identification of individuals and institutions. Recognized and developing statistical methods and secure computation may limit these risks and may allow the potential public health benefits of registries linked to other datasets to be realized.
Current as of December 2009
Internet Citation: Linking Registry Data: Technical and Legal Considerations (Text Version). December 2009. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/news/events/conference/2009/karr-rosenbaum/index.html