Slide Presentation from the AHRQ 2009 Annual Conference
On September 14, 2009, Sara Rosenbaum and Alan Karr made this presentation at the 2009 Annual Conference.
Linking Registry Data: Technical and Legal Considerations
Sara Rosenbaum, JD
George Washington University
Alan F. Karr, PhD
National Institute of Statistical Sciences
Authors and Reviewers
- Stephen E. Fienberg (lead)
- Carnegie Mellon University
- Julia Lane
- National Science Foundation
- Sara Rosenbaum (lead)
- George Washington University
- Eric Peterson
- Duke University Medical Center
- Susan Adams (lead)
- Dartmouth College
- Victoria Prescott
- McBroom Consulting, LLC
- Alan F. Karr
- National Institute of Statistical Sciences
- Gerald Riley
- Centers for Medicare & Medicaid Services
- Bradley Malin
- Vanderbilt University
- Marcy Wilder
- Hogan & Hartson
- Deven McGraw
- The Center for Democracy and Technology
- Maya A. Bernstein
- Office of the Assistant Secretary for Planning and Evaluation, DHHS
- Melissa M. Goldstein
- George Washington University
- Joy Pritts
- Georgetown University
- Andy DeMayo
Increasingly, statistical methods are used to link data from multiple de-identified sources.
- What is the risk of identifying patients by combining data from multiple registries?
- What are the legal and ethical requirements on researchers to ensure patient privacy and confidentiality?
B.) TECHNICAL ASPECTS OF DATA LINKAGE PROJECTS
- Linking records for research and improving public health
- What do Privacy, Disclosure, and Confidentiality mean?
- Linking records and probabilistic matching
- Procedural issues in linking datasets
C.) LEGAL ASPECTS OF DATA LINKAGE PROJECTS
- Risks of identification
- The HIPAA Privacy Rule
Paper Overview (cont.)
D.) RISK MITIGATION FOR DATA LINKAGE PROJECTS
- Methodology for mitigating the risk of re-identification
- Security practices, standards, and technologies
- Linking Clinical Registry Data with Insurance Claims Files
- Planning for Data Linkage Projects
High-Level View: Linking records for research and improving public health
- The scientific value of a registry increases with the number of cases and the extent of the health information included.
- There is an ethical obligation to protect patient interests when collecting, sharing, and studying person-specific biomedical information.
- Thus, a tension exists between the broad goals of registries and regulations protecting individually identifiable information.
- A large body of federal law applies to health information privacy.
- Privacy: protection of people against unauthorized uses of personally identifiable information (PII), specifically protected health information (PHI)
- Disclosure: attribution of information to the source of the data
- Confidentiality: protection accorded to statistical data
Technical Aspects: Privacy, Disclosure and Confidentiality
Privacy
- As used in the HIPAA Privacy Rule, the term applies to protected health information (PHI).
Technical Aspects: Privacy, Disclosure and Confidentiality (cont.)
- Technical: the attribution of information to the source of the data.
- Identity disclosure occurs when the data source becomes known from the data release itself
- Attribute disclosure occurs when the released data make it possible to infer the characteristics of an individual data source more accurately than would have otherwise been possible
- Inferential disclosure relates to the probability of identifying a particular attribute of a data source.
- HIPAA: the release, transfer, provision of, access to, or divulging in any other manner of information outside of the entity holding the information.
Technical Aspects: Privacy, Disclosure and Confidentiality (cont.)
- A quality or condition of protection accorded to statistical information as an obligation not to permit the transfer of that information to an unauthorized party.
- A different notion of confidentiality relates to the ethical, legal, and professional obligation of those who receive information in the context of a clinical relationship to respect the privacy interests of their patients.
Technical Aspects: Linking records and probabilistic matching
Techniques for record linkage
- Unique identifiers
- AI-like rule
- Probabilistic approaches
- The probabilistic approach is built on five key components:
- Define features that describe similarity between records.
- Place feature vectors into three classes: matches (M), non-matches (U), and possible matches (P).
- Perform record-pair classification by calculating the ratio P(Y | M) / P(Y | U) for each pair, where Y is the feature vector for the pair, and P(Y | M) and P(Y | U) are the probabilities of observing that feature vector for a matched pair and a non-matched pair, respectively.
- Where no duplicate and/or non-duplicate record pairs are available, estimate conditional probabilities by using observed frequencies in the records to be linked.
- "Blocking," or partitioning the databases based on some variable in both databases, improves efficiency.
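The components listed above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the field names, the agreement probabilities in M_PROB and U_PROB, the two thresholds, and the sample records are all hypothetical, and the fields are assumed to agree independently of one another.

```python
import math
from collections import defaultdict

# Hypothetical probabilities that a field agrees, given a true match (M)
# or a true non-match (U). Illustrative values only.
M_PROB = {"surname": 0.95, "birth_year": 0.98, "zip": 0.90}
U_PROB = {"surname": 0.01, "birth_year": 0.05, "zip": 0.02}


def feature_vector(a, b):
    """Binary agreement vector Y describing the similarity of a record pair."""
    return {field: a[field] == b[field] for field in M_PROB}


def log_likelihood_ratio(y):
    """log[P(Y | M) / P(Y | U)], assuming fields agree independently."""
    total = 0.0
    for field, agrees in y.items():
        m, u = M_PROB[field], U_PROB[field]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total


def classify(score, upper=5.0, lower=0.0):
    """Assign a pair to matches (M), non-matches (U), or possible matches (P)."""
    if score >= upper:
        return "M"
    if score <= lower:
        return "U"
    return "P"


def link(registry, claims, block_key="birth_year"):
    """Compare only pairs that share the blocking variable."""
    blocks = defaultdict(list)
    for rec in claims:
        blocks[rec[block_key]].append(rec)
    results = []
    for a in registry:
        for b in blocks.get(a[block_key], []):
            score = log_likelihood_ratio(feature_vector(a, b))
            results.append((a["id"], b["id"], classify(score)))
    return results


registry = [
    {"id": "r1", "surname": "lee", "birth_year": 1950, "zip": "27701"},
    {"id": "r2", "surname": "diaz", "birth_year": 1962, "zip": "15213"},
]
claims = [
    {"id": "c1", "surname": "lee", "birth_year": 1950, "zip": "27701"},
    {"id": "c2", "surname": "chan", "birth_year": 1950, "zip": "27701"},
    {"id": "c3", "surname": "ross", "birth_year": 1962, "zip": "20052"},
]
pairs = link(registry, claims)
```

Blocking on birth_year means the sketch scores three pairs rather than all six. In practice the agreement probabilities would be estimated from the data rather than fixed by hand, as the fourth component describes.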
Technical Aspects: Procedural issues in linking data sets
- Neither "data" nor "link" can be defined unambiguously, and the relationship between datasets can vary.
- Linking horizontally partitioned datasets carries little risk of re-identification, because in most cases there is no more information about a record on the combined dataset than was present in the individual datasets.
- For vertically partitioned datasets, it is necessary to link individual subjects' records that are contained in two or more datasets. This process is risky because the combined dataset contains more information about each subject than either of the components.
- Preferred approach: methods based on cryptography (complex and may involve a third party)
- More common approach: remove identifiers and carry out statistical disclosure limitation prior to linkage (may introduce errors into the linked dataset that alter results of statistical analyses)
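As a rough illustration of linking vertically partitioned datasets without exchanging direct identifiers, the sketch below replaces identifiers with a keyed hash (HMAC-SHA256) before the join. This is a simplified stand-in for the cryptographic methods mentioned above, not the protocol the authors have in mind: the field names, the shared secret key, and the sample records are hypothetical, and the approach supports exact matching only.

```python
import hashlib
import hmac


def linkage_token(secret_key: bytes, surname: str, birth_date: str) -> str:
    """Derive a keyed hash from normalized identifiers (hypothetical fields)."""
    normalized = f"{surname.strip().lower()}|{birth_date.strip()}"
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records, secret_key):
    """Replace direct identifiers with a linkage token; keep other attributes."""
    out = {}
    for rec in records:
        token = linkage_token(secret_key, rec["surname"], rec["birth_date"])
        out[token] = {k: v for k, v in rec.items() if k not in ("surname", "birth_date")}
    return out


def link_vertical(left, right):
    """Join two tokenized datasets on the tokens they share."""
    return {t: {**left[t], **right[t]} for t in left.keys() & right.keys()}


key = b"hypothetical-shared-secret"  # assumed to be held by the data custodians only
registry = [
    {"surname": "Lee ", "birth_date": "1950-03-02", "stage": "II"},
    {"surname": "Diaz", "birth_date": "1962-11-30", "stage": "I"},
]
claims = [{"surname": "lee", "birth_date": "1950-03-02", "paid": 1200}]

linked = link_vertical(tokenize(registry, key), tokenize(claims, key))
```

The linked record carries more attributes (stage and paid) than either source record did, which is the added re-identification risk described above for vertically partitioned data.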
Technical Aspects: Procedural issues in linking data sets (cont.)
- Many linkage techniques depend on the presence of attributes in both databases that are unique to individuals but do not lead to re-identification.
- Linkage can reduce data quality.
- No matter how linkage is performed, other issues should be addressed:
- Comparable attributes should be expressed in the same units of measure
- Conflicting values of attributes for each individual common to both databases should be reconciled
- Records that appear in only one database should be managed explicitly (most commonly they are dropped)
- The effect of linkage on data quality should be considered
- Some risks from data linkage cannot be removed. Strong consideration should be given to forms of data protection such as licensing and restricted access.
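The housekeeping issues listed above (common units, conflicting values, and records found in only one database) can be made concrete with a small sketch. Everything here is an assumption for illustration: the weight attribute, the 1 kg tolerance, the choice to average values that agree, and the sample data.

```python
LB_PER_KG = 2.20462


def to_kg(value, unit):
    """Express a weight in kilograms so both databases use the same unit."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown unit: {unit}")


def merge_databases(db_a, db_b, tolerance_kg=1.0):
    """Return (linked, conflicts, unmatched) for two {id: (weight, unit)} maps."""
    linked, conflicts, unmatched = {}, [], []
    for pid in sorted(db_a.keys() | db_b.keys()):
        if pid not in db_a or pid not in db_b:
            unmatched.append(pid)  # present in only one database: dropped here
            continue
        weight_a = to_kg(*db_a[pid])
        weight_b = to_kg(*db_b[pid])
        if abs(weight_a - weight_b) > tolerance_kg:
            conflicts.append(pid)  # left for manual reconciliation
            continue
        linked[pid] = round((weight_a + weight_b) / 2, 1)
    return linked, conflicts, unmatched


db_a = {"p1": (70.0, "kg"), "p2": (80.0, "kg"), "p3": (65.0, "kg")}
db_b = {"p1": (154.5, "lb"), "p2": (200.0, "lb"), "p4": (150.0, "lb")}
linked, conflicts, unmatched = merge_databases(db_a, db_b)
```

Reporting the conflicts and unmatched lists alongside the linked records gives one simple way to gauge how much linkage has affected data quality.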
Risk Mitigation: Methodology for mitigating the risk of re-identification
- Basic methodology for statistical disclosure limitation
- "Disclosure limiting masks" are transformations of the data where there is a specific functional relationship between masked values and original data.
- Can be categorized as suppressions (e.g., cell suppression), re-codings (e.g., collapsing rows or columns, or swapping), or samplings (e.g., releasing subsets).
- The Risk-Utility tradeoff
- Risk of disclosure is balanced with the utility of the released data.
- Privacy-preserving data mining methodologies
- Cryptographic approaches to privacy protection
- Differential privacy focuses on algorithmic aspects of the problem with an emphasis on automation and scalability of a process for conferring anonymity
- Limits the information a data user might learn beyond that known before exposure to the released statistics
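Two of the disclosure limiting masks named above, re-coding and cell suppression, can be sketched as follows. This is a minimal example under stated assumptions: the 10-year age bands, the suppression threshold of 3, the field names, and the sample records are hypothetical, and only primary suppression is shown.

```python
from collections import Counter


def recode_age(age, width=10):
    """Re-coding mask: collapse an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def masked_table(records, threshold=3):
    """Count records per (age band, diagnosis) cell and suppress small cells."""
    counts = Counter((recode_age(r["age"]), r["dx"]) for r in records)
    # Suppression mask: cells below the threshold are released as None.
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


records = [
    {"age": 41, "dx": "A"}, {"age": 43, "dx": "A"},
    {"age": 47, "dx": "A"}, {"age": 48, "dx": "A"},
    {"age": 52, "dx": "A"},
    {"age": 55, "dx": "B"}, {"age": 56, "dx": "B"}, {"age": 59, "dx": "B"},
]
table = masked_table(records)
```

Raising the threshold or widening the age bands lowers disclosure risk but also removes detail from the released table, which is the risk-utility tradeoff noted above. If row or column totals were also released, further (complementary) suppression would be needed so that a suppressed cell could not be recovered by subtraction.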
Risk Mitigation: Security practices, standards and technologies
- Philosophies regarding the preservation of confidentiality associated with individual-level data:
- Restricted or limited information, with restrictions on the amount or format of the data released
- Restricted or limited access, with restrictions on the access to the information itself.
- Ensure that researchers are accountable for the use of datasets (e.g., best practices, unique logins, user authentication, audit trails)
- Registries as data enclaves
- "Research data centers" where users can access and use data in a regulated environment
- Layered restricted access to databases
- A form of layered restrictions that combines two approaches with differing levels of access at different levels of detail in the data
- Critical starting point: nature of the research undertaking
- Health care operations?
- HIPAA Privacy and Security Rules
- Health care quality related activities
- Public health practice?
- Research within meaning of Common Rule?
- Creation of general knowledge
- Some combination of the three?
Do HIPAA Privacy and Security Rules Apply? And if so, What are the Issues?
Are the data PHI, and is the source a covered entity? If so, then HIPAA privacy and security standards apply.
Is the data source a covered entity?
- (ARRA expands to include business associates)
- De-identification and re-identification of data
- Data use agreements for limited data sets
- Security obligations for ePHI
Do the Data and Data Source Raise Other Legal Obligations?
Do patients and the custodial institutions from whom the data are secured have other legal rights and interests that create legal obligations?
- E.g., more stringent state privacy laws
- Were confidentiality expectations created?
- Institutional privacy expectations
- Special federal or state standards applicable to substance abuse or mental illness information
- This white paper describes technical and legal considerations for researchers interested in creating data linkage projects involving registry data, and presents typical linkage methods. It also discusses the hazards of re-identification created by data linkage projects and the statistical methods used to minimize the risk of re-identification.
- Some limitations of this discussion are the exclusion of:
- Considerations about linking data from public and private sectors, where different ethical and legal restrictions may apply, and
- Detailed information about the risks involved with identifying the health care providers that collect and provide data.
- Dataset linkage entails two risks: loss of reliable, confidential data management, and identification or re-identification of individuals and institutions. Established and emerging statistical methods and secure computation may limit these risks and may allow registries linked to other datasets to deliver their potential public health benefits.
Current as of December 2009
Linking Registry Data: Technical and Legal Considerations. Slide Presentation from the AHRQ 2009 Annual Conference (Text Version). December 2009. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/about/annualconf09/karr_rosenbaum.htm