The NIH RFI on genomic data sharing — Protecting genomic data

Mar 2, 2022

This is the third in a four-part series describing some of the key aspects of the NIH RFI on genomic data sharing.

Questions/answers regarding the NIH RFI on genomic data sharing

The NIH genomic sharing policy RFI includes a number of requests regarding de-identification, consent, linking and sharing data, harmonizing the policies with data management requirements, and how (and to which grants) these policies should apply. In this post, I am going to focus on questions regarding de-identification, sharing and linking, practical ways of obtaining consent, and ways to use the peer-review process to improve and enhance the methods used to share, link, and protect genomic data.

Can genomic information just be redacted to protect an individual’s privacy?

The challenge is that full redaction of genomic information, either by removing it or by abstracting it to something less specific, can often render the data useless for analysis. Genomic information is highly specific and often unique to an individual, but in isolation the risk of re-identifying a sequence is low. The risk grows when data is pooled and linked, so care must be taken when combining datasets that contain genetic information.

Can we use the safe harbor process to de-identify genomic data?

Given the unique characteristics of genomic data, and the complexity and rapidly evolving nature of genomic data sharing, expert determination may be the only option for de-identification of genomic information. The 2013 Omnibus Rule did not add genetic information to the 18 direct identifiers defined by HIPAA, but redacting those 18 identifiers under Safe Harbor is not sufficient to classify a dataset containing genetic information as de-identified.

What is expert determination, and how can we improve the process?

Applying expert determination to assess the disclosure risk of genomic data has the benefit of being tailored to individual datasets, with a bespoke handling of the trade-offs between utility and privacy. To be effective, expert determination requires robust quantitative estimates of the risk contained within a dataset. Such an assessment is based on the distributions of potentially identifying values within the data and on how those distributions intersect, as compared to reference data. To support expert determination, the NIH should fund additional research on mutant allele frequencies, non-coding DNA mutations, sequence variability in certain regions of the genome, the frequency of silent mutations, chromosome phenotypes, individual SNPs and combinations of polymorphisms, and other genetic and genomic characteristics. Without the underpinning of these baseline assessments, expert determination will be either too conservative or too permissive in the evaluation of disclosure risk.

For example, expert determination may be the only way to assess the risk of re-identification from inherited mutations (such as those in the BRCA1 or CFTR genes). A common variant such as CFTR ΔF508 (F508del) may not create a risk of re-identification (it puts a cystic fibrosis patient in a group of about 30,000), while N-terminal missense mutations such as c.14C>T are very rare and potentially identifiable. A more nuanced approach to redaction and de-identification is only possible with expert determination.
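As a rough illustration of why frequency matters, the sketch below estimates how many people would be expected to share a variant (or a combination of variants) and treats one over that group size as a crude measure of re-identification risk. The population size, the frequencies, and the assumption that variants occur independently are all hypothetical and chosen only for illustration; a real expert determination would rely on measured reference data.

```python
from math import prod
from typing import Iterable

def expected_group_size(population: int, carrier_frequencies: Iterable[float]) -> float:
    """Expected number of people carrying every listed variant.

    Assumes the variants occur independently, which is a simplification.
    """
    return population * prod(carrier_frequencies)

def reidentification_risk(group_size: float) -> float:
    """Crude risk estimate: chance of picking the right person from the group."""
    # A group smaller than one person is treated as a single, unique individual.
    return 1.0 / max(group_size, 1.0)

# Hypothetical numbers, for illustration only.
POPULATION = 300_000_000
common = expected_group_size(POPULATION, [1e-4])          # one relatively common variant
rare = expected_group_size(POPULATION, [1e-8])            # one very rare variant
combined = expected_group_size(POPULATION, [1e-4, 1e-3])  # two variants seen together

for label, size in [("common", common), ("rare", rare), ("combined", combined)]:
    print(f"{label}: group of ~{size:,.0f}, risk ~{reidentification_risk(size):.2e}")
```

Even two variants that are individually unremarkable can, in combination, place a person in a much smaller group, which is why combinations of polymorphisms need to be assessed alongside single variants.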

In addition, we should not assume that two de-identified data sets, when combined, will remain de-identified. Expert determination should be used to evaluate the risk of linked and combined datasets, allowing for more thoughtful approaches to protecting patient information.
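One way to see why is to count how many records share the same combination of attributes before and after linking. The sketch below uses made-up records and hypothetical field names (`person`, `zip3`, `birth_year`, `variant`). Each dataset on its own leaves every record in a group of at least two, but the linked data isolates individuals.

```python
from collections import Counter

def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for `fields`."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values())

# Made-up data; `person` stands in for a privacy-preserving link key.
demographics = [
    {"person": 1, "zip3": "021", "birth_year": 1980},
    {"person": 2, "zip3": "021", "birth_year": 1980},
    {"person": 3, "zip3": "946", "birth_year": 1975},
    {"person": 4, "zip3": "946", "birth_year": 1975},
]
genomics = [
    {"person": 1, "variant": "A"},
    {"person": 2, "variant": "B"},
    {"person": 3, "variant": "A"},
    {"person": 4, "variant": "B"},
]

# Link the two datasets on the shared key.
by_person = {g["person"]: g for g in genomics}
linked = [{**d, **by_person[d["person"]]} for d in demographics]

k_demographics = smallest_group(demographics, ["zip3", "birth_year"])
k_genomics = smallest_group(genomics, ["variant"])
k_linked = smallest_group(linked, ["zip3", "birth_year", "variant"])
print(k_demographics, k_genomics, k_linked)
```

The smallest group size is only one simple indicator; an expert determination would look at the full distribution and at external reference data as well.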

Remediation could include abstraction (substituting “the existence of a mutation” for the actual mutation), redaction (removing the sequence) or requiring additional security and access controls (for example, not allowing the data to be downloaded or removed from a data enclave). This would create more flexible ways of managing and sharing genomic information for research purposes, while always considering the risks of re-identification.
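A minimal sketch of how these three remediation options might be applied to a single record follows. The record layout, the field names, and the `Remediation` options are hypothetical and only illustrate the idea; they do not describe an existing tool.

```python
from enum import Enum

class Remediation(Enum):
    ABSTRACT = "abstract"      # keep only the fact that a mutation exists
    REDACT = "redact"          # remove the sequence entirely
    ENCLAVE_ONLY = "enclave"   # keep the data but restrict where it can be used

def remediate(record: dict, action: Remediation) -> dict:
    """Return a copy of `record` with the chosen remediation applied."""
    result = dict(record)
    if action is Remediation.ABSTRACT:
        result["variant"] = "mutation present" if record.get("variant") else "no mutation reported"
    elif action is Remediation.REDACT:
        result.pop("variant", None)
    elif action is Remediation.ENCLAVE_ONLY:
        result["download_allowed"] = False
    return result

# Hypothetical record for illustration.
record = {"gene": "CFTR", "variant": "c.14C>T", "download_allowed": True}
for action in Remediation:
    print(action.name, remediate(record, action))
```

Which option is appropriate for a given record would be the outcome of the expert determination, not something the code decides on its own.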

What factors should be considered when using expert determination for genomic information?

Expert determination provides the only nuanced approach to managing genomic information, and can take into consideration key attributes of the genomic data to be shared:

whether the sequence is from a tumor or from normal tissue, with tumor sequences at a lower risk for re-identification than inherited (germline) sequences
the frequency of specific mutations, with rare or low-frequency mutations at a higher risk for re-identification
the length of the sequence, with shorter sequences at a lower risk for re-identification
the comprehensiveness of the dataset, with more comprehensive information about an individual at a higher risk for re-identification.

Does HIPAA prevent sharing of genomic information based on potential risk for relatives?

Although HIPAA does not treat heritability as a limit on an individual's ability to share their own data, as more genomic data becomes available for analysis it will be important to keep monitoring the re-identification risk and the potential harm to groups and families. This, along with the factors listed above, should be considered when determining the risk of re-identification in sharing genomic information.

Are there other ways to protect privacy and link together genomic data for analysis?

Every effort should be made to use new approaches that link data while preserving patient privacy. Approaches such as privacy-preserving record linkage (PPRL) and honest broker governance structures are already being used by the NIH to support research into Covid-19. They provide a way to link patient data without exposing PII, and they do so within a governance framework in which a trusted intermediary manages the data and prevents potentially identifiable data from being shared. These approaches could also be used for genomic information and, when combined with expert determination of the original and linked datasets, can help assure that the re-identification risk is low.
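To make the idea concrete, here is a minimal sketch of one common way to build a linkage token: normalize a few identifying fields and pass them through a keyed hash, so that sites holding the same secret key produce the same token for the same person without exchanging the underlying PII. The choice of fields, the normalization, and the key handling are assumptions for illustration; production PPRL systems use their own token designs and key management.

```python
import hashlib
import hmac

def make_token(first_name: str, last_name: str, birth_date: str, secret_key: bytes) -> str:
    """Build a linkage token from normalized identifiers using HMAC-SHA256."""
    # Normalize so trivial formatting differences do not change the token.
    normalized = "|".join(part.strip().lower() for part in (first_name, last_name, birth_date))
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key shared by participating sites; never hard-code a real key.
KEY = b"example-shared-secret"

token_site_a = make_token("Ada", "Lovelace", "1815-12-10", KEY)
token_site_b = make_token(" ada ", "LOVELACE", "1815-12-10", KEY)
token_other = make_token("Grace", "Hopper", "1906-12-09", KEY)

print(token_site_a == token_site_b)  # same person at two sites
print(token_site_a == token_other)   # two different people
```

A keyed hash is used rather than a plain hash so that someone without the key cannot rebuild tokens from a list of known names and birth dates.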

Can restricted enclaves or repositories help to protect patient information?

Genomic data can and should be shared within repositories so that other researchers can build on that research. As described above, expert determination (with assessment of mutation frequency, tumor vs. non-tumor source, sequence length, etc.) can determine the risk of data submitted to a repository. Data deemed potentially identifiable should be treated as full PII, and maintained in repositories and enclaves that control the use of that data. Restricting downloads and requiring work to be done within highly secure enclaves can reduce the chance that data is removed from the repository and potentially re-identified.

Even data that is deemed low risk should be treated cautiously, and every effort should be made to reduce the likelihood that genomic data is linked to other datasets in ways that make re-identification possible. Data can be linked inside enclaves (without allowing identifiable data to be removed from the enclave), and PPRL tokens (or other privacy-preserving linkage technology) can be used to link data without requiring identifiable data to be shared.

Is there any recent experience in using PPRL techniques to link data together while protecting a patient’s privacy?

The recent experience with the National Covid Cohort Collaborative (N3C) suggests that using honest broker intermediaries can support linking data in ways that protect patient privacy. For example, tokens (without identifiable information) may be allowed to leave a repository enclave, and if linkable data is found, that data can be moved into the enclave, where it benefits from stronger security and access controls. Honest brokers, coupled with enhanced IRB review and appropriate security and access control processes, can minimize the risk of re-identification while providing more value to the research community.
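The flow described above can be sketched as follows. The honest broker sees only tokens and reports which ones appear at both sites; only the records behind matching tokens are then moved into the enclave. All names here (`broker_match`, the record dictionaries, the token strings) are hypothetical, and this is not a description of how N3C is implemented.

```python
def broker_match(tokens_a: set, tokens_b: set) -> set:
    """Honest broker step: compare tokens only, never the underlying records."""
    return tokens_a & tokens_b

# Each site keeps its records keyed by token; the tokens carry no identifiable information.
enclave_records = {"tok-1": {"variant": "A"}, "tok-2": {"variant": "B"}}
external_records = {"tok-2": {"diagnosis": "X"}, "tok-3": {"diagnosis": "Y"}}

# Only the token sets leave each site.
matches = broker_match(set(enclave_records), set(external_records))

# Matching external data is moved into the enclave and linked there.
linked_in_enclave = {
    token: {**enclave_records[token], **external_records[token]} for token in matches
}
print(linked_in_enclave)
```

Records without a match (`tok-1` and `tok-3` in this made-up example) never leave the site that holds them.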

How do we create a process that continually improves the way in which we protect (and share) genomic data?

For investigators who use genomic data, the data sharing plan submitted with a grant application should be a scored element. This would increase the attention paid to data sharing plans, incentivize novel ways to share data, and put the importance of safely sharing genomic information front and center for research investigators. Over time, researchers would develop new and better ways to share and link data, and have the success of those approaches evaluated through peer review. The NIH would send an important message to the research community: that it will only invest in research that makes data protection and data sharing a priority.

Next: The NIH RFI on genomic data sharing – Consent and genomic data sharing and linking
Previous: The NIH RFI on genomic data sharing — Historical context